fly51fly

@fly51fly

6,355 Followers · 2,167 Following · 9,352 Media · 23,880 Statuses

BUPT prof | Sharing latest AI papers & insights | Join me in embracing the AI revolution! #MachineLearning #AI #Innovation

Joined February 2009
@fly51fly
fly51fly
9 months
[LG] A Mathematical Perspective on Transformers: This paper presents a mathematical framework for analyzing Transformers, focusing on their interpretation as interacting particle systems. The study reveals
Tweet media one
0
50
291
@fly51fly
fly51fly
9 months
[LG] Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks This paper provides a mathematical explanation for the emergence of Fourier features in learning systems, such as neural networks. It demonstrates that if a neural
Tweet media one
3
47
227
@fly51fly
fly51fly
5 months
[CL] Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing Y Tian, B Peng, L Song, L Jin, D Yu, H Mi, D Yu [Tencent AI Lab] (2024) - LLMs still struggle with complex reasoning and planning tasks. Advanced prompting and fine-tuning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
35
206
@fly51fly
fly51fly
7 months
[LG] How Transformers Learn Causal Structure with Gradient Descent E Nichani, A Damian, J D. Lee [Princeton University] (2024) - The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
49
173
@fly51fly
fly51fly
2 months
[LG] Mixture of A Million Experts X O He [Google DeepMind] (2024) - The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. - Sparse
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
56
159
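A rough sketch of the idea behind the tweet above, in Python: a very large pool of tiny experts (here each expert is a single hidden unit with its own input and output vector), of which only the top-k are retrieved per token. The real PEER layer uses product-key retrieval to make the lookup sub-linear in the number of experts; scoring every key directly, as below, is only feasible for a small pool, and all names and sizes here are illustrative.

# Hedged sketch of a PEER-style layer: a huge pool of tiny ("single-neuron")
# experts, of which only the top-k are retrieved per token. The real method
# uses product-key retrieval to make the search sub-linear; this toy version
# scores all keys directly, which is only feasible for a small pool.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 1024, 8

keys  = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # retrieval keys
w_in  = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # expert "down" vectors
w_out = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # expert "up" vectors

def peer_layer(x):
    """x: (d_model,) token activation -> (d_model,) output."""
    scores = keys @ x                                # one score per expert
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of the top-k experts
    gates = np.exp(scores[idx] - scores[idx].max())
    gates /= gates.sum()                             # softmax over the selected experts
    # each selected expert is a single hidden unit: relu(w_in_i . x) * w_out_i
    hidden = np.maximum(w_in[idx] @ x, 0.0)
    return (gates * hidden) @ w_out[idx]             # weighted sum of tiny expert outputs

print(peer_layer(rng.standard_normal(d_model)).shape)  # (64,)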
@fly51fly
fly51fly
9 months
[LG] A Mathematical Guide to Operator Learning Operator learning aims to discover properties of an underlying dynamical system or partial differential equation (PDE) from data. Here, we present a step-by-step guide to operator learning. We explain the
Tweet media one
4
37
150
@fly51fly
fly51fly
6 months
[CL] Tokenization Is More Than Compression The paper challenges a widely accepted notion in Natural Language Processing (NLP) that fewer tokens result in better downstream task performance. The authors introduce a new tokenizer, PathPiece, designed to
Tweet media one
Tweet media two
1
38
148
@fly51fly
fly51fly
15 days
[IR] Meta Knowledge for Retrieval Augmented Large Language Models L Mombaerts, T Ding, A Banerjee, F Felice... [Amazon Web Services] (2024) - The paper introduces a novel data-centric workflow for retrieval augmented large language models (RAG),
Tweet media one
Tweet media two
Tweet media three
0
34
146
@fly51fly
fly51fly
4 months
[LG] Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory X Niu, B Bai, L Deng, W Han [Huawei Technologies Co., Ltd] (2024) - Increasing model size does not always improve performance. The scaling laws do not explain this
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
29
125
@fly51fly
fly51fly
2 months
[CL] A Closer Look into Mixture-of-Experts in Large Language Models - Experts act like fine-grained neurons. The gate embedding determines expert selection while the gate projection matrix controls neuron activation. Their heatmaps are correlated,
Tweet media one
Tweet media two
Tweet media three
3
38
127
@fly51fly
fly51fly
6 months
[LG] Fine-tuning with Very Large Dropout J Zhang, L Bottou [New York University & Facebook AI Research] (2024) - It is common to pre-train models on large datasets and then fine-tune them on smaller datasets for specific tasks. This violates the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
34
120
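A minimal sketch of the recipe named in the tweet above: fine-tune on the small downstream task while applying an unusually large dropout rate (e.g. 0.9) to the penultimate representation. The backbone, dropout rate and data below are placeholders, not the paper's setup.

# Minimal sketch (PyTorch): fine-tune a classifier head while applying a very
# large dropout rate to the penultimate representation.
import torch, torch.nn as nn, torch.nn.functional as F

backbone = nn.Sequential(nn.Linear(32, 128), nn.ReLU())  # stand-in for a pretrained feature extractor
head = nn.Linear(128, 10)
very_large_dropout = nn.Dropout(p=0.9)                   # much larger than the usual 0.1-0.5

opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=1e-2)
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))  # toy fine-tuning batch

for _ in range(5):
    feats = backbone(x)
    logits = head(very_large_dropout(feats))             # dropout on penultimate features only
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()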
@fly51fly
fly51fly
1 year
[LG] ExpeL: LLM Agents Are Experiential Learners A Zhao, D Huang, Q Xu, M Lin, Y Liu, G Huang [Tsinghua University] (2023) - ExpeL is an autonomous agent that learns from experiences without parameter updates, compatible with proprietary LLMs. - During
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
33
116
@fly51fly
fly51fly
6 months
[CL] Understanding Emergent Abilities of Language Models from the Loss Perspective Z Du, A Zeng, Y Dong, J Tang [Zhipu AI & Tsinghua University] (2024) - Recent studies question if emergent abilities in LMs are exclusive to large models, as smaller models
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
117
@fly51fly
fly51fly
6 months
[LG] Mechanics of Next Token Prediction with Self-Attention Y Li, Y Huang, M. E Ildiz, A S Rawat, S Oymak [University of Michigan & Google Research NYC] (2024) - Self-attention learns to retrieve high-priority tokens from the input sequence and creates a
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
34
114
@fly51fly
fly51fly
2 years
[LG] Training trajectories, mini-batch losses and the curious role of the learning rate M Sandler, A Zhmoginov, M Vladymyrov, N Miller [Google Research] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
113
@fly51fly
fly51fly
5 months
[LG] An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization - Diffusion models have achieved tremendous success in generative modeling and sample generation in computer vision, audio, reinforcement
Tweet media one
Tweet media two
Tweet media three
1
25
114
@fly51fly
fly51fly
2 years
[LG] Riemannian Flow Matching on General Geometries R T. Q. Chen, Y Lipman [Meta AI] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
18
106
@fly51fly
fly51fly
4 months
[LG] DPO Meets PPO: Reinforced Token Optimization for RLHF - This paper models RLHF as an MDP, offering a token-wise characterization of LLM's generation process. It theoretically demonstrates advantages of token-wise MDP over sentence-wise bandit
Tweet media one
Tweet media two
Tweet media three
0
38
107
@fly51fly
fly51fly
4 months
[LG] FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information D Hwang [Google LLC] (2024) - Adam optimizer can be considered as an approximation of natural gradient descent through the use of diagonal empirical Fisher
Tweet media one
Tweet media two
2
34
99
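The tweet above rests on the view that Adam's second-moment estimate v_t approximates the diagonal empirical Fisher, so dividing by sqrt(v_t) acts like a rough natural-gradient preconditioner. The sketch below is a generic Adam-style update written to make that reading explicit; it is not the paper's modified FAdam recipe.

# Sketch of the "Adam ~ diagonal-Fisher natural gradient" view: the running
# mean of squared gradients v_t estimates the diagonal empirical Fisher, and
# dividing by sqrt(v_t) preconditions the step like a (rough) natural gradient.
import numpy as np

def adam_as_natural_gradient(grad_fn, w, steps=100, lr=1e-2,
                             beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)          # first moment (momentum)
    v = np.zeros_like(w)          # second moment ~ diagonal empirical Fisher
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # Fisher^(-1/2)-style preconditioning
    return w

# toy quadratic: minimize 0.5 * ||w||^2, whose gradient is w itself
print(adam_as_natural_gradient(lambda w: w, np.ones(3)))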
@fly51fly
fly51fly
5 months
[CL] Toward a Theory of Tokenization in LLMs N Rajaraman, J Jiao, K Ramchandran [UC Berkeley] (2024) - Transformers trained on data from certain simple high-order Markov processes fail to learn the underlying distribution and instead predict characters
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
29
100
@fly51fly
fly51fly
1 year
[CV] Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion J Tian, L Aggarwal, A Colaco, Z Kira, M Gonzalez-Franco [Georgia Institute of Technology & Google] (2023) - Proposes DiffSeg, an unsupervised zero-shot
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
99
@fly51fly
fly51fly
8 months
[CL] Large Language Models for Generative Information Extraction: A Survey Application of Large Language Models (LLMs) in generative information extraction. The study explores the remarkable capabilities of LLMs in text understanding and generation,
Tweet media one
Tweet media two
1
27
98
@fly51fly
fly51fly
2 years
[LG] GNNInterpreter: A Probabilistic Generative Model-Level Explanation for Graph Neural Networks (2023) #MachineLearning #ML #AI [1/3]
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
96
@fly51fly
fly51fly
7 months
[LG] Deep Networks Always Grok and Here is Why A I Humayun, R Balestriero, R Baraniuk [Rice University] (2024) - Grokking, where generalization occurs long after achieving near zero training error, is more widespread than previously thought. It occurs not
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
23
97
@fly51fly
fly51fly
9 months
[CL] LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment S Dou, E Zhou, Y Liu, S Gao, J Zhao... [Fudan University] (2023) - Vanilla supervised fine-tuning (SFT) with a massive amount of instruction data
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
30
97
@fly51fly
fly51fly
2 years
[LG] A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold Y Fei, X Wei, Y Liu, Z Li, M Chen [East China Normal University] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
23
94
@fly51fly
fly51fly
5 months
[LG] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models B Pan, Y Shen, H Liu, M Mishra, G Zhang, A Oliva, C Raffel, R Panda [MIT & University of Toronto] (2024) - MoE models usually require 2-4x more parameters
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
32
94
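A toy sketch of the "dense training, sparse inference" idea from the tweet above: evaluate and mix every expert during training, but route to only the top-k experts at inference. Shapes, the router, and k are illustrative rather than the paper's exact configuration.

# Toy MoE FFN: dense mixing of all experts in training mode, top-k routing in eval mode.
import torch, torch.nn as nn, torch.nn.functional as F

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                                    # x: (batch, d)
        gate = F.softmax(self.router(x), dim=-1)             # (batch, n_experts)
        if self.training:                                    # dense: mix every expert
            outs = torch.stack([e(x) for e in self.experts], dim=1)
            return (gate.unsqueeze(-1) * outs).sum(dim=1)
        topv, topi = gate.topk(self.k, dim=-1)               # sparse: run only selected experts
        topv = topv / topv.sum(dim=-1, keepdim=True)
        y = torch.zeros_like(x)
        for e_idx, expert in enumerate(self.experts):
            mask = topi.eq(e_idx)                            # which rows selected this expert
            rows = mask.any(dim=-1)
            if rows.any():
                w = (topv * mask).sum(dim=-1)[rows].unsqueeze(-1)
                y[rows] += w * expert(x[rows])
        return y

moe = DenseTrainSparseInferMoE().eval()
with torch.no_grad():
    print(moe(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])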
@fly51fly
fly51fly
2 years
[CL] Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval J Wieting, J H. Clark, W W. Cohen, G Neubig... [Google Research & CMU & University of California, San Diego] (2022) #MachineLearning #ML #AI #NLP #NLProc
Tweet media one
Tweet media two
Tweet media three
2
17
93
@fly51fly
fly51fly
2 years
[CL] Diffusion-LM Improves Controllable Text Generation X L Li, J Thickstun, I Gulrajani, P Liang, T B. Hashimoto [Stanford University] (2022) #MachineLearning #ML #AI #NLP #NLProc
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
18
90
@fly51fly
fly51fly
6 months
[LG] Model Stock: All we need is just a few fine-tuned models D Jang, S Yun, D Han [NAVER AI Lab] (2024) - Fine-tuned models with different random seeds lie on a thin shell in weight space layer-wise. - Proximity of weights to the center of this thin
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
91
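A sketch of the two ingredients the tweet above mentions: layer-wise averaging of a few fine-tuned models, then interpolation of that average toward the pretrained anchor. The paper derives the interpolation ratio per layer from the geometry (angles) of the fine-tuned weights; here it is left as a plain hyperparameter t, and the state dicts are toy tensors.

# Sketch of a Model Stock-style merge: average a couple of fine-tuned models
# per layer, then pull the average toward the pretrained ("anchor") weights.
import torch

pretrained = {"layer.w": torch.zeros(8)}
finetuned = [{k: v + torch.randn_like(v) for k, v in pretrained.items()} for _ in range(2)]

def model_stock_merge(pretrained, finetuned, t=0.6):
    merged = {}
    for k in pretrained:
        center = torch.stack([m[k] for m in finetuned]).mean(dim=0)  # average of fine-tuned weights
        merged[k] = t * center + (1 - t) * pretrained[k]             # interpolate toward the anchor
    return merged

print(model_stock_merge(pretrained, finetuned)["layer.w"])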
@fly51fly
fly51fly
2 years
[LG] Unsupervised Manifold Linearizing and Clustering T Ding, S Tong, K H R Chan, X Dai, Y Ma, B D. Haeffele [Johns Hopkins University & UC Berkeley] (2023) #MachineLearning #ML #AI #CV
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
89
@fly51fly
fly51fly
8 months
[CV] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs S Tong, Z Liu, Y Zhai, Y Ma, Y LeCun, S Xie [New York University & UC Berkeley] (2024) - The paper finds that multimodal large language models (MLLMs) still exhibit systematic
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
20
89
@fly51fly
fly51fly
5 months
[CL] Mixture of LoRA Experts X Wu, S Huang, F Wei [Microsoft Research Asia] (2024) - LoRA (Low-Rank Adaptation) is an effective technique for fine-tuning large pre-trained models. Composing multiple trained LoRAs can enhance performance across tasks. -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
29
88
@fly51fly
fly51fly
6 months
[CL] The pitfalls of next-token prediction G Bachmann, V Nagarajan [ETH Zürich & Google Research] (2024) - Next-token prediction has become a core part of modern language models, but there is a growing belief that it cannot truly model human thought and
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
86
@fly51fly
fly51fly
2 months
[CL] Tree Search for Language Model Agents J Y Koh, S McAleer, D Fried, R Salakhutdinov [CMU] (2024) - This paper proposes a search algorithm for language model agents to improve performance on multi-step web automation tasks. Language models struggle
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
31
86
@fly51fly
fly51fly
1 month
[CV] Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names R Sachdeva, G Shin, A Zisserman [University of Oxford] (2024) - The paper introduces Magiv2, a model that can generate high-quality chapter-wide manga transcripts with named
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
48
83
@fly51fly
fly51fly
4 months
[LG] Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers - Common transformer architectures like BERT and GPT-2 structurally cannot represent additive models like linear models or generalized additive
Tweet media one
Tweet media two
Tweet media three
0
34
84
@fly51fly
fly51fly
1 month
[LG] A Survey of Mamba - Mamba is an emerging deep learning architecture that has demonstrated remarkable success across diverse domains due to its powerful modeling capabilities and computational efficiency. - Mamba is inspired by classical
Tweet media one
Tweet media two
Tweet media three
0
27
84
@fly51fly
fly51fly
6 months
[LG] Masked Autoencoders are PDE Learners A Zhou, A B Farimani [CMU] (2024) - Masked autoencoders can learn useful latent representations for PDEs through self-supervised pretraining on unlabeled spatiotemporal data. This allows them to improve
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
25
85
@fly51fly
fly51fly
4 months
[LG] A Survey on the Memory Mechanism of Large Language Model based Agents - The memory module is a key component that differentiates agents from original large language models (LLMs), enabling agent-environment interactions. - Memory serves
Tweet media one
Tweet media two
Tweet media three
1
36
84
@fly51fly
fly51fly
11 months
[LG] Riemannian Residual Neural Networks I Katsman, E M Chen, S Holalkere, A Asch, A Lou, S Lim, C D Sa [Yale University & Cornell University & Stanford University & University of Central Florida] (2023) - The paper proposes a novel and principled
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
22
84
@fly51fly
fly51fly
6 months
[CL] Latent Attention for Linear Time Transformers The paper introduces the "Latte Transformer," a new transformer model that reduces the time complexity of the attention mechanism from quadratic to linear by employing latent vectors. This approach
Tweet media one
Tweet media two
0
20
82
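As a rough illustration of how latent vectors can make attention linear in sequence length (not the Latte paper's exact parameterization, and ignoring causality): tokens write into a small fixed set of learned latents and then read the summary back, so the cost is O(n·L) rather than O(n²).

# Generic latent-bottleneck attention sketch: tokens attend through L learned
# latent vectors, so cost scales with n * L instead of n^2.
import torch, torch.nn as nn

class LatentAttention(nn.Module):
    def __init__(self, d=64, n_latents=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d) / d**0.5)
        self.to_summary = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.to_tokens  = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, x):                                    # x: (batch, n_tokens, d)
        lat = self.latents.expand(x.size(0), -1, -1)
        summary, _ = self.to_summary(lat, x, x)              # latents attend over tokens: O(n*L)
        out, _ = self.to_tokens(x, summary, summary)         # tokens read the summary back: O(n*L)
        return out

print(LatentAttention()(torch.randn(2, 128, 64)).shape)      # torch.Size([2, 128, 64])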
@fly51fly
fly51fly
3 months
[LG] Learning Iterative Reasoning through Energy Diffusion Y Du, J Mao, J B. Tenenbaum [MIT] (2024) - This paper introduces iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason for various tasks by formulating
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
24
79
@fly51fly
fly51fly
3 months
[CL] What Do Language Models Learn in Context? The Structured Task Hypothesis - The paper examines three hypotheses on how LLMs perform in-context learning: task selection, meta-learning, and structured task composition. - It invalidates task
Tweet media one
Tweet media two
Tweet media three
0
18
79
@fly51fly
fly51fly
6 months
[CL] Datasets for Large Language Models: A Comprehensive Survey The survey delves into the significance of datasets for Large Language Models (LLMs) and their foundational role in LLM advancement. It emphasizes that high-quality datasets are essential
Tweet media one
Tweet media two
Tweet media three
1
18
79
@fly51fly
fly51fly
5 months
[LG] Gradient Networks S Chaudhari, S Pranav, J M. F. Moura [CMU] (2024) - The paper introduces gradient networks (GradNets) to directly parameterize gradients of arbitrary functions. GradNets have architectural constraints to ensure they correspond to
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
23
78
@fly51fly
fly51fly
9 months
[CL] ReLoRA: High-Rank Training Through Low-Rank Updates V Lialin, N Shivagunde, S Muckatira, A Rumshisky [University of Massachusetts Lowell] (2023) - Motivation: Understanding why overparametrized models with billions of parameters are needed for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
17
78
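A sketch of the ReLoRA-style training loop described above: train only a low-rank update B·A on top of a frozen weight, periodically fold it into the main weight, then re-initialize the adapter and its optimizer state and continue. The paper's learning-rate re-warming and partial optimizer reset are simplified away here, and the regression task is a toy stand-in.

# Merge-and-restart loop: each restart adds one low-rank increment to W.
import torch, torch.nn as nn

d, r = 64, 4
W = torch.randn(d, d) / d**0.5                       # "main" weights, updated only by merges
x, target = torch.randn(32, d), torch.randn(32, d)   # toy regression data

for restart in range(3):
    A = nn.Parameter(torch.randn(r, d) * 0.01)
    B = nn.Parameter(torch.zeros(d, r))              # zero init so the update starts as a no-op
    opt = torch.optim.Adam([A, B], lr=1e-2)          # fresh optimizer state each restart
    for _ in range(200):
        pred = x @ (W + B @ A).T
        loss = ((pred - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        W = W + B @ A                                # fold the low-rank update into W
    print(f"restart {restart}: loss {loss.item():.4f}")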
@fly51fly
fly51fly
6 months
[LG] Understanding Diffusion Models by Feynman's Path Integral The paper introduces a novel approach to understanding score-based diffusion models by employing Feynman's path integral formalism, traditionally used in quantum physics. It provides a
Tweet media one
Tweet media two
Tweet media three
3
20
77
@fly51fly
fly51fly
14 days
[LG] A Law of Next-Token Prediction in Large Language Models H He, W J. Su [University of Rochester & University of Pennsylvania] (2024)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
20
78
@fly51fly
fly51fly
2 months
[LG] How to Boost Any Loss Function R Nock, Y Mansour [Google Research] (2024) - The paper presents a boosting algorithm that can optimize essentially any loss function without using first-order information. This is in contrast to traditional boosting
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
19
76
@fly51fly
fly51fly
8 months
[CL] MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining J Portes, A Trott, S Havens, D King, A Venigalla, M Nadeem, N Sardana, D Khudia, J Frankle [MosaicML × Databricks] (2023) - Proposes MosaicBERT, a modified BERT architecture optimized
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
73
@fly51fly
fly51fly
6 months
[CL] How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning S Dutta, J Singh, S Chakrabarti, T Chakraborty [IIT Delhi & IIT Bombay] (2024) - Despite superior reasoning ability with Chain-of-Thought (CoT) prompting, the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
30
77
@fly51fly
fly51fly
4 months
[CL] Text Quality-Based Pruning for Efficient Training of Language Models V Sharma, K Padthe, N Ardalani, K Tirumala… [FAIR, Meta] (2024) - The paper proposes a novel method to numerically evaluate text quality in large unlabeled NLP datasets in a
Tweet media one
Tweet media two
1
18
75
@fly51fly
fly51fly
17 days
[LG] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations R Csordás, C Potts, C D. Manning, A Geiger [Stanford University] (2024) - RNNs learn non-linear representations to store and generate sequences,
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
76
@fly51fly
fly51fly
7 months
[LG] Learning a Decision Tree Algorithm with Transformers Y Zhuang, L Liu, C Singh, J Shang, J Gao [Microsoft Research & UC San Diego] (2024) - The paper proposes MetaTree, a transformer-based model that learns to generate decision trees. - MetaTree is
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
73
@fly51fly
fly51fly
5 months
[CL] sDPO: Don't Use Your Data All at Once D Kim, Y Kim, W Song, H Kim, Y Kim, S Kim, C Park [Upstage AI] (2024) - They propose stepwise DPO (sDPO), which divides available preference datasets and uses them step-by-step, instead of all at once like
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
22
73
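A toy sketch of the stepwise idea in the tweet above: split the preference data into chunks and, before each chunk, set the (frozen) DPO reference model to the policy obtained from the previous chunk. The one-layer "policy" and random preference pairs below are purely illustrative.

# Stepwise DPO: the reference model is refreshed to the previous step's policy.
import copy, torch, torch.nn as nn, torch.nn.functional as F

vocab, beta = 50, 0.1
policy = nn.Linear(8, vocab)                         # stand-in for an LLM

def seq_logprob(model, prompts, responses):
    """Sum of token log-probs of `responses` (batch, T) given per-example prompts."""
    logp = F.log_softmax(model(prompts), dim=-1)     # (batch, vocab), reused for every position
    return logp.gather(-1, responses).sum(dim=-1)

def dpo_step(policy, reference, batch, lr=1e-2):
    prompts, chosen, rejected = batch
    with torch.no_grad():
        ref_c = seq_logprob(reference, prompts, chosen)
        ref_r = seq_logprob(reference, prompts, rejected)
    pol_c = seq_logprob(policy, prompts, chosen)
    pol_r = seq_logprob(policy, prompts, rejected)
    loss = -F.logsigmoid(beta * ((pol_c - ref_c) - (pol_r - ref_r))).mean()
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

chunks = [(torch.randn(16, 8),
           torch.randint(0, vocab, (16, 4)),
           torch.randint(0, vocab, (16, 4))) for _ in range(3)]   # preference data split into chunks

reference = copy.deepcopy(policy)                    # step 0: reference = initial (SFT) model
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: dpo loss {dpo_step(policy, reference, chunk):.4f}")
    reference = copy.deepcopy(policy)                # sDPO: next step's reference = current policy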
@fly51fly
fly51fly
2 months
[CL] Transformer Layers as Painters Q Sun, M Pickett, A K Nain, L Jones [Emergence AI] (2024) - The paper investigates the internal workings of transformer layers in large pre-trained language models by conducting empirical studies on frozen models. -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
74
@fly51fly
fly51fly
9 months
[LG] Chain-of-Thought Reasoning is a Policy Improvement Operator
Tweet media one
2
18
74
@fly51fly
fly51fly
5 months
[LG] The Topos of Transformer Networks M J Villani, P McBurney [King’s College London] (2024) - The transformer neural network architecture has achieved great success in natural language processing tasks. This paper provides a theoretical analysis of the
Tweet media one
Tweet media two
Tweet media three
3
19
75
@fly51fly
fly51fly
2 years
[LG] What Do We Maximize in Self-Supervised Learning And Why Does Generalization Emerge? R Shwartz-Ziv, R Balestriero, K Kawaguchi, Y LeCun [New York University & Facebook] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
0
22
75
@fly51fly
fly51fly
1 year
[CL] Instruction Tuning for Large Language Models: A Survey S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang, J Li, R Hu, T Zhang, F Wu, G Wang [Zhejiang University & ...] (2023) - Instruction tuning (IT) is a crucial technique to enhance
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
25
74
@fly51fly
fly51fly
6 months
[LG] Denoising Autoregressive Representation Learning Y Li, J Bornschein, T Chen [Google DeepMind] (2024) - The paper proposes Denoising Autoregressive Representation Learning (DARL), which uses a decoder-only Transformer to predict image patches
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
20
69
@fly51fly
fly51fly
3 months
[CL] Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? G Yona, R Aharoni, M Geva [Google Research] (2024) - The paper proposes the notion of "faithful response uncertainty", which measures how well a language model
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
22
71
@fly51fly
fly51fly
7 months
[CL] A Language Model's Guide Through Latent Space D v Rütte, S Anagnostidis, G Bachmann, T Hofmann [ETH Zurich] (2024) - The paper systematically compares linear probing and guidance techniques on hidden representations in large language models (LLMs)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
28
70
@fly51fly
fly51fly
2 months
[LG] Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think L Sernau [Google DeepMind] (2024) - Infinite width models like NTKs historically underperform finite models. This is attributed to lack of feature learning. -
Tweet media one
Tweet media two
2
28
72
@fly51fly
fly51fly
2 months
[LG] A Survey on LoRA of Large Language Models - LoRA is an effective parameter efficient fine-tuning paradigm that updates dense layers of neural networks with pluggable low-rank matrices. - LoRA is computationally efficient and provides good
Tweet media one
Tweet media two
Tweet media three
0
29
71
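For reference, the basic LoRA layer the survey is about, as a minimal sketch: the dense weight stays frozen and a pluggable low-rank product B·A, scaled by alpha/r, is the only trainable part; after training it can be folded back into the weight. Ranks and sizes below are illustrative.

# Minimal LoRA linear layer: frozen W plus a trainable low-rank update.
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=64, d_out=64, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5,
                                   requires_grad=False)       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # trainable down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))          # trainable up-projection (zero init)
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the adapter into W for deployment, so inference adds no overhead."""
        with torch.no_grad():
            self.weight += self.scale * (self.B @ self.A)

layer = LoRALinear()
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])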
@fly51fly
fly51fly
3 months
[LG] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality T Dao, A Gu [Princeton University & CMU] (2024) - This paper shows theoretical connections between structured state space models (SSMs),
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
19
70
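The duality in the tweet above can be checked numerically at its scalar core: a gated linear recurrence h_t = a_t·h_{t-1} + b_t·x_t computes the same map as multiplying x by a lower-triangular (semiseparable) matrix, i.e. a masked, attention-like matrix form. The full SSD framework adds heads, state dimensions and block-wise algorithms on top of this.

# Recurrence vs. matrix ("masked attention") evaluation of a scalar gated SSM:
# M[t, s] = (a_{s+1} * ... * a_t) * b_s gives the equivalent lower-triangular matrix.
import numpy as np

rng = np.random.default_rng(0)
T = 6
a = rng.uniform(0.5, 1.0, T)      # per-step decay (the "selective" gate)
b = rng.standard_normal(T)        # per-step input weight
x = rng.standard_normal(T)

# recurrent (RNN-style) evaluation
h, h_rec = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    h_rec[t] = h

# matrix evaluation with the semiseparable lower-triangular M
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = np.prod(a[s + 1:t + 1]) * b[s]
h_mat = M @ x

print(np.allclose(h_rec, h_mat))  # True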
@fly51fly
fly51fly
2 months
[LG] Information-Theoretic Foundations for Machine Learning - The paper provides a mathematical framework using Bayesian statistics and information theory to analyze the fundamental limits of machine learning performance. - It establishes a
Tweet media one
Tweet media two
Tweet media three
0
22
70
@fly51fly
fly51fly
5 months
[LG] Continual Learning of Large Language Models: A Comprehensive Survey - Large language models (LLMs) have shown promise for artificial general intelligence (AGI), but face challenges adapting to new data distributions, tasks, and users without
Tweet media one
Tweet media two
Tweet media three
0
27
72
@fly51fly
fly51fly
2 years
[LG] Dataset Distillation: A Comprehensive Review R Yu, S Liu, X Wang [National University of Singapore] (2023) #MachineLearning #ML #AI [1/2]
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
68
@fly51fly
fly51fly
9 months
[LG] Manifold Diffusion Fields A A. Elhag, J M. Susskind, M A Bautista [Apple] (2023) - This paper proposes a method called Manifold Diffusion Fields (MDF) to learn generative models on Riemannian manifolds. - It utilizes insights from spectral
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
15
69
@fly51fly
fly51fly
2 months
[LG] On the Anatomy of Attention N Khatri, T Laakkonen, J Liu, V Wang-Maścianica [Quantinuum] (2024) - The paper introduces a category-theoretic diagrammatic formalism to systematically relate and reason about machine learning models. - The diagrams
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
68
@fly51fly
fly51fly
4 months
[CL] Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations Z Ma, Z Wang, J Chai [University of Michigan] (2024) - This work introduces a trial-and-demonstration (TnD) learning framework to examine the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
21
68
@fly51fly
fly51fly
4 months
[LG] Attention as an RNN L Feng, F Tung, H Hajimirsadeghi, M O Ahmed, Y Bengio, G Mori [Borealis AI] (2024) - Transformers brought breakthroughs in sequence modeling but are inefficient for low-resource settings due to quadratic complexity. - This work
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
20
68
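One direction of the claim above is easy to show concretely: a single attention output can be computed as a recurrence over (key, value) pairs with constant-size state (a running max, numerator and denominator), i.e. attention evaluated like an RNN. The paper's Aaren module additionally parallelizes this with a prefix scan, which is omitted in the sketch below.

# Streaming ("RNN-style") evaluation of one attention output with O(d) state.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 10
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def attention_as_rnn(q, K, V):
    m = -np.inf                       # running max of scores (for numerical stability)
    num = np.zeros_like(q)            # running sum of exp(score) * value
    den = 0.0                         # running sum of exp(score)
    for k, v in zip(K, V):
        s = q @ k
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den

# compare against the usual one-shot softmax attention
scores = q @ K.T
weights = np.exp(scores - scores.max()); weights /= weights.sum()
print(np.allclose(attention_as_rnn(q, K, V), weights @ V))   # True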
@fly51fly
fly51fly
3 months
[CL] Chain of Agents: Large Language Models Collaborating on Long-Context Tasks Y Zhang, R Sun, Y Chen, T Pfister… [Google Cloud AI Research & Penn State University] (2024) - Chain-of-Agents (CoA) is a multi-agent LLM collaboration framework for solving
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
25
67
@fly51fly
fly51fly
6 months
[LG] Learning Associative Memories with Gradient Descent V Cabannes, B Simsek, A Bietti [Meta AI & Flatiron Institute] (2024) - The paper studies the training dynamics of associative memory models trained with cross-entropy loss. It shows the dynamics can
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
66
@fly51fly
fly51fly
2 months
[CL] AgentInstruct: Toward Generative Teaching with Agentic Flows - Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. - Despite several successful use cases, researchers
Tweet media one
Tweet media two
Tweet media three
3
25
66
@fly51fly
fly51fly
6 months
[LG] What makes an image realistic? L Theis [Google DeepMind] (2024) - Quantifying realism remains challenging despite progress in generating realistic data. A good generative model alone is insufficient. - Probability and typicality fail to quantify
Tweet media one
1
21
66
@fly51fly
fly51fly
5 months
[CL] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models J Pfau, W Merrill, S R. Bowman [New York University] (2024) - Let's Think Dot by Dot shows that transformers can use meaningless filler tokens like '......' to solve complex
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
20
64
@fly51fly
fly51fly
2 months
[LG] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data A Antoniades, X Wang, Y Elazar, A Amayuelas, A Albalak, K Zhang, W Y Wang [University of California, Santa Barbara & Allen Institute for AI] (2024) -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
14
63
@fly51fly
fly51fly
2 months
[LG] Vision language models are blind P Rahmanzadehgervi, L Bolton, M R Taesiri, A T Nguyen [Auburn University] (2024) - Large language models with vision capabilities (VLMs) are scoring high on vision-understanding benchmarks but may not actually be
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
13
63
@fly51fly
fly51fly
19 days
[CL] LLM Pruning and Distillation in Practice: The Minitron Approach S T Sreenivas, S Muralidharan, R Joshi, M Chochowski… [NVIDIA] (2024) - The paper presents a comprehensive report on compressing Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
16
63
@fly51fly
fly51fly
2 months
[CV] LookupViT: Compressing visual information to a limited number of tokens - LookupViT introduces a novel Multi-Head Bidirectional Cross-attention (MHBC) module that enables effective information flow with significant computational savings. -
Tweet media one
Tweet media two
Tweet media three
1
11
63
@fly51fly
fly51fly
6 months
[LG] AutoEval Done Right: Using Synthetic Data for Model Evaluation P Boyeau, A N. Angelopoulos, N Yosef, J Malik, M I. Jordan (2024) - Autoevaluation using AI-labeled synthetic data can significantly reduce the need for human annotations when evaluating
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
62
@fly51fly
fly51fly
2 months
[LG] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold A Setlur, S Garg, X Geng, N Garg… [CMU & Google DeepMind] (2024) - Training on model-generated synthetic data is promising for finetuning LLMs on math
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
61
@fly51fly
fly51fly
2 months
[CL] Automata-based constraints for language model decoding T Koo, F Liu, L He [Google DeepMind] (2024) - Language models are often expected to generate strings in some formal language, but this is not guaranteed, especially with smaller LMs. Tuning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
17
61
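A toy version of the mechanism in the tweet above (not the paper's construction, which builds the automaton over the model's token vocabulary): at each decoding step, mask out every candidate token that would drive a DFA for the target format into a dead state, then sample from what remains. The character-level DFA and one-character "vocabulary" below are simplifications.

# Constrained decoding with a tiny DFA that accepts digits+ "." digits+ strings.
import numpy as np

# DFA states: 0 = start, 1 = integer part, 2 = just saw '.', 3 = fractional part, -1 = dead
def dfa_step(state, ch):
    if state in (0, 1) and ch.isdigit(): return 1
    if state == 1 and ch == ".":         return 2
    if state in (2, 3) and ch.isdigit(): return 3
    return -1

vocab = list("0123456789.") + ["<eos>"]
rng = np.random.default_rng(0)

state, out = 0, ""
for _ in range(8):
    logits = rng.standard_normal(len(vocab))               # stand-in for model logits
    for i, tok in enumerate(vocab):                        # mask disallowed tokens
        if tok == "<eos>":
            allowed = state == 3                           # only stop in an accepting state
        else:
            allowed = dfa_step(state, tok) != -1
        if not allowed:
            logits[i] = -np.inf
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    tok = vocab[rng.choice(len(vocab), p=probs)]
    if tok == "<eos>":
        break
    state, out = dfa_step(state, tok), out + tok
print(out)   # always a string the DFA can still extend to (or already is) a valid number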
@fly51fly
fly51fly
4 months
[CL] A Primer on the Inner Workings of Transformer-based Language Models - The paper provides a concise technical introduction to interpretability techniques used to analyze Transformer-based language models, focusing on the generative decoder-only
Tweet media one
Tweet media two
Tweet media three
2
19
60
@fly51fly
fly51fly
6 months
[CL] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking E Zelikman, G Harik, Y Shao, V Jayasiri, N Haber, N D. Goodman [Stanford University & Notbad AI Inc] (2024) - Quiet-STaR is a generalization of STaR that allows language models
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
17
61
@fly51fly
fly51fly
3 months
[LG] Attention as a Hypernetwork S Schug, S Kobayashi, Y Akram, J Sacramento, R Pascanu [ETH Zürich] (2024) - Transformers can generalize to novel compositional problem instances whose parts were seen during training. This paper aims to understand the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
22
59
@fly51fly
fly51fly
3 months
[LG] A Study of Optimizations for Fine-tuning Large Language Models - Fine-tuning large language models is computationally intensive due to high memory requirements. - Techniques like gradient checkpointing, low rank adaptation, ZeRO
Tweet media one
Tweet media two
Tweet media three
0
23
60
@fly51fly
fly51fly
4 months
[AI] How Far Are We From AGI - The paper provides a comprehensive overview of the current state of AI research and progress towards AGI, covering internal capabilities, interface to the external world, and underlying systems. - It proposes
Tweet media one
Tweet media two
Tweet media three
0
31
60
@fly51fly
fly51fly
3 months
[CL] When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs R Kamoi, Y Zhang, N Zhang, J Han, R Zhang [Penn State University] (2024) - Prior work often does not define research questions on self-correction in
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
13
61
@fly51fly
fly51fly
2 months
[CL] T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings B Deiseroth, M Brack, P Schramowski, K Kersting, S Weinbach [IPAI & Technical University Darmstadt] (2024) - Tokenizers require dedicated training which
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
21
60
@fly51fly
fly51fly
4 months
[LG] Understanding the performance gap between online and offline alignment algorithms Y Tang, D Z Guo, Z Zheng, D Calandriello… [Google DeepMind] (2024) - The paper investigates the performance gap between online and offline alignment algorithms for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
13
60
@fly51fly
fly51fly
7 months
[CL] LitLLM: A Toolkit for Scientific Literature Review LitLLM is a toolkit for conducting scientific literature reviews. It addresses the limitations of existing tools that use Large Language Models (LLMs) by operating on Retrieval Augmented
Tweet media one
Tweet media two
Tweet media three
0
19
60
@fly51fly
fly51fly
2 months
[CL] Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models - Vision-and-language navigation (VLN) is gaining increasing research attention with the rise of foundation models like BERT, GPT-3, and CLIP. This
Tweet media one
Tweet media two
1
21
59
@fly51fly
fly51fly
2 months
[CL] Understanding Transformers via N-gram Statistics T Nguyen [Google DeepMind] (2024) - The paper studies how transformer LLM predictions depend on context by approximating them with simple N-gram based rules formed from training data statistics. - An
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
58
@fly51fly
fly51fly
3 months
[CL] A Survey of Multimodal Large Language Model from A Data-centric Perspective - Multimodal large language models (MLLMs) extend traditional LLMs by integrating multiple modalities like text, vision, audio, etc. MLLMs require diverse,
Tweet media one
Tweet media two
Tweet media three
0
22
59
@fly51fly
fly51fly
25 days
[LG] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities - Model merging combines multiple models into one model at the parameter level, eliminating the need to store all models separately like ensemble
Tweet media one
Tweet media two
Tweet media three
1
20
59
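Two of the simplest parameter-level merges such surveys cover, sketched below: uniform weight averaging of fine-tuned models, and task arithmetic (base plus scaled task vectors). Real merging methods layer interference handling (TIES, DARE, and similar) on top of these primitives; the state dicts here are toy tensors.

# Parameter-level merging primitives: weight averaging and task arithmetic.
import torch

base = {"w": torch.zeros(4), "b": torch.zeros(2)}                 # shared pretrained weights
finetuned = [{k: v + torch.randn_like(v) for k, v in base.items()} for _ in range(3)]

def average_merge(models):
    return {k: torch.stack([m[k] for m in models]).mean(dim=0) for k in models[0]}

def task_arithmetic(base, models, lam=0.5):
    merged = {}
    for k in base:
        task_vectors = [m[k] - base[k] for m in models]           # per-model "task vector"
        merged[k] = base[k] + lam * torch.stack(task_vectors).sum(dim=0)
    return merged

print(average_merge(finetuned)["w"])
print(task_arithmetic(base, finetuned)["w"])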
@fly51fly
fly51fly
2 months
[CL] Patch-Level Training for Large Language Models - Patch-level training is introduced to improve the training efficiency of large language models, which reduces sequence length by compressing multiple tokens into patches. - During patch-level
Tweet media one
Tweet media two
Tweet media three
0
18
58
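A sketch of the compression step the tweet above mentions, under the assumption (not spelled out in the tweet) that a patch is formed by pooling K consecutive token embeddings: the transformer then trains on a sequence K times shorter before switching back to ordinary token-level training. Mean pooling and the sizes below are illustrative simplifications.

# Compress K consecutive token embeddings into one patch embedding.
import torch, torch.nn as nn

vocab, d, K = 1000, 64, 4
embed = nn.Embedding(vocab, d)

tokens = torch.randint(0, vocab, (2, 32))             # (batch, seq_len), seq_len divisible by K
tok_emb = embed(tokens)                               # (2, 32, d)
patches = tok_emb.view(2, 32 // K, K, d).mean(dim=2)  # (2, 8, d): a 4x shorter sequence

# a patch-level LM would now predict the next *patch* from the previous ones,
# e.g. with an ordinary causal transformer run over `patches`
print(patches.shape)                                   # torch.Size([2, 8, 64])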