fly51fly

@fly51fly

6,355 Followers · 2,167 Following · 9,352 Media · 23,880 Statuses

BUPT prof | Sharing latest AI papers & insights | Join me in embracing the AI revolution! #MachineLearning #AI #Innovation

Joined February 2009
@fly51fly
fly51fly
9 months
[LG] A Mathematical Perspective on Transformers: This paper presents a mathematical framework for analyzing Transformers, focusing on their interpretation as interacting particle systems. The study reveals
Tweet media one
0
50
291
@fly51fly
fly51fly
9 months
[LG] Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks This paper provides a mathematical explanation for the emergence of Fourier features in learning systems, such as neural networks. It demonstrates that if a neural
Tweet media one
3
47
227
@fly51fly
fly51fly
5 months
[CL] Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing Y Tian, B Peng, L Song, L Jin, D Yu, H Mi, D Yu [Tencent AI Lab] (2024) - LLMs still struggle with complex reasoning and planning tasks. Advanced prompting and fine-tuning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
35
206
@fly51fly
fly51fly
7 months
[LG] How Transformers Learn Causal Structure with Gradient Descent E Nichani, A Damian, J D. Lee [Princeton University] (2024) - The paper studies how transformers learn causal structure through gradient descent when trained on a novel in-context learning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
49
173
@fly51fly
fly51fly
2 months
[LG] Mixture of A Million Experts X O He [Google DeepMind] (2024) - The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. - Sparse
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
56
159
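A rough sketch of the idea behind the tweet above, in Python: a very large pool of tiny experts (here each expert is a single hidden unit with its own input and output vector), of which only the top-k are retrieved per token. The real PEER layer uses product-key retrieval to make the lookup sub-linear in the number of experts; scoring every key directly, as below, is only feasible for a small pool, and all names and sizes here are illustrative.

# Hedged sketch of a PEER-style layer: a huge pool of tiny ("single-neuron")
# experts, of which only the top-k are retrieved per token. The real method
# uses product-key retrieval to make the search sub-linear; this toy version
# scores all keys directly, which is only feasible for a small pool.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 1024, 8

keys  = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # retrieval keys
w_in  = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # expert "down" vectors
w_out = rng.standard_normal((n_experts, d_model)) / np.sqrt(d_model)  # expert "up" vectors

def peer_layer(x):
    """x: (d_model,) token activation -> (d_model,) output."""
    scores = keys @ x                                # one score per expert
    idx = np.argpartition(scores, -top_k)[-top_k:]   # indices of the top-k experts
    gates = np.exp(scores[idx] - scores[idx].max())
    gates /= gates.sum()                             # softmax over the selected experts
    # each selected expert is a single hidden unit: relu(w_in_i . x) * w_out_i
    hidden = np.maximum(w_in[idx] @ x, 0.0)
    return (gates * hidden) @ w_out[idx]             # weighted sum of tiny expert outputs

print(peer_layer(rng.standard_normal(d_model)).shape)  # (64,)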
@fly51fly
fly51fly
9 months
[LG] A Mathematical Guide to Operator Learning Operator learning aims to discover properties of an underlying dynamical system or partial differential equation (PDE) from data. Here, we present a step-by-step guide to operator learning. We explain the
Tweet media one
4
37
150
@fly51fly
fly51fly
6 months
[CL] Tokenization Is More Than Compression The paper challenges a widely accepted notion in Natural Language Processing (NLP) that fewer tokens result in better downstream task performance. The authors introduce a new tokenizer, PathPiece, designed to
Tweet media one
Tweet media two
1
38
148
@fly51fly
fly51fly
15 days
[IR] Meta Knowledge for Retrieval Augmented Large Language Models L Mombaerts, T Ding, A Banerjee, F Felice... [Amazon Web Services] (2024) - The paper introduces a novel data-centric workflow for retrieval augmented large language models (RAG),
Tweet media one
Tweet media two
Tweet media three
0
34
146
@fly51fly
fly51fly
4 months
[LG] Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory X Niu, B Bai, L Deng, W Han [Huawei Technologies Co., Ltd] (2024) - Increasing model size does not always improve performance. The scaling laws do not explain this
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
29
125
@fly51fly
fly51fly
2 months
[CL] A Closer Look into Mixture-of-Experts in Large Language Models - Experts act like fine-grained neurons. The gate embedding determines expert selection while the gate projection matrix controls neuron activation. Their heatmaps are correlated,
Tweet media one
Tweet media two
Tweet media three
3
38
127
@fly51fly
fly51fly
6 months
[LG] Fine-tuning with Very Large Dropout J Zhang, L Bottou [New York University & Facebook AI Research] (2024) - It is common to pre-train models on large datasets and then fine-tune them on smaller datasets for specific tasks. This violates the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
34
120
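A minimal sketch of the recipe named in the tweet above: fine-tune on the small downstream task while applying an unusually large dropout rate (e.g. 0.9) to the penultimate representation. The backbone, dropout rate and data below are placeholders, not the paper's setup.

# Minimal sketch (PyTorch): fine-tune a classifier head while applying a very
# large dropout rate to the penultimate representation.
import torch, torch.nn as nn, torch.nn.functional as F

backbone = nn.Sequential(nn.Linear(32, 128), nn.ReLU())  # stand-in for a pretrained feature extractor
head = nn.Linear(128, 10)
very_large_dropout = nn.Dropout(p=0.9)                   # much larger than the usual 0.1-0.5

opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=1e-2)
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))  # toy fine-tuning batch

for _ in range(5):
    feats = backbone(x)
    logits = head(very_large_dropout(feats))             # dropout on penultimate features only
    loss = F.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()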
@fly51fly
fly51fly
1 year
[LG] ExpeL: LLM Agents Are Experiential Learners A Zhao, D Huang, Q Xu, M Lin, Y Liu, G Huang [Tsinghua University] (2023) - ExpeL is an autonomous agent that learns from experiences without parameter updates, compatible with proprietary LLMs. - During
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
33
116
@fly51fly
fly51fly
6 months
[CL] Understanding Emergent Abilities of Language Models from the Loss Perspective Z Du, A Zeng, Y Dong, J Tang [Zhipu AI & Tsinghua University] (2024) - Recent studies question if emergent abilities in LMs are exclusive to large models, as smaller models
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
117
@fly51fly
fly51fly
6 months
[LG] Mechanics of Next Token Prediction with Self-Attention Y Li, Y Huang, M. E Ildiz, A S Rawat, S Oymak [University of Michigan & Google Research NYC] (2024) - Self-attention learns to retrieve high-priority tokens from the input sequence and creates a
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
34
114
@fly51fly
fly51fly
2 years
[LG] Training trajectories, mini-batch losses and the curious role of the learning rate M Sandler, A Zhmoginov, M Vladymyrov, N Miller [Google Research] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
113
@fly51fly
fly51fly
5 months
[LG] An Overview of Diffusion Models: Applications, Guided Generation, Statistical Rates and Optimization - Diffusion models have achieved tremendous success in generative modeling and sample generation in computer vision, audio, reinforcement
Tweet media one
Tweet media two
Tweet media three
1
25
114
@fly51fly
fly51fly
2 years
[LG] Riemannian Flow Matching on General Geometries R T. Q. Chen, Y Lipman [Meta AI] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
18
106
@fly51fly
fly51fly
4 months
[LG] DPO Meets PPO: Reinforced Token Optimization for RLHF - This paper models RLHF as an MDP, offering a token-wise characterization of LLM's generation process. It theoretically demonstrates advantages of token-wise MDP over sentence-wise bandit
Tweet media one
Tweet media two
Tweet media three
0
38
107
@fly51fly
fly51fly
4 months
[LG] FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information D Hwang [Google LLC] (2024) - Adam optimizer can be considered as an approximation of natural gradient descent through the use of diagonal empirical Fisher
Tweet media one
Tweet media two
2
34
99
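The tweet above rests on the view that Adam's second-moment estimate v_t approximates the diagonal empirical Fisher, so dividing by sqrt(v_t) acts like a rough natural-gradient preconditioner. The sketch below is a generic Adam-style update written to make that reading explicit; it is not the paper's modified FAdam recipe.

# Sketch of the "Adam ~ diagonal-Fisher natural gradient" view: the running
# mean of squared gradients v_t estimates the diagonal empirical Fisher, and
# dividing by sqrt(v_t) preconditions the step like a (rough) natural gradient.
import numpy as np

def adam_as_natural_gradient(grad_fn, w, steps=100, lr=1e-2,
                             beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)          # first moment (momentum)
    v = np.zeros_like(w)          # second moment ~ diagonal empirical Fisher
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # Fisher^(-1/2)-style preconditioning
    return w

# toy quadratic: minimize 0.5 * ||w||^2, whose gradient is w itself
print(adam_as_natural_gradient(lambda w: w, np.ones(3)))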
@fly51fly
fly51fly
5 months
[CL] Toward a Theory of Tokenization in LLMs N Rajaraman, J Jiao, K Ramchandran [UC Berkeley] (2024) - Transformers trained on data from certain simple high-order Markov processes fail to learn the underlying distribution and instead predict characters
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
29
100
@fly51fly
fly51fly
1 year
[CV] Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion J Tian, L Aggarwal, A Colaco, Z Kira, M Gonzalez-Franco [Georgia Institute of Technology & Google] (2023) - Proposes DiffSeg, an unsupervised zero-shot
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
99
@fly51fly
fly51fly
8 months
[CL] Large Language Models for Generative Information Extraction: A Survey Application of Large Language Models (LLMs) in generative information extraction. The study explores the remarkable capabilities of LLMs in text understanding and generation,
Tweet media one
Tweet media two
1
27
98
@fly51fly
fly51fly
2 years
[LG] GNNInterpreter: A Probabilistic Generative Model-Level Explanation for Graph Neural Networks (2023) #MachineLearning #ML #AI [1/3]
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
96
@fly51fly
fly51fly
7 months
[LG] Deep Networks Always Grok and Here is Why A I Humayun, R Balestriero, R Baraniuk [Rice University] (2024) - Grokking, where generalization occurs long after achieving near zero training error, is more widespread than previously thought. It occurs not
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
23
97
@fly51fly
fly51fly
9 months
[CL] LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment S Dou, E Zhou, Y Liu, S Gao, J Zhao... [Fudan University] (2023) - Vanilla supervised fine-tuning (SFT) with a massive amount of instruction data
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
30
97
@fly51fly
fly51fly
2 years
[LG] A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold Y Fei, X Wei, Y Liu, Z Li, M Chen [East China Normal University] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
23
94
@fly51fly
fly51fly
5 months
[LG] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models B Pan, Y Shen, H Liu, M Mishra, G Zhang, A Oliva, C Raffel, R Panda [MIT & University of Toronto] (2024) - MoE models usually require 2-4x more parameters
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
32
94
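A toy sketch of the "dense training, sparse inference" idea from the tweet above: evaluate and mix every expert during training, but route to only the top-k experts at inference. Shapes, the router, and k are illustrative rather than the paper's exact configuration.

# Toy MoE FFN: dense mixing of all experts in training mode, top-k routing in eval mode.
import torch, torch.nn as nn, torch.nn.functional as F

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                                    # x: (batch, d)
        gate = F.softmax(self.router(x), dim=-1)             # (batch, n_experts)
        if self.training:                                    # dense: mix every expert
            outs = torch.stack([e(x) for e in self.experts], dim=1)
            return (gate.unsqueeze(-1) * outs).sum(dim=1)
        topv, topi = gate.topk(self.k, dim=-1)               # sparse: run only selected experts
        topv = topv / topv.sum(dim=-1, keepdim=True)
        y = torch.zeros_like(x)
        for e_idx, expert in enumerate(self.experts):
            mask = topi.eq(e_idx)                            # which rows selected this expert
            rows = mask.any(dim=-1)
            if rows.any():
                w = (topv * mask).sum(dim=-1)[rows].unsqueeze(-1)
                y[rows] += w * expert(x[rows])
        return y

moe = DenseTrainSparseInferMoE().eval()
with torch.no_grad():
    print(moe(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])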
@fly51fly
fly51fly
2 years
[CL] Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval J Wieting, J H. Clark, W W. Cohen, G Neubig... [Google Research & CMU & University of California, San Diego] (2022) #MachineLearning #ML #AI #NLP #NLProc
Tweet media one
Tweet media two
Tweet media three
2
17
93
@fly51fly
fly51fly
2 years
[CL] Diffusion-LM Improves Controllable Text Generation X L Li, J Thickstun, I Gulrajani, P Liang, T B. Hashimoto [Stanford University] (2022) #MachineLearning #ML #AI #NLP #NLProc
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
18
90
@fly51fly
fly51fly
6 months
[LG] Model Stock: All we need is just a few fine-tuned models D Jang, S Yun, D Han [NAVER AI Lab] (2024) - Fine-tuned models with different random seeds lie on a thin shell in weight space layer-wise. - Proximity of weights to the center of this thin
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
26
91
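A sketch of the two ingredients the tweet above mentions: layer-wise averaging of a few fine-tuned models, then interpolation of that average toward the pretrained anchor. The paper derives the interpolation ratio per layer from the geometry (angles) of the fine-tuned weights; here it is left as a plain hyperparameter t, and the state dicts are toy tensors.

# Sketch of a Model Stock-style merge: average a couple of fine-tuned models
# per layer, then pull the average toward the pretrained ("anchor") weights.
import torch

pretrained = {"layer.w": torch.zeros(8)}
finetuned = [{k: v + torch.randn_like(v) for k, v in pretrained.items()} for _ in range(2)]

def model_stock_merge(pretrained, finetuned, t=0.6):
    merged = {}
    for k in pretrained:
        center = torch.stack([m[k] for m in finetuned]).mean(dim=0)  # average of fine-tuned weights
        merged[k] = t * center + (1 - t) * pretrained[k]             # interpolate toward the anchor
    return merged

print(model_stock_merge(pretrained, finetuned)["layer.w"])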
@fly51fly
fly51fly
2 years
[LG] Unsupervised Manifold Linearizing and Clustering T Ding, S Tong, K H R Chan, X Dai, Y Ma, B D. Haeffele [Johns Hopkins University & UC Berkeley] (2023) #MachineLearning #ML #AI #CV
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
89
@fly51fly
fly51fly
8 months
[CV] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs S Tong, Z Liu, Y Zhai, Y Ma, Y LeCun, S Xie [New York University & UC Berkeley] (2024) - The paper finds that multimodal large language models (MLLMs) still exhibit systematic
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
20
89
@fly51fly
fly51fly
5 months
[CL] Mixture of LoRA Experts X Wu, S Huang, F Wei [Microsoft Research Asia] (2024) - LoRA (Low-Rank Adaptation) is an effective technique for fine-tuning large pre-trained models. Composing multiple trained LoRAs can enhance performance across tasks. -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
29
88
@fly51fly
fly51fly
6 months
[CL] The pitfalls of next-token prediction G Bachmann, V Nagarajan [ETH Zürich & Google Research] (2024) - Next-token prediction has become a core part of modern language models, but there is a growing belief that it cannot truly model human thought and
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
86
@fly51fly
fly51fly
2 months
[CL] Tree Search for Language Model Agents J Y Koh, S McAleer, D Fried, R Salakhutdinov [CMU] (2024) - This paper proposes a search algorithm for language model agents to improve performance on multi-step web automation tasks. Language models struggle
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
31
86
@fly51fly
fly51fly
1 month
[CV] Tails Tell Tales: Chapter-Wide Manga Transcriptions with Character Names R Sachdeva, G Shin, A Zisserman [University of Oxford] (2024) - The paper introduces Magiv2, a model that can generate high-quality chapter-wide manga transcripts with named
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
48
83
@fly51fly
fly51fly
4 months
[LG] Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers - Common transformer architectures like BERT and GPT-2 structurally cannot represent additive models like linear models or generalized additive
Tweet media one
Tweet media two
Tweet media three
0
34
84
@fly51fly
fly51fly
1 month
[LG] A Survey of Mamba - Mamba is an emerging deep learning architecture that has demonstrated remarkable success across diverse domains due to its powerful modeling capabilities and computational efficiency. - Mamba is inspired by classical
Tweet media one
Tweet media two
Tweet media three
0
27
84
@fly51fly
fly51fly
6 months
[LG] Masked Autoencoders are PDE Learners A Zhou, A B Farimani [CMU] (2024) - Masked autoencoders can learn useful latent representations for PDEs through self-supervised pretraining on unlabeled spatiotemporal data. This allows them to improve
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
25
85
@fly51fly
fly51fly
4 months
[LG] A Survey on the Memory Mechanism of Large Language Model based Agents - The memory module is a key component that differentiates agents from original large language models (LLMs), enabling agent-environment interactions. - Memory serves
Tweet media one
Tweet media two
Tweet media three
1
36
84
@fly51fly
fly51fly
11 months
[LG] Riemannian Residual Neural Networks I Katsman, E M Chen, S Holalkere, A Asch, A Lou, S Lim, C D Sa [Yale University & Cornell University & Stanford University & University of Central Florida] (2023) - The paper proposes a novel and principled
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
22
84
@fly51fly
fly51fly
6 months
[CL] Latent Attention for Linear Time Transformers The paper introduces the "Latte Transformer," a new transformer model that reduces the time complexity of the attention mechanism from quadratic to linear by employing latent vectors. This approach
Tweet media one
Tweet media two
0
20
82
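As a rough illustration of how latent vectors can make attention linear in sequence length (not the Latte paper's exact parameterization, and ignoring causality): tokens write into a small fixed set of learned latents and then read the summary back, so the cost is O(n·L) rather than O(n²).

# Generic latent-bottleneck attention sketch: tokens attend through L learned
# latent vectors, so cost scales with n * L instead of n^2.
import torch, torch.nn as nn

class LatentAttention(nn.Module):
    def __init__(self, d=64, n_latents=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d) / d**0.5)
        self.to_summary = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.to_tokens  = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, x):                                    # x: (batch, n_tokens, d)
        lat = self.latents.expand(x.size(0), -1, -1)
        summary, _ = self.to_summary(lat, x, x)              # latents attend over tokens: O(n*L)
        out, _ = self.to_tokens(x, summary, summary)         # tokens read the summary back: O(n*L)
        return out

print(LatentAttention()(torch.randn(2, 128, 64)).shape)      # torch.Size([2, 128, 64])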
@fly51fly
fly51fly
3 months
[LG] Learning Iterative Reasoning through Energy Diffusion Y Du, J Mao, J B. Tenenbaum [MIT] (2024) - This paper introduces iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason for various tasks by formulating
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
24
79
@fly51fly
fly51fly
3 months
[CL] What Do Language Models Learn in Context? The Structured Task Hypothesis - The paper examines three hypotheses on how LLMs perform in-context learning: task selection, meta-learning, and structured task composition. - It invalidates task
Tweet media one
Tweet media two
Tweet media three
0
18
79
@fly51fly
fly51fly
6 months
[CL] Datasets for Large Language Models: A Comprehensive Survey The survey delves into the significance of datasets for Large Language Models (LLMs) and their foundational role in LLM advancement. It emphasizes that high-quality datasets are essential
Tweet media one
Tweet media two
Tweet media three
1
18
79
@fly51fly
fly51fly
5 months
[LG] Gradient Networks S Chaudhari, S Pranav, J M. F. Moura [CMU] (2024) - The paper introduces gradient networks (GradNets) to directly parameterize gradients of arbitrary functions. GradNets have architectural constraints to ensure they correspond to
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
23
78
@fly51fly
fly51fly
9 months
[CL] ReLoRA: High-Rank Training Through Low-Rank Updates V Lialin, N Shivagunde, S Muckatira, A Rumshisky [University of Massachusetts Lowell] (2023) - Motivation: Understanding why overparametrized models with billions of parameters are needed for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
17
78
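A sketch of the ReLoRA-style training loop described above: train only a low-rank update B·A on top of a frozen weight, periodically fold it into the main weight, then re-initialize the adapter and its optimizer state and continue. The paper's learning-rate re-warming and partial optimizer reset are simplified away here, and the regression task is a toy stand-in.

# Merge-and-restart loop: each restart adds one low-rank increment to W.
import torch, torch.nn as nn

d, r = 64, 4
W = torch.randn(d, d) / d**0.5                       # "main" weights, updated only by merges
x, target = torch.randn(32, d), torch.randn(32, d)   # toy regression data

for restart in range(3):
    A = nn.Parameter(torch.randn(r, d) * 0.01)
    B = nn.Parameter(torch.zeros(d, r))              # zero init so the update starts as a no-op
    opt = torch.optim.Adam([A, B], lr=1e-2)          # fresh optimizer state each restart
    for _ in range(200):
        pred = x @ (W + B @ A).T
        loss = ((pred - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        W = W + B @ A                                # fold the low-rank update into W
    print(f"restart {restart}: loss {loss.item():.4f}")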
@fly51fly
fly51fly
6 months
[LG] Understanding Diffusion Models by Feynman's Path Integral The paper introduces a novel approach to understanding score-based diffusion models by employing Feynman's path integral formalism, traditionally used in quantum physics. It provides a
Tweet media one
Tweet media two
Tweet media three
3
20
77
@fly51fly
fly51fly
14 days
[LG] A Law of Next-Token Prediction in Large Language Models H He, W J. Su [University of Rochester & University of Pennsylvania] (2024)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
20
78
@fly51fly
fly51fly
2 months
[LG] How to Boost Any Loss Function R Nock, Y Mansour [Google Research] (2024) - The paper presents a boosting algorithm that can optimize essentially any loss function without using first-order information. This is in contrast to traditional boosting
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
19
76
@fly51fly
fly51fly
8 months
[CL] MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining J Portes, A Trott, S Havens, D King, A Venigalla, M Nadeem, N Sardana, D Khudia, J Frankle [MosaicML × Databricks] (2023) - Proposes MosaicBERT, a modified BERT architecture optimized
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
73
@fly51fly
fly51fly
6 months
[CL] How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning S Dutta, J Singh, S Chakrabarti, T Chakraborty [IIT Delhi & IIT Bombay] (2024) - Despite superior reasoning ability with Chain-of-Thought (CoT) prompting, the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
30
77
@fly51fly
fly51fly
4 months
[CL] Text Quality-Based Pruning for Efficient Training of Language Models V Sharma, K Padthe, N Ardalani, K Tirumala… [FAIR, Meta] (2024) - The paper proposes a novel method to numerically evaluate text quality in large unlabeled NLP datasets in a
Tweet media one
Tweet media two
1
18
75
@fly51fly
fly51fly
17 days
[LG] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations R Csordás, C Potts, C D. Manning, A Geiger [Stanford University] (2024) - RNNs learn non-linear representations to store and generate sequences,
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
76
@fly51fly
fly51fly
7 months
[LG] Learning a Decision Tree Algorithm with Transformers Y Zhuang, L Liu, C Singh, J Shang, J Gao [Microsoft Research & UC San Diego] (2024) - The paper proposes MetaTree, a transformer-based model that learns to generate decision trees. - MetaTree is
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
17
73
@fly51fly
fly51fly
5 months
[CL] sDPO: Don't Use Your Data All at Once D Kim, Y Kim, W Song, H Kim, Y Kim, S Kim, C Park [Upstage AI] (2024) - They propose stepwise DPO (sDPO), which divides available preference datasets and uses them step-by-step, instead of all at once like
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
22
73
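A toy sketch of the stepwise idea in the tweet above: split the preference data into chunks and, before each chunk, set the (frozen) DPO reference model to the policy obtained from the previous chunk. The one-layer "policy" and random preference pairs below are purely illustrative.

# Stepwise DPO: the reference model is refreshed to the previous step's policy.
import copy, torch, torch.nn as nn, torch.nn.functional as F

vocab, beta = 50, 0.1
policy = nn.Linear(8, vocab)                         # stand-in for an LLM

def seq_logprob(model, prompts, responses):
    """Sum of token log-probs of `responses` (batch, T) given per-example prompts."""
    logp = F.log_softmax(model(prompts), dim=-1)     # (batch, vocab), reused for every position
    return logp.gather(-1, responses).sum(dim=-1)

def dpo_step(policy, reference, batch, lr=1e-2):
    prompts, chosen, rejected = batch
    with torch.no_grad():
        ref_c = seq_logprob(reference, prompts, chosen)
        ref_r = seq_logprob(reference, prompts, rejected)
    pol_c = seq_logprob(policy, prompts, chosen)
    pol_r = seq_logprob(policy, prompts, rejected)
    loss = -F.logsigmoid(beta * ((pol_c - ref_c) - (pol_r - ref_r))).mean()
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

chunks = [(torch.randn(16, 8),
           torch.randint(0, vocab, (16, 4)),
           torch.randint(0, vocab, (16, 4))) for _ in range(3)]   # preference data split into chunks

reference = copy.deepcopy(policy)                    # step 0: reference = initial (SFT) model
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: dpo loss {dpo_step(policy, reference, chunk):.4f}")
    reference = copy.deepcopy(policy)                # sDPO: next step's reference = current policy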
@fly51fly
fly51fly
2 months
[CL] Transformer Layers as Painters Q Sun, M Pickett, A K Nain, L Jones [Emergence AI] (2024) - The paper investigates the internal workings of transformer layers in large pre-trained language models by conducting empirical studies on frozen models. -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
74
@fly51fly
fly51fly
9 months
[LG] Chain-of-Thought Reasoning is a Policy Improvement Operator
Tweet media one
2
18
74
@fly51fly
fly51fly
5 months
[LG] The Topos of Transformer Networks M J Villani, P McBurney [King’s College London] (2024) - The transformer neural network architecture has achieved great success in natural language processing tasks. This paper provides a theoretical analysis of the
Tweet media one
Tweet media two
Tweet media three
3
19
75
@fly51fly
fly51fly
2 years
[LG] What Do We Maximize in Self-Supervised Learning And Why Does Generalization Emerge? R Shwartz-Ziv, R Balestriero, K Kawaguchi, Y LeCun [New York University & Facebook] (2023) #MachineLearning #ML #AI
Tweet media one
Tweet media two
0
22
75
@fly51fly
fly51fly
1 year
[CL] Instruction Tuning for Large Language Models: A Survey S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang, J Li, R Hu, T Zhang, F Wu, G Wang [Zhejiang University & ...] (2023) - Instruction tuning (IT) is a crucial technique to enhance
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
25
74
@fly51fly
fly51fly
6 months
[LG] Denoising Autoregressive Representation Learning Y Li, J Bornschein, T Chen [Google DeepMind] (2024) - The paper proposes Denoising Autoregressive Representation Learning (DARL), which uses a decoder-only Transformer to predict image patches
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
20
69
@fly51fly
fly51fly
3 months
[CL] Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? G Yona, R Aharoni, M Geva [Google Research] (2024) - The paper proposes the notion of "faithful response uncertainty", which measures how well a language model
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
22
71
@fly51fly
fly51fly
7 months
[CL] A Language Model's Guide Through Latent Space D v Rütte, S Anagnostidis, G Bachmann, T Hofmann [ETH Zurich] (2024) - The paper systematically compares linear probing and guidance techniques on hidden representations in large language models (LLMs)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
28
70
@fly51fly
fly51fly
2 months
[LG] Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think L Sernau [Google DeepMind] (2024) - Infinite width models like NTKs historically underperform finite models. This is attributed to lack of feature learning. -
Tweet media one
Tweet media two
2
28
72
@fly51fly
fly51fly
2 months
[LG] A Survey on LoRA of Large Language Models - LoRA is an effective parameter efficient fine-tuning paradigm that updates dense layers of neural networks with pluggable low-rank matrices. - LoRA is computationally efficient and provides good
Tweet media one
Tweet media two
Tweet media three
0
29
71
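For reference, the basic LoRA layer the survey is about, as a minimal sketch: the dense weight stays frozen and a pluggable low-rank product B·A, scaled by alpha/r, is the only trainable part; after training it can be folded back into the weight. Ranks and sizes below are illustrative.

# Minimal LoRA linear layer: frozen W plus a trainable low-rank update.
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=64, d_out=64, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5,
                                   requires_grad=False)       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # trainable down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))          # trainable up-projection (zero init)
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        """Fold the adapter into W for deployment, so inference adds no overhead."""
        with torch.no_grad():
            self.weight += self.scale * (self.B @ self.A)

layer = LoRALinear()
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])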
@fly51fly
fly51fly
3 months
[LG] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality T Dao, A Gu [Princeton University & CMU] (2024) - This paper shows theoretical connections between structured state space models (SSMs),
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
19
70
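The duality in the tweet above can be checked numerically at its scalar core: a gated linear recurrence h_t = a_t·h_{t-1} + b_t·x_t computes the same map as multiplying x by a lower-triangular (semiseparable) matrix, i.e. a masked, attention-like matrix form. The full SSD framework adds heads, state dimensions and block-wise algorithms on top of this.

# Recurrence vs. matrix ("masked attention") evaluation of a scalar gated SSM:
# M[t, s] = (a_{s+1} * ... * a_t) * b_s gives the equivalent lower-triangular matrix.
import numpy as np

rng = np.random.default_rng(0)
T = 6
a = rng.uniform(0.5, 1.0, T)      # per-step decay (the "selective" gate)
b = rng.standard_normal(T)        # per-step input weight
x = rng.standard_normal(T)

# recurrent (RNN-style) evaluation
h, h_rec = 0.0, np.zeros(T)
for t in range(T):
    h = a[t] * h + b[t] * x[t]
    h_rec[t] = h

# matrix evaluation with the semiseparable lower-triangular M
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = np.prod(a[s + 1:t + 1]) * b[s]
h_mat = M @ x

print(np.allclose(h_rec, h_mat))  # True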
@fly51fly
fly51fly
2 months
[LG] Information-Theoretic Foundations for Machine Learning - The paper provides a mathematical framework using Bayesian statistics and information theory to analyze the fundamental limits of machine learning performance. - It establishes a
Tweet media one
Tweet media two
Tweet media three
0
22
70
@fly51fly
fly51fly
5 months
[LG] Continual Learning of Large Language Models: A Comprehensive Survey - Large language models (LLMs) have shown promise for artificial general intelligence (AGI), but face challenges adapting to new data distributions, tasks, and users without
Tweet media one
Tweet media two
Tweet media three
0
27
72
@fly51fly
fly51fly
2 years
[LG] Dataset Distillation: A Comprehensive Review R Yu, S Liu, X Wang [National University of Singapore] (2023) #MachineLearning #ML #AI [1/2]
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
68
@fly51fly
fly51fly
9 months
[LG] Manifold Diffusion Fields A A. Elhag, J M. Susskind, M A Bautista [Apple] (2023) - This paper proposes a method called Manifold Diffusion Fields (MDF) to learn generative models on Riemannian manifolds. - It utilizes insights from spectral
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
15
69
@fly51fly
fly51fly
2 months
[LG] On the Anatomy of Attention N Khatri, T Laakkonen, J Liu, V Wang-Maścianica [Quantinuum] (2024) - The paper introduces a category-theoretic diagrammatic formalism to systematically relate and reason about machine learning models. - The diagrams
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
68
@fly51fly
fly51fly
4 months
[CL] Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations Z Ma, Z Wang, J Chai [University of Michigan] (2024) - This work introduces a trial-and-demonstration (TnD) learning framework to examine the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
21
68
@fly51fly
fly51fly
4 months
[LG] Attention as an RNN L Feng, F Tung, H Hajimirsadeghi, M O Ahmed, Y Bengio, G Mori [Borealis AI] (2024) - Transformers brought breakthroughs in sequence modeling but are inefficient for low-resource settings due to quadratic complexity. - This work
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
20
68
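One direction of the claim above is easy to show concretely: a single attention output can be computed as a recurrence over (key, value) pairs with constant-size state (a running max, numerator and denominator), i.e. attention evaluated like an RNN. The paper's Aaren module additionally parallelizes this with a prefix scan, which is omitted in the sketch below.

# Streaming ("RNN-style") evaluation of one attention output with O(d) state.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 10
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

def attention_as_rnn(q, K, V):
    m = -np.inf                       # running max of scores (for numerical stability)
    num = np.zeros_like(q)            # running sum of exp(score) * value
    den = 0.0                         # running sum of exp(score)
    for k, v in zip(K, V):
        s = q @ k
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den

# compare against the usual one-shot softmax attention
scores = q @ K.T
weights = np.exp(scores - scores.max()); weights /= weights.sum()
print(np.allclose(attention_as_rnn(q, K, V), weights @ V))   # True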
@fly51fly
fly51fly
3 months
[CL] Chain of Agents: Large Language Models Collaborating on Long-Context Tasks Y Zhang, R Sun, Y Chen, T Pfister… [Google Cloud AI Research & Penn State University] (2024) - Chain-of-Agents (CoA) is a multi-agent LLM collaboration framework for solving
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
25
67
@fly51fly
fly51fly
6 months
[LG] Learning Associative Memories with Gradient Descent V Cabannes, B Simsek, A Bietti [Meta AI & Flatiron Institute] (2024) - The paper studies the training dynamics of associative memory models trained with cross-entropy loss. It shows the dynamics can
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
66
@fly51fly
fly51fly
2 months
[CL] AgentInstruct: Toward Generative Teaching with Agentic Flows - Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. - Despite several successful use cases, researchers
Tweet media one
Tweet media two
Tweet media three
3
25
66
@fly51fly
fly51fly
6 months
[LG] What makes an image realistic? L Theis [Google DeepMind] (2024) - Quantifying realism remains challenging despite progress in generating realistic data. A good generative model alone is insufficient. - Probability and typicality fail to quantify
Tweet media one
1
21
66
@fly51fly
fly51fly
5 months
[CL] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models J Pfau, W Merrill, S R. Bowman [New York University] (2024) - Let's Think Dot by Dot shows that transformers can use meaningless filler tokens like '......' to solve complex
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
20
64
@fly51fly
fly51fly
2 months
[LG] Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data A Antoniades, X Wang, Y Elazar, A Amayuelas, A Albalak, K Zhang, W Y Wang [University of California, Santa Barbara & Allen Institute for AI] (2024) -
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
14
63
@fly51fly
fly51fly
2 months
[LG] Vision language models are blind P Rahmanzadehgervi, L Bolton, M R Taesiri, A T Nguyen [Auburn University] (2024) - Large language models with vision capabilities (VLMs) are scoring high on vision-understanding benchmarks but may not actually be
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
13
63
@fly51fly
fly51fly
19 days
[CL] LLM Pruning and Distillation in Practice: The Minitron Approach S T Sreenivas, S Muralidharan, R Joshi, M Chochowski… [NVIDIA] (2024) - The paper presents a comprehensive report on compressing Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
16
63
@fly51fly
fly51fly
2 months
[CV] LookupViT: Compressing visual information to a limited number of tokens - LookupViT introduces a novel Multi-Head Bidirectional Cross-attention (MHBC) module that enables effective information flow with significant computational savings. -
Tweet media one
Tweet media two
Tweet media three
1
11
63
@fly51fly
fly51fly
6 months
[LG] AutoEval Done Right: Using Synthetic Data for Model Evaluation P Boyeau, A N. Angelopoulos, N Yosef, J Malik, M I. Jordan (2024) - Autoevaluation using AI-labeled synthetic data can significantly reduce the need for human annotations when evaluating
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
19
62
@fly51fly
fly51fly
2 months
[LG] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold A Setlur, S Garg, X Geng, N Garg… [CMU & Google DeepMind] (2024) - Training on model-generated synthetic data is promising for finetuning LLMs on math
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
19
61
@fly51fly
fly51fly
2 months
[CL] Automata-based constraints for language model decoding T Koo, F Liu, L He [Google DeepMind] (2024) - Language models are often expected to generate strings in some formal language, but this is not guaranteed, especially with smaller LMs. Tuning
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
17
61
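A toy version of the mechanism in the tweet above (not the paper's construction, which builds the automaton over the model's token vocabulary): at each decoding step, mask out every candidate token that would drive a DFA for the target format into a dead state, then sample from what remains. The character-level DFA and one-character "vocabulary" below are simplifications.

# Constrained decoding with a tiny DFA that accepts digits+ "." digits+ strings.
import numpy as np

# DFA states: 0 = start, 1 = integer part, 2 = just saw '.', 3 = fractional part, -1 = dead
def dfa_step(state, ch):
    if state in (0, 1) and ch.isdigit(): return 1
    if state == 1 and ch == ".":         return 2
    if state in (2, 3) and ch.isdigit(): return 3
    return -1

vocab = list("0123456789.") + ["<eos>"]
rng = np.random.default_rng(0)

state, out = 0, ""
for _ in range(8):
    logits = rng.standard_normal(len(vocab))               # stand-in for model logits
    for i, tok in enumerate(vocab):                        # mask disallowed tokens
        if tok == "<eos>":
            allowed = state == 3                           # only stop in an accepting state
        else:
            allowed = dfa_step(state, tok) != -1
        if not allowed:
            logits[i] = -np.inf
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    tok = vocab[rng.choice(len(vocab), p=probs)]
    if tok == "<eos>":
        break
    state, out = dfa_step(state, tok), out + tok
print(out)   # always a string the DFA can still extend to (or already is) a valid number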
@fly51fly
fly51fly
4 months
[CL] A Primer on the Inner Workings of Transformer-based Language Models - The paper provides a concise technical introduction to interpretability techniques used to analyze Transformer-based language models, focusing on the generative decoder-only
Tweet media one
Tweet media two
Tweet media three
2
19
60
@fly51fly
fly51fly
6 months
[CL] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking E Zelikman, G Harik, Y Shao, V Jayasiri, N Haber, N D. Goodman [Stanford University & Notbad AI Inc] (2024) - Quiet-STaR is a generalization of STaR that allows language models
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
17
61
@fly51fly
fly51fly
3 months
[LG] Attention as a Hypernetwork S Schug, S Kobayashi, Y Akram, J Sacramento, R Pascanu [ETH Zürich] (2024) - Transformers can generalize to novel compositional problem instances whose parts were seen during training. This paper aims to understand the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
22
59
@fly51fly
fly51fly
3 months
[LG] A Study of Optimizations for Fine-tuning Large Language Models - Fine-tuning large language models is computationally intensive due to high memory requirements. - Techniques like gradient checkpointing, low rank adaptation, ZeRO
Tweet media one
Tweet media two
Tweet media three
0
23
60
@fly51fly
fly51fly
4 months
[AI] How Far Are We From AGI - The paper provides a comprehensive overview of the current state of AI research and progress towards AGI, covering internal capabilities, interface to the external world, and underlying systems. - It proposes
Tweet media one
Tweet media two
Tweet media three
0
31
60
@fly51fly
fly51fly
3 months
[CL] When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs R Kamoi, Y Zhang, N Zhang, J Han, R Zhang [Penn State University] (2024) - Prior work often does not define research questions on self-correction in
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
13
61
@fly51fly
fly51fly
2 months
[CL] T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings B Deiseroth, M Brack, P Schramowski, K Kersting, S Weinbach [IPAI & Technical University Darmstadt] (2024) - Tokenizers require dedicated training which
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
21
60
@fly51fly
fly51fly
4 months
[LG] Understanding the performance gap between online and offline alignment algorithms Y Tang, D Z Guo, Z Zheng, D Calandriello… [Google DeepMind] (2024) - The paper investigates the performance gap between online and offline alignment algorithms for
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
13
60
@fly51fly
fly51fly
7 months
[CL] LitLLM: A Toolkit for Scientific Literature Review LitLLM is a toolkit for conducting scientific literature reviews. It addresses the limitations of existing tools that use Large Language Models (LLMs) by operating on Retrieval Augmented
Tweet media one
Tweet media two
Tweet media three
0
19
60
@fly51fly
fly51fly
2 months
[CL] Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models - Vision-and-language navigation (VLN) is gaining increasing research attention with the rise of foundation models like BERT, GPT-3, and CLIP. This
Tweet media one
Tweet media two
1
21
59
@fly51fly
fly51fly
2 months
[CL] Understanding Transformers via N-gram Statistics T Nguyen [Google DeepMind] (2024) - The paper studies how transformer LLM predictions depend on context by approximating them with simple N-gram based rules formed from training data statistics. - An
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
16
58
@fly51fly
fly51fly
3 months
[CL] A Survey of Multimodal Large Language Model from A Data-centric Perspective - Multimodal large language models (MLLMs) extend traditional LLMs by integrating multiple modalities like text, vision, audio, etc. MLLMs require diverse,
Tweet media one
Tweet media two
Tweet media three
0
22
59
@fly51fly
fly51fly
25 days
[LG] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities - Model merging combines multiple models into one model at the parameter level, eliminating the need to store all models separately like ensemble
Tweet media one
Tweet media two
Tweet media three
1
20
59
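Two of the simplest parameter-level merges such surveys cover, sketched below: uniform weight averaging of fine-tuned models, and task arithmetic (base plus scaled task vectors). Real merging methods layer interference handling (TIES, DARE, and similar) on top of these primitives; the state dicts here are toy tensors.

# Parameter-level merging primitives: weight averaging and task arithmetic.
import torch

base = {"w": torch.zeros(4), "b": torch.zeros(2)}                 # shared pretrained weights
finetuned = [{k: v + torch.randn_like(v) for k, v in base.items()} for _ in range(3)]

def average_merge(models):
    return {k: torch.stack([m[k] for m in models]).mean(dim=0) for k in models[0]}

def task_arithmetic(base, models, lam=0.5):
    merged = {}
    for k in base:
        task_vectors = [m[k] - base[k] for m in models]           # per-model "task vector"
        merged[k] = base[k] + lam * torch.stack(task_vectors).sum(dim=0)
    return merged

print(average_merge(finetuned)["w"])
print(task_arithmetic(base, finetuned)["w"])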
@fly51fly
fly51fly
2 months
[CL] Patch-Level Training for Large Language Models - Patch-level training is introduced to improve the training efficiency of large language models, which reduces sequence length by compressing multiple tokens into patches. - During patch-level
Tweet media one
Tweet media two
Tweet media three
0
18
58
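A sketch of the compression step the tweet above mentions, under the assumption (not spelled out in the tweet) that a patch is formed by pooling K consecutive token embeddings: the transformer then trains on a sequence K times shorter before switching back to ordinary token-level training. Mean pooling and the sizes below are illustrative simplifications.

# Compress K consecutive token embeddings into one patch embedding.
import torch, torch.nn as nn

vocab, d, K = 1000, 64, 4
embed = nn.Embedding(vocab, d)

tokens = torch.randint(0, vocab, (2, 32))             # (batch, seq_len), seq_len divisible by K
tok_emb = embed(tokens)                               # (2, 32, d)
patches = tok_emb.view(2, 32 // K, K, d).mean(dim=2)  # (2, 8, d): a 4x shorter sequence

# a patch-level LM would now predict the next *patch* from the previous ones,
# e.g. with an ordinary causal transformer run over `patches`
print(patches.shape)                                   # torch.Size([2, 8, 64])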