Impressed by the performance of Mamba and a believer in RNNs? We provide a simple alternative! Excited to share Gated Linear Attention (GLA-Transformer). (1/n)
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN that updates its hidden state with a gradient step at every token, i.e., test-time training
- Achieves better perplexity than Mamba
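Roughly, the idea (a minimal sketch, not the paper's exact parameterization): the hidden state is a small model whose weights get one gradient step per token on a self-supervised loss.

```python
import torch

def ttt_layer(tokens, lr=0.1):
    """Toy sketch of test-time training. The 'hidden state' is a weight matrix W,
    updated by one gradient step per token on a simple reconstruction loss
    (the paper's actual objective and parameterization differ)."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                      # hidden state = weights of a linear model
    outputs = []
    for x in tokens:                           # x: (d,)
        pred = W @ x
        grad = torch.outer(pred - x, x)        # grad of 0.5 * ||W x - x||^2 w.r.t. W
        W = W - lr * grad                      # "learning at test time"
        outputs.append(W @ x)                  # read out with the updated state
    return torch.stack(outputs)
```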
Google announces TransformerFAM
Feedback attention is working memory
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel
HGRN2
Gated Linear RNNs with State Expansion
Hierarchically gated linear RNN (HGRN,Qin et al. 2023) has demonstrated competitive training speed and performance in language modeling, while offering efficient inference. However, the recurrent state size of HGRN remains
S4D, S5, RWKV & DeepMind's diagonal linear RNN = absorbed by . Paper really underrated. Imagine if ELMo had used this parallelized RNN for training - Transformers wouldn't rule anymore lol.
Thrilled about recent linear attn developments like RetNet, Based, GLA, GateLoop, RWKV-v5/6, TransnormerLLM etc.! Linear xfmrs have massive potential. Check out my repo for these models' efficient implementations. It's a work in progress.
Anyone who has trained a Transformer has viscerally felt its O(T^2) cost. It is not tractable to train Transformers end-to-end on long contexts. Here's a writeup of the research direction I believe is most likely to solve this: linear transformers.
1/7
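For context, a minimal sketch of the core idea behind linear transformers: drop the softmax and causal attention becomes a linear-time recurrence with a fixed-size state (the feature map and normalizer here are illustrative choices, not any particular paper's):

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Sketch: causal linear attention as an RNN. q, k, v: (T, d).
    O(T) time with a fixed-size (d x d) state instead of the O(T^2) attention matrix."""
    T, d = q.shape
    phi = lambda x: F.elu(x) + 1                     # one common positive feature map
    S = torch.zeros(d, d)                            # running sum of phi(k_s) v_s^T
    z = torch.zeros(d)                               # running sum of phi(k_s) (normalizer)
    out = []
    for t in range(T):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        out.append((phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-6))
    return torch.stack(out)
```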
We've just benchmarked leading efficient large language models against LLaMA, including MAMBA, TransNormerLLM, HGRN and TNN, focusing on training loss and speed on a 30B subset of our own collected corpus, with models up to 3B parameters.
As shown, TransNormerLLM and MAMBA achieve slightly
I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: .
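As I understand it (a rough sketch, parameterization details assumed), the trick is to let the complement of the forget gate play the role of the key:

```python
import torch

def hgrn2_style_step(S, alpha, v, q):
    """One step of an HGRN2-style gated linear attention recurrence (my sketch of the idea):
    the forget gate alpha doubles as the key via its complement (1 - alpha).
    S: (d_k, d_v) state, alpha: (d_k,), v: (d_v,), q: (d_k,)."""
    S = alpha.unsqueeze(-1) * S + torch.outer(1.0 - alpha, v)   # decay state, write with k = 1 - alpha
    o = q @ S                                                   # read out: (d_v,)
    return S, o
```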
Data-dependent decay and state dimension expansion are the key for Mamba/GLA matching Transformers!🚀 Also excited to present my NeurIPS spotlight paper [] this Wednesday, which also shows the crucial role of data-dependent decay. Come and chat about RNNs!
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
paper page:
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide.
We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes
Repeat After Me
Transformers are Better than State Space Models at Copying
paper page:
Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on
As someone who specializes in syntactic parsing, I didn't take offense at all. The only reason I conduct research on this topic is because parsing algorithms are really interesting lol
When I hear "X is dead", I think "it's no longer relevant". This is largely applicable to, e.g., syntactic parsing in NLP. Not at all to semantic parsing / code prediction. You could argue that recent GPT models make huge progress on this problem, but its relevance is increasing.
@CFGeek
Section 4 of ABC had already discussed data-independent queries. With an online-softmax-based renormalized recurrent state it becomes Aaren, to the best of my knowledge.
@bailin_28
@yoonrkim
FLA aims to provide hardware-efficient implementations of modern linear recurrent models (). FLA would not exist without the effort of
@yzhang_cs
!
Linear Log-Normal Attention with Unbiased Concentration
The authors propose and test a new self-attention mechanism that improves the scalability of transformer models by emulating the original attention's distribution and concentration behavior. 4/n
@francoisfleuret
The key idea is to only materialize chunk-level hidden states, using matmuls to calculate outputs based on the query, key, and value matrices and chunk-level hidden states. This method avoids materializing the hidden state for every single timestep.
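A minimal sketch of this chunkwise form (ungated, shapes simplified) to make the idea concrete:

```python
import torch

def chunkwise_linear_attention(q, k, v, chunk=64):
    """Sketch of chunkwise linear attention: only chunk-level states are materialized;
    per-chunk outputs are computed with matmuls. q, k, v: (T, d)."""
    T, d = q.shape
    S = torch.zeros(d, d)                               # chunk-level hidden state
    out = torch.empty_like(v)
    for s in range(0, T, chunk):
        qc, kc, vc = q[s:s+chunk], k[s:s+chunk], v[s:s+chunk]
        inter = qc @ S                                  # contribution from previous chunks
        mask = torch.tril(torch.ones(len(qc), len(qc)))
        intra = ((qc @ kc.T) * mask) @ vc               # causal attention within the chunk
        out[s:s+chunk] = inter + intra
        S = S + kc.T @ vc                               # update the state once per chunk
    return out
```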
[LG] CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
A Potapczynski, M Finzi, G Pleiss, A G Wilson [New York University & CMU & University of British Columbia] (2023)
- The paper proposes CoLA (Compositional
4. We additionally impose a lower bound to push the forget gate values toward one, which is crucial for improving the RNN's long-term dependency modeling ability .
3. This work investigates the importance of forget gates for linear RNNs, reviving the Gated Impulse Linear Recurrent Layer proposed in , where forget gates depend only on the input data rather than the previous hidden state, allowing for parallel training.
1. Current linear RNNs often employ a gating mechanism on the output of the recurrence layer (e.g., Gated SSM, RWKV), which amounts to the output gate in LSTMs and has been shown to be very effective at improving linear RNNs' performance.
@jacobmbuckman
@manifest__ai
I was completely motivated by . They demonstrated promising results when increasing the head dimension, but their implementation lacked efficiency for scaling up. I haven't tried any other type of gating yet.
5. Without such lower bounds, the GILR layer is not sufficient to achieve good performance on the LRA benchmark. We also include large-scale language modeling experiments, showing that linear RNNs with data-dependent forget gates have the potential to compete with Transformers!
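A minimal sketch of the lower-bound trick on the forget gate (the exact parameterization in the paper differs; this is just the idea):

```python
import torch

def lower_bounded_forget_gate(x, W, b, lower_bound=0.9):
    """Sketch (assumed parameterization): squash a data-dependent gate into
    [lower_bound, 1), pushing its values toward one so information can be
    retained over long ranges."""
    g = torch.sigmoid(x @ W + b)                  # plain data-dependent gate in (0, 1)
    return lower_bound + (1.0 - lower_bound) * g  # rescaled into [lower_bound, 1)
```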
@CFGeek
I found it similar to a concurrent work . What surprised me the most was that the element-wise exp kernel works after all -- when you have a temperature parameter to make the attention scores spiky.
@PMinervini
APX-hard in general [], but for some special types of DAGs (e.g., 1-end-crossing, page number 2) there are efficient dynamic programming algorithms []
@MrCatid
@lambdaviking
@jowenpetty
@Ashish_S_AI
Ah, I don't think HGRN2 and GLA have nonlinearity in the recurrence, bcs their recurrence can be parallelized. The gating on the input and output is indeed nonlinear tho
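To see why removing the nonlinearity matters, a toy sketch: an element-wise linear recurrence has a closed form that can be computed with cumulative ops instead of a sequential loop (naive and not numerically robust, illustration only):

```python
import torch

def parallel_linear_recurrence(a, b):
    """h_t = a_t * h_{t-1} + b_t (with h_0 = 0) has the closed form
    h_t = A_t * sum_{s<=t} b_s / A_s where A_t = a_1 * ... * a_t,
    so it can be computed without a sequential loop. a, b: (T, d), a > 0."""
    A = torch.cumprod(a, dim=0)                 # A_t = a_1 * ... * a_t
    h = A * torch.cumsum(b / A, dim=0)          # h_t = A_t * sum_{s<=t} b_s / A_s
    return h
```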
@francoisfleuret
@dvruette
Hey, just saw the dumb/barrel RNNs – similar to our Gated Linear Attention ()! A few differences in the parameterization, but seems like our insights could apply here too. Interesting stuff!
Compute is all you need.
For a given amount of compute, ViT and ConvNets perform the same.
Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs
@CFGeek
Unsure about the arrow's direction; considering reversing it and renaming discretization to de-discretization. Linear RNNs with data-dependent decay and matrix-valued states subsume approaches like selective SSMs and linear attention with forget gates.
@CFGeek
The Mamba paper has an ablation study on this: for SSMs w/o data-dependent parameters it doesn't matter; for selective SSMs, expanding the dimension brings huge gains.
@BlancheMinerva
@CFGeek
v4 is a linear RNN w/o data-dependent decay and has a convolutional view. v6 is a linear RNN w/ data-dependent decay and a 2D matrix-valued state. v5 is like linear attn + data-independent decay, similar to RetNet.
@Kagurazaka_L
My sense is that data-dependent decay could help reduce the reliance on dimensionality -- both RetNet and TransNormerLLM perform much worse when the expansion rate is no greater than 64, while Mamba only needs an expansion rate of 16 -- that saves memory for inference!
For context, TNN is very similar to the Hyena model and belongs to the gated-convolution family. HGRN is a linear RNN with data-dependent decay but without hidden dimension expansion, while Mamba includes state expansion.
TransnormerLLM is very similar to the RetNet model: both are linear attention models with a fixed decay rate. In this regard, they can be considered linear RNNs with a very large (two-dimensional) hidden state, yet without data-dependent decay.
@AlbalakAlon
Yep! I noticed that RWKV-v6 has a close relationship to our latest model GLA: , both of which employ data-dependent decay and outer-product-based state expansion.
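Sketch of the shared recurrence (up to parameterization differences): a data-dependent decay on a matrix-valued state, expanded by an outer product of key and value:

```python
import torch

def gla_style_step(S, q, k, v, alpha):
    """One step of a GLA-style recurrence (sketch; RWKV-v6 and GLA differ in how
    q, k, v, alpha are produced). S: (d_k, d_v); q, k, alpha: (d_k,); v: (d_v,)."""
    S = alpha.unsqueeze(-1) * S + torch.outer(k, v)   # data-dependent decay, then write k v^T
    o = q @ S                                         # read out: (d_v,)
    return S, o
```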
@__peterwolf__
@manifest__ai
had analyzed the instability issue of linear xfmrs, suggesting removing the denominator and applying LayerNorm to the output, which is adopted by RetNet and GLA.
Infini-gram
Scaling Unbounded n-gram Language Models to a Trillion Tokens
demo:
paper page:
train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing
@__peterwolf__
@manifest__ai
In addition, there is a decay factor in front of the hidden state, so the hidden state norm will not explode over time. Though I haven't tried larger context sizes yet, I don't think instability will be an issue as it was for many early linear transformers.
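Putting the two points together, a toy sketch of the recipe (not the exact RetNet/GLA parameterization): decayed state, no softmax denominator, LayerNorm on the output:

```python
import torch

def stable_linear_attention(q, k, v, gamma=0.99):
    """Sketch of the stabilization recipe discussed above: a decay gamma < 1 keeps
    the state norm bounded, and output LayerNorm replaces the normalizer.
    q, k, v: (T, d)."""
    T, d = q.shape
    ln = torch.nn.LayerNorm(d)
    S = torch.zeros(d, d)
    out = []
    for t in range(T):
        S = gamma * S + torch.outer(k[t], v[t])    # decayed state update, no denominator
        out.append(ln(q[t] @ S))                   # LayerNorm on the read-out
    return torch.stack(out)
```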
I was shocked to learn that Dr. Jian Sun, my former colleague in the MSRA Visual Computing Group, has passed away. We will miss him dearly. May his soul rest in peace.
Recent Transformer LMs are very good at using their context for predicting new tokens.
How does this capability arise during training?
We study this in our paper "Birth of a Transformer: A Memory Viewpoint" 🐣
Hyena Hierarchy: Towards Larger Convolutional Language Models
propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating
abs:
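My rough reading of the recipe, as a toy sketch (order 2, shapes simplified; the real filters are implicitly parameterized):

```python
import torch

def hyena_style_mixer(x, filters, proj):
    """Rough sketch of the Hyena-style operator: project the input into one value branch
    and two gate branches, then alternate long convolutions with element-wise
    (data-controlled) gating. x: (T, d); filters: two (T, d) tensors standing in for the
    implicitly parameterized long filters; proj: torch.nn.Linear(d, 3*d)."""
    v, g1, g2 = proj(x).chunk(3, dim=-1)
    z = v
    T = x.shape[0]
    for gate, h in zip((g1, g2), filters):
        # causal long convolution along time via FFT (padded to 2T to avoid wrap-around)
        zf = torch.fft.rfft(z, n=2 * T, dim=0)
        hf = torch.fft.rfft(h, n=2 * T, dim=0)
        z = torch.fft.irfft(zf * hf, n=2 * T, dim=0)[:T]
        z = gate * z                                   # data-controlled gating
    return z
```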
@dpkingma
RWKV can be described as a gated diagonal state space model, but token shift is a crucial addition that sets it apart. This feature is similar to the "shift SSM" found in H3 and short local convolutions in Hyena.
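Token shift is tiny: each position is interpolated with the previous one, i.e., a size-2 causal depthwise convolution (sketch below; per-channel mixing weights assumed learned):

```python
import torch

def token_shift(x, mu):
    """Sketch of RWKV-style token shift: mix each token with the previous token,
    equivalent to a size-2 causal depthwise convolution. x: (T, d); mu: (d,)."""
    x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)   # shift right by one step
    return mu * x + (1.0 - mu) * x_prev
```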