Impressed by the performance of Mamba and a believer in RNNs? We provide a simple alternative! Excited to share Gated Linear Attention (GLA-Transformer). (1/n)
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
- A linear-time RNN that updates its hidden state with a gradient step at every token, i.e., test-time training
- Achieves better perplexity than Mamba
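Roughly, the idea (a minimal sketch, not the paper's exact parameterization): the hidden state is a small model whose weights get one gradient step per token on a self-supervised loss.

```python
import torch

def ttt_layer(tokens, lr=0.1):
    """Toy sketch of test-time training. The 'hidden state' is a weight matrix W,
    updated by one gradient step per token on a simple reconstruction loss
    (the paper's actual objective and parameterization differ)."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                      # hidden state = weights of a linear model
    outputs = []
    for x in tokens:                           # x: (d,)
        pred = W @ x
        grad = torch.outer(pred - x, x)        # grad of 0.5 * ||W x - x||^2 w.r.t. W
        W = W - lr * grad                      # "learning at test time"
        outputs.append(W @ x)                  # read out with the updated state
    return torch.stack(outputs)
```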
Google announces TransformerFAM
Feedback attention is working memory
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel
HGRN2
Gated Linear RNNs with State Expansion
Hierarchically gated linear RNN (HGRN,Qin et al. 2023) has demonstrated competitive training speed and performance in language modeling, while offering efficient inference. However, the recurrent state size of HGRN remains
S4D, S5, RWKV & DeepMind's diagonal linear RNN = absorbed by . Paper really underrated. Imagine if ELMo had used this parallelized RNN for training - Transformers wouldn't rule anymore lol.
Thrilled about recent linear attn developments like RetNet, Based, GLA, GateLoop, RWKV-v5/6, TransnormerLLM etc.! Linear xfmrs have massive potential. Check out my repo for these models' efficient implementations. It's a work in progress.
Anyone who has trained a Transformer has viscerally felt its O(T^2) cost. It is not tractable to train Transformers end-to-end on long contexts. Here's a writeup of the research direction I believe is most likely to solve this: linear transformers.
1/7
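For context, a minimal sketch of the core idea behind linear transformers: drop the softmax and causal attention becomes a linear-time recurrence with a fixed-size state (the feature map and normalizer here are illustrative choices, not any particular paper's):

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Sketch: causal linear attention as an RNN. q, k, v: (T, d).
    O(T) time with a fixed-size (d x d) state instead of the O(T^2) attention matrix."""
    T, d = q.shape
    phi = lambda x: F.elu(x) + 1                     # one common positive feature map
    S = torch.zeros(d, d)                            # running sum of phi(k_s) v_s^T
    z = torch.zeros(d)                               # running sum of phi(k_s) (normalizer)
    out = []
    for t in range(T):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        out.append((phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-6))
    return torch.stack(out)
```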
We've just benchmarked leading efficient large language models against LLaMA, including MAMBA, TransNormerLLM, HGRN and TNN, focusing on training loss and speed on a 30B subset of our own collected corpus, with models up to 3B parameters.
As shown, TransNormerLLM and MAMBA achieve slightly
I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: .
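As I understand it (a rough sketch, parameterization details assumed), the trick is to let the complement of the forget gate play the role of the key:

```python
import torch

def hgrn2_style_step(S, alpha, v, q):
    """One step of an HGRN2-style gated linear attention recurrence (my sketch of the idea):
    the forget gate alpha doubles as the key via its complement (1 - alpha).
    S: (d_k, d_v) state, alpha: (d_k,), v: (d_v,), q: (d_k,)."""
    S = alpha.unsqueeze(-1) * S + torch.outer(1.0 - alpha, v)   # decay state, write with k = 1 - alpha
    o = q @ S                                                   # read out: (d_v,)
    return S, o
```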
Data-dependent decay and state dimension expansion are the key for Mamba/GLA matching Transformers!🚀 Also excited to present my NeurIPS spotlight paper [] this Wednesday, which also shows the crucial role of data-dependent decay. Come and chat about RNNs!
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
paper page:
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide.
We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes
Repeat After Me
Transformers are Better than State Space Models at Copying
paper page:
Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on
As someone who specializes in syntactic parsing, I didn't take offense at all. The only reason I conduct research on this topic is because parsing algorithms are really interesting lol
When I hear "X is dead", I think "it's no longer relevant". This is largely applicable to, e.g., syntactic parsing in NLP. Not at all to semantic parsing / code prediction. You could argue that recent GPT models make huge progress on this problem, but its relevance is increasing.
@CFGeek
Section 4 of ABC had already discussed data-independent queries. With an online-softmax-based renormalized recurrent state it becomes Aaren, to the best of my knowledge.
@bailin_28
@yoonrkim
FLA aims to provide hardware-efficient implementations of modern linear recurrent models (). FLA would not exist without the effort of
@yzhang_cs
!
Linear Log-Normal Attention with Unbiased Concentration
The authors propose and test a new self-attention mechanism that improves the scalability of transformer models by emulating the original attention's distribution and concentration behavior. 4/n
@francoisfleuret
The key idea is to only materialize chunk-level hidden states, using matmuls to calculate outputs based on the query, key, and value matrices and chunk-level hidden states. This method avoids materializing the hidden state for every single timestep.
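A minimal sketch of this chunkwise form (ungated, shapes simplified) to make the idea concrete:

```python
import torch

def chunkwise_linear_attention(q, k, v, chunk=64):
    """Sketch of chunkwise linear attention: only chunk-level states are materialized;
    per-chunk outputs are computed with matmuls. q, k, v: (T, d)."""
    T, d = q.shape
    S = torch.zeros(d, d)                               # chunk-level hidden state
    out = torch.empty_like(v)
    for s in range(0, T, chunk):
        qc, kc, vc = q[s:s+chunk], k[s:s+chunk], v[s:s+chunk]
        inter = qc @ S                                  # contribution from previous chunks
        mask = torch.tril(torch.ones(len(qc), len(qc)))
        intra = ((qc @ kc.T) * mask) @ vc               # causal attention within the chunk
        out[s:s+chunk] = inter + intra
        S = S + kc.T @ vc                               # update the state once per chunk
    return out
```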
[LG] CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
A Potapczynski, M Finzi, G Pleiss, A G Wilson [New York University & CMU & University of British Columbia] (2023)
- The paper proposes CoLA (Compositional
4. We additionally impose a lower bound to push the forget gate values toward one, which is crucial for improving the RNN's long-term dependency modeling ability .
3. This work investigates the importance of forget gates for linear RNNs, reviving the Gated Impulse Linear Recurrent Layer proposed in , where forget gates depend only on the input data rather than the previous hidden state, allowing for parallel training.
1. Current linear RNNs often employ a gating mechanism on the output of the recurrence layer (e.g., Gated SSM, RWKV), which amounts to the output gate in LSTMs and has been shown to be very effective at improving linear RNNs' performance.
@jacobmbuckman
@manifest__ai
I was completely motivated by . They demonstrated promising results when increasing the head dimension, but their implementation lacked efficiency for scaling up. I haven't tried any other type of gating yet.
5. Without such lower bounds, the GILR layer is not sufficient to achieve good performance on the LRA benchmark. We also include large-scale language modeling experiments, showing that linear RNNs with data-dependent forget gates have the potential to compete with Transformers!
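A minimal sketch of the lower-bound trick on the forget gate (the exact parameterization in the paper differs; this is just the idea):

```python
import torch

def lower_bounded_forget_gate(x, W, b, lower_bound=0.9):
    """Sketch (assumed parameterization): squash a data-dependent gate into
    [lower_bound, 1), pushing its values toward one so information can be
    retained over long ranges."""
    g = torch.sigmoid(x @ W + b)                  # plain data-dependent gate in (0, 1)
    return lower_bound + (1.0 - lower_bound) * g  # rescaled into [lower_bound, 1)
```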
@CFGeek
I found it similar to a concurrent work . What surprised me the most was that the element-wise exp kernel works after all -- when you have a temperature parameter to make the attention scores spiky.
@PMinervini
APX-hard in general [], but for some special types of DAGs (e.g., 1-end-crossing, page number 2) there are efficient dynamic programming algorithms []
@MrCatid
@lambdaviking
@jowenpetty
@Ashish_S_AI
Ah, I don't think HGRN2 and GLA have nonlinearity in the recurrence, bcs their recurrence can be parallelized. The gating on the input and output is indeed nonlinear tho
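To see why removing the nonlinearity matters, a toy sketch: an element-wise linear recurrence has a closed form that can be computed with cumulative ops instead of a sequential loop (naive and not numerically robust, illustration only):

```python
import torch

def parallel_linear_recurrence(a, b):
    """h_t = a_t * h_{t-1} + b_t (with h_0 = 0) has the closed form
    h_t = A_t * sum_{s<=t} b_s / A_s where A_t = a_1 * ... * a_t,
    so it can be computed without a sequential loop. a, b: (T, d), a > 0."""
    A = torch.cumprod(a, dim=0)                 # A_t = a_1 * ... * a_t
    h = A * torch.cumsum(b / A, dim=0)          # h_t = A_t * sum_{s<=t} b_s / A_s
    return h
```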
@francoisfleuret
@dvruette
Hey, just saw the dumb/barrel RNNs – similar to our Gated Linear Attention ()! A few differences in the parameterization, but seems like our insights could apply here too. Interesting stuff!
Compute is all you need.
For a given amount of compute, ViT and ConvNets perform the same.
Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs
@CFGeek
Unsure about the arrow's direction; considering reversing it and renaming discretization to de-discretization. Linear RNNs with data-dependent decay and matrix-valued states subsume approaches like selective SSMs and linear attention with forget gates.
@CFGeek
The Mamba paper has an ablation study on this: for SSMs w/o data-dependent parameters it doesn't matter; for selective SSMs, expanding the dimension brings huge gains.
@BlancheMinerva
@CFGeek
v4 is a linear RNN w/o data-dependent decay and has a convolutional view. v6 is a linear RNN w/ data-dependent decay and a 2D matrix-valued state. v5 is like linear attn + data-independent decay, similar to RetNet.
@Kagurazaka_L
My sense is that data-dependent decay could help reduce the reliance on dimensionality -- both RetNet and TransNormerLLM perform much worse when the expansion rate is no greater than 64, while Mamba only needs an expansion rate of 16 -- that saves memory for inference!
For context, TNN is very similar to the Hyena model and belongs to the gated-convolution family. HGRN is a linear RNN with data-dependent decay but without hidden dimension expansion, while Mamba includes state expansion.
TransnormerLLM is very similar to the RetNet model: both are linear attention models with a fixed decay rate. In this regard, they can be considered linear RNNs with a very large (two-dimensional) hidden state, yet without data-dependent decay.
@AlbalakAlon
Yep! I noticed that RWKV-v6 has a close relationship to our latest model GLA: , both of which employ data-dependent decay and outer-product-based state expansion.
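Sketch of the shared recurrence (up to parameterization differences): a data-dependent decay on a matrix-valued state, expanded by an outer product of key and value:

```python
import torch

def gla_style_step(S, q, k, v, alpha):
    """One step of a GLA-style recurrence (sketch; RWKV-v6 and GLA differ in how
    q, k, v, alpha are produced). S: (d_k, d_v); q, k, alpha: (d_k,); v: (d_v,)."""
    S = alpha.unsqueeze(-1) * S + torch.outer(k, v)   # data-dependent decay, then write k v^T
    o = q @ S                                         # read out: (d_v,)
    return S, o
```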
@__peterwolf__
@manifest__ai
had analyzed the instability issue of linear xfmrs, suggesting removing the denominator and applying LayerNorm to the output, which is adopted by RetNet and GLA.
Infini-gram
Scaling Unbounded n-gram Language Models to a Trillion Tokens
demo:
paper page:
train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing
@__peterwolf__
@manifest__ai
In addition, there is a decay factor in front of the hidden state, so the hidden state norm will not explode over time. Though I haven't tried larger context sizes yet, I don't think instability will be an issue as it was for many early linear transformers.
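Putting the two points together, a toy sketch of the recipe (not the exact RetNet/GLA parameterization): decayed state, no softmax denominator, LayerNorm on the output:

```python
import torch

def stable_linear_attention(q, k, v, gamma=0.99):
    """Sketch of the stabilization recipe discussed above: a decay gamma < 1 keeps
    the state norm bounded, and output LayerNorm replaces the normalizer.
    q, k, v: (T, d)."""
    T, d = q.shape
    ln = torch.nn.LayerNorm(d)
    S = torch.zeros(d, d)
    out = []
    for t in range(T):
        S = gamma * S + torch.outer(k[t], v[t])    # decayed state update, no denominator
        out.append(ln(q[t] @ S))                   # LayerNorm on the read-out
    return torch.stack(out)
```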
I was shocked to learn that Dr. Jian Sun, my former colleague in the MSRA Visual Computing Group, has passed away. We will miss him dearly. May his soul rest in peace.
Recent Transformer LMs are very good at using their context for predicting new tokens.
How does this capability arise during training?
We study this in our paper "Birth of a Transformer: A Memory Viewpoint" 🐣
Hyena Hierarchy: Towards Larger Convolutional Language Models
propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating
abs:
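My rough reading of the recipe, as a toy sketch (order 2, shapes simplified; the real filters are implicitly parameterized):

```python
import torch

def hyena_style_mixer(x, filters, proj):
    """Rough sketch of the Hyena-style operator: project the input into one value branch
    and two gate branches, then alternate long convolutions with element-wise
    (data-controlled) gating. x: (T, d); filters: two (T, d) tensors standing in for the
    implicitly parameterized long filters; proj: torch.nn.Linear(d, 3*d)."""
    v, g1, g2 = proj(x).chunk(3, dim=-1)
    z = v
    T = x.shape[0]
    for gate, h in zip((g1, g2), filters):
        # causal long convolution along time via FFT (padded to 2T to avoid wrap-around)
        zf = torch.fft.rfft(z, n=2 * T, dim=0)
        hf = torch.fft.rfft(h, n=2 * T, dim=0)
        z = torch.fft.irfft(zf * hf, n=2 * T, dim=0)[:T]
        z = gate * z                                   # data-controlled gating
    return z
```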
@dpkingma
RWKV can be described as a gated diagonal state space model, but token shift is a crucial addition that sets it apart. This feature is similar to the "shift SSM" found in H3 and short local convolutions in Hyena.
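Token shift is tiny: each position is interpolated with the previous one, i.e., a size-2 causal depthwise convolution (sketch below; per-channel mixing weights assumed learned):

```python
import torch

def token_shift(x, mu):
    """Sketch of RWKV-style token shift: mix each token with the previous token,
    equivalent to a size-2 causal depthwise convolution. x: (T, d); mu: (d,)."""
    x_prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]], dim=0)   # shift right by one step
    return mu * x + (1.0 - mu) * x_prev
```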