Songlin Yang Profile
Songlin Yang

@SonglinYang4

2,498
Followers
2,115
Following
7
Media
1,302
Statuses

PhD student @MIT_CSAIL , interning @NVIDIA . Prev. @ShanghaiTechUni @SUSTechSZ . Working on scalable and principled methods in #ML & #NLProc . she/her/hers

Cambridge
Joined January 2021
Pinned Tweet
@SonglinYang4
Songlin Yang
18 days
#ICML2024 Come check out our poster on Tuesday, July 23, from 11:30am to 1pm (Hall C 4-9, #500 ) @bailin_28 and @yoonrkim will be presenting GLA there 🥰🥰
Tweet media one
@Yikang_Shen
Yikang Shen
8 months
Impressed by the performance of Mamba and believe in RNN? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n)
Tweet media one
6
67
391
6
29
183
@SonglinYang4
Songlin Yang
1 month
Online gradient descent version of TTT-linear is a variant of DeltaNet and could be parallelized efficiently:
@arankomatsuzaki
Aran Komatsuzaki
1 month
Learning to (Learn at Test Time): RNNs with Expressive Hidden States - Performs Linear-time RNN by propagating the gradient to the next step, i.e., test-time training - Achieves better perplexity than Mamba
Tweet media one
7
81
418
8
44
333
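
A minimal sketch of the DeltaNet-style update the tweet above alludes to, treating each step as one online gradient-descent step on the regression loss ||S k_t - v_t||^2. The names (S, k, v, beta) and shapes are illustrative assumptions, not code from the TTT or DeltaNet repositories, and the sequential loop below stands in for the parallelized chunkwise form mentioned in the tweet.

import numpy as np

# Hedged sketch: DeltaNet-style recurrence viewed as online gradient descent
# on the per-step loss ||S k_t - v_t||^2; names and shapes are illustrative.
def deltanet_step(S, k, v, beta):
    # S: (d_v, d_k) state; k: (d_k,) key; v: (d_v,) value; beta: scalar step size
    pred = S @ k                              # current prediction for this key
    return S + beta * np.outer(v - pred, k)   # S <- S - beta * (S k - v) k^T

def deltanet_sequential(keys, values, betas, d_k, d_v):
    # Sequential reference loop; the tweet's point is that this recurrence
    # also admits an efficient chunkwise-parallel form.
    S = np.zeros((d_v, d_k))
    outs = []
    for k, v, b in zip(keys, values, betas):
        S = deltanet_step(S, k, v, b)
        outs.append(S @ k)                    # read out with the key as query, for illustration
    return np.stack(outs), S
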
@SonglinYang4
Songlin Yang
4 months
I really like this section
Tweet media one
@_akhaliq
AK
4 months
Google announces TransformerFAM Feedback attention is working memory While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel
Tweet media one
14
145
787
1
19
229
@SonglinYang4
Songlin Yang
1 month
🥳 HGRN2 has been accepted to COLM 2024! It's a minimalist, performant and hardware-efficient gated RNN model. See you in Philly!
@_akhaliq
AK
4 months
HGRN2 Gated Linear RNNs with State Expansion Hierarchically gated linear RNN (HGRN,Qin et al. 2023) has demonstrated competitive training speed and performance in language modeling, while offering efficient inference. However, the recurrent state size of HGRN remains
Tweet media one
1
27
79
10
14
160
@SonglinYang4
Songlin Yang
1 year
S4D, S5, RWKV & DeepMind's diagonal linear RNN = absorbed by . Paper really underrated. Imagine if ELMo had used this parallelized RNN to train - Transformers wouldn't rule anymore lol.
1
24
152
@SonglinYang4
Songlin Yang
7 months
Thrilled about recent linear attn developments like RetNet, Based, GLA, GateLoop, RWKV-v5/6, TransnormerLLM etc.! Linear xfmrs have massive potential. Check out my repo for these models' efficient implementations. It's a work in progress
@jacobmbuckman
Jacob Buckman
7 months
Anyone who has trained a Transformer has viscerally felt its O(T^2) cost. It is not tractable to train Transformers end-to-end on long contexts. Here's a writeup of the research direction I believe is most likely to solve this: linear transformers. 1/7
Tweet media one
4
27
165
2
17
120
@SonglinYang4
Songlin Yang
8 months
gated convolution < linear RNN w/ dynamic decay < linear RNN w/ dynamic decay + state expansion ~= linear RNN w/ static decay + large state expansion ~= LLaMA
@opennlplab
OpenNLPLab
8 months
We've just benchmarked leading efficient large language models against LLaMA, including MAMBA, TransNormerLLM, HGRN and TNN, focusing on the training loss and speed on 30B subset of our own collected corpus up to 3B parameters. As shown, TransNormerLLM and MAMBA achieve slightly
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
17
98
3
9
102
@SonglinYang4
Songlin Yang
6 months
Interesting work! We've observed similar phenomena. Mamba trained on longer ctx (here 8192) extrapolates much better
Tweet media one
4
8
80
@SonglinYang4
Songlin Yang
9 months
Happy to share our NeurIPS'23 paper , which is exactly based on this underrated paper!
@SonglinYang4
Songlin Yang
1 year
S4D, S5, RWKV & DeepMind's diagonal linear RNN = absorbed by . Paper really underrated. Imagine if ELMo had used this parallelized RNN to train - Transformers wouldn't rule anymore lol.
1
24
152
2
21
72
@SonglinYang4
Songlin Yang
4 months
I really like HGRN2's concept of integrating forget gates with keys in (gated) linear attention. It's a neat and effective approach! We've incorporated HGRN2 into the Flash-Linear-Attention library. Check it out here: .
@darkproger
Volodymyr Kyrylov
4 months
Love these state expansion ablations in HGRN2, the outer product with a tied forget gate is super elegant and should work well on a TPU. Amazing job!
Tweet media one
0
5
22
1
14
67
@SonglinYang4
Songlin Yang
8 months
Data-dependent decay and state dimension expansion are key for Mamba/GLA to match Transformers!🚀 Also excited to present my NeurIPS spotlight paper [] this Wednesday, which also shows the crucial role of data-dependent decay. Come and chat about RNNs!
@Yikang_Shen
Yikang Shen
8 months
Impressed by the performance of Mamba and believe in RNN? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n)
Tweet media one
6
67
391
1
13
56
@SonglinYang4
Songlin Yang
8 months
game-changer
@_akhaliq
AK
8 months
Mamba: Linear-Time Sequence Modeling with Selective State Spaces paper page: Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module.
Tweet media one
7
106
534
0
3
39
@SonglinYang4
Songlin Yang
5 months
@srush_nlp I previously implemented mamba using both sequential scan and parallel scan in triton . The sequential scan is faster lol
1
3
39
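
For context on the sequential-vs-parallel comparison in the reply above: a diagonal state-space recurrence reduces to h_t = a_t * h_{t-1} + b_t, and composing the affine maps h -> a*h + b is associative, which is what makes a parallel scan possible. The NumPy sketch below is illustrative only and is not the Triton kernels referenced in the reply; both functions compute the same prefix recurrence.

import numpy as np

# Hedged sketch of the recurrence behind both scans: h_t = a_t * h_{t-1} + b_t.
def combine(earlier, later):
    a1, b1 = earlier
    a2, b2 = later
    return a1 * a2, a2 * b1 + b2          # apply the earlier map, then the later one

def sequential_scan(a, b):
    h = np.zeros_like(b[0])
    outs = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        outs.append(h)
    return np.stack(outs)

def parallel_scan_naive(a, b):
    # Hillis-Steele-style inclusive scan over the affine maps (O(T log T) work,
    # O(log T) depth); shown only to illustrate the associative structure.
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], e) if i >= step else e
                 for i, e in enumerate(elems)]
        step *= 2
    return np.stack([b_t for _, b_t in elems])   # h_t assuming a zero initial state
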
@SonglinYang4
Songlin Yang
3 months
Super cool!
@bfspector
Benjamin F Spector
3 months
(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide. We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes
Tweet media one
20
160
906
0
0
30
@SonglinYang4
Songlin Yang
7 months
@CFGeek had a try
Tweet media one
2
4
28
@SonglinYang4
Songlin Yang
10 months
Exactly
Tweet media one
Tweet media two
1
4
25
@SonglinYang4
Songlin Yang
8 months
wow amazing
@hyhieu226
Hieu Pham
8 months
An astonishingly good tutorial and annotation of Flash Attention code. Recommended for anyone who wants to become better at neural net programming.
1
90
632
0
4
21
@SonglinYang4
Songlin Yang
6 months
Very glad to see people using FLA's kernel for research! Check out FLA repo for more implementations
@nlp_ceo
Nikita Balagansky
6 months
@simran_s_arora @_albertgu @tri_dao @RWKV_AI Read the full paper by the ! The source code is available by the . Special thanks to @SonglinYang4 for the fast triton kernels!
1
2
13
0
1
20
@SonglinYang4
Songlin Yang
7 months
oh the referenced paper is super interesting!
@justintchiu
Justin T Chiu
7 months
wrote a short note on using parallel scans for backprop: turns out there was already a paper on this too!
0
13
67
1
0
19
@SonglinYang4
Songlin Yang
6 months
I like this title a lot
@_akhaliq
AK
6 months
Repeat After Me Transformers are Better than State Space Models at Copying paper page: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on
Tweet media one
17
85
698
1
1
16
@SonglinYang4
Songlin Yang
6 months
Happy lunar new year lol
@_AruEkusu_
アールエックス
6 months
Tweet media one
0
15
93
1
2
16
@SonglinYang4
Songlin Yang
1 year
As someone who specializes in syntactic parsing, I didn't take offense at all. The only reason I conduct research on this topic is because parsing algorithms are really interesting lol
@nlpmattg
Matt Gardner
1 year
When I hear "X is dead", I think "it's no longer relevant". This is largely applicable to, e.g., syntactic parsing in NLP. Not at all to semantic parsing / code prediction. You could argue that recent GPT models make huge progress on this problem, but its relevance is increasing.
3
0
29
2
0
16
@SonglinYang4
Songlin Yang
5 months
Magnificent
@ZeYanjie
Ze
5 months
Thank you AK @_akhaliq for sharing our work!😺 For more cool results and videos, see our project website:
0
0
20
2
0
14
@SonglinYang4
Songlin Yang
7 months
@CFGeek this is exactly what forget gate does
1
0
12
@SonglinYang4
Songlin Yang
3 months
@CFGeek Sect. 4 of ABC had already discussed data-independent queries. With an online-softmax-based renormalized recurrent state it becomes Aaren, to the best of my knowledge.
1
1
12
@SonglinYang4
Songlin Yang
2 years
@aryaman2020 check this one
1
0
12
@SonglinYang4
Songlin Yang
3 months
@jo_brandstetter Great work! Any plans to release code? Interested in reproducing it
1
0
9
@SonglinYang4
Songlin Yang
3 months
@aaltomediaai Not surprising. Already shown in our GLA paper
1
0
10
@SonglinYang4
Songlin Yang
18 days
@bailin_28 @yoonrkim If you are interested in it, I previously made a set of slides introducing FLA and GLA in detail:
1
1
9
@SonglinYang4
Songlin Yang
7 months
@s_mejjoute I had a Triton-based implementation in for those who wanna play with Taylor-series linear attn
1
1
9
@SonglinYang4
Songlin Yang
8 months
@srush_nlp with data dependent params they are no longer convolutions haha
1
0
9
@SonglinYang4
Songlin Yang
11 months
This blog is AMAZING!
@Si_Boehm
Simon Boehm
2 years
I wrote the most naive CUDA matrix multiply and iteratively optimised it to ~80% of cuBLAS performance:
16
158
1K
1
2
9
@SonglinYang4
Songlin Yang
6 months
Very interesting analyses.
@y0b1byte
yobibyte
9 months
Linear Log-Normal Attention with Unbiased Concentration The authors propose and test a new self-attention mechanism that improves the scalability of transformer models by emulating the original attention's distribution and concentration behavior. 4/n
1
2
10
1
2
8
@SonglinYang4
Songlin Yang
3 months
@francoisfleuret The key idea is to only materialize chunk-level hidden states, using matmuls to calculate outputs based on the query, key, and value matrices and chunk-level hidden states. This method avoids materializing the hidden state for every single timestep.
2
0
8
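
A rough NumPy illustration of the chunk-level idea described in the reply above, shown for plain (ungated) linear attention; GLA adds data-dependent decay on top of this. Function and variable names are assumptions for illustration, not the FLA kernels.

import numpy as np

# Hedged sketch: chunkwise linear attention o_t = (sum_{s<=t} v_s k_s^T) q_t,
# materializing the (d_v, d_k) state only once per chunk instead of per timestep.
def chunkwise_linear_attention(Q, K, V, chunk=64):
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))                  # inter-chunk state
    outs = []
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        inter = q @ S.T                       # contribution from all previous chunks
        A = np.tril(q @ k.T)                  # causal intra-chunk attention via matmul
        intra = A @ v
        outs.append(inter + intra)
        S = S + v.T @ k                       # update the state once per chunk
    return np.concatenate(outs, axis=0)
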
@SonglinYang4
Songlin Yang
2 months
@fredahshi Congrats Dr Meow
1
0
7
@SonglinYang4
Songlin Yang
11 months
looks exciting
@fly51fly
fly51fly
11 months
[LG] CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra A Potapczynski, M Finzi, G Pleiss, A G Wilson [New York University & CMU & University of British Columbia] (2023) - The paper proposes CoLA (Compositional
Tweet media one
Tweet media two
Tweet media three
Tweet media four
0
4
6
0
0
7
@SonglinYang4
Songlin Yang
7 months
really good slides
0
1
6
@SonglinYang4
Songlin Yang
9 months
3. This work investigates the importance of forget gates for linear RNNs, reviving the Gated Impulse Linear Recurrent (GILR) layer proposed in , where forget gates depend only on the input data rather than the previous hidden state, allowing for parallel training
1
0
6
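
A minimal sketch of the input-dependent forget gate described in point 3 above, i.e., a GILR/HGRN-style elementwise gated linear recurrence. The weights W_f and W_i are illustrative placeholders, and the sequential loop stands in for the parallel scan that input-only gating makes possible.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch of an input-dependent forget gate (GILR/HGRN-style).
# W_f, W_i are illustrative (d, d) weight matrices, not the paper's parameterization.
def gated_linear_recurrence(X, W_f, W_i):
    F = sigmoid(X @ W_f)                      # forget gates depend only on the input x_t
    Z = X @ W_i                               # candidate inputs
    h = np.zeros(W_i.shape[1])
    outs = []
    for f_t, z_t in zip(F, Z):
        h = f_t * h + (1.0 - f_t) * z_t       # linear in h -> trainable with a parallel scan
        outs.append(h)
    return np.stack(outs)
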
@SonglinYang4
Songlin Yang
4 months
@lambdaviking @jowenpetty @Ashish_S_AI Why does S6 have convolution forms (Eq4) when mamba already has data-dependent transitions?
1
0
6
@SonglinYang4
Songlin Yang
9 months
1. Current linear RNNs often employ a gating mechanism at the output of the recurrence layer (e.g., Gated SSM, RWKV), which amounts to the output gate in LSTMs and has been shown to be very effective at improving linear RNNs' performance.
1
0
6
@SonglinYang4
Songlin Yang
3 months
Tweet media one
0
0
6
@SonglinYang4
Songlin Yang
7 months
@jacobmbuckman @manifest__ai I was completely motivated by . They demonstrated promising results when increasing the head dimension, but their implementation lacked efficiency for scaling up. I haven't tried any other type of gating yet.
0
0
6
@SonglinYang4
Songlin Yang
3 months
0
0
6
@SonglinYang4
Songlin Yang
6 months
For context, each line is a 1.3B model trained for 100B tokens on SlimPajama. Re: TBPTT, we use 12 chunks with a chunk size of 2048 each.
0
0
6
@SonglinYang4
Songlin Yang
9 months
5. Without such lower bounds, the GILR layer is not sufficient for good performance on the LRA benchmark. We also include large-scale language modeling experiments, showing that linear RNNs with data-dependent forget gates have the potential to compete with Transformers!
1
0
5
@SonglinYang4
Songlin Yang
1 year
@RespectToX For me RWKV is a gated diagonal state space model
1
0
5
@SonglinYang4
Songlin Yang
6 months
@CFGeek I found it similar to a concurrent work . What surprised me the most was that the element-wise exp kernel works after all -- when you have a temperature parameter to make the attn scores spiky.
1
0
4
@SonglinYang4
Songlin Yang
1 year
@h2ruk1 Congrats on discovering the true essence of Unsupervised Tree Induction (just kidding)
0
0
4
@SonglinYang4
Songlin Yang
7 months
@CFGeek This article is also very beautiful :)
0
0
4
@SonglinYang4
Songlin Yang
9 months
I really like this paper :)
@nsaphra
Naomi Saphra
9 months
I did it
3
6
114
0
0
4
@SonglinYang4
Songlin Yang
1 year
@RespectToX R is similar to a gate, and W is similar to eigenvalues.
0
0
4
@SonglinYang4
Songlin Yang
4 months
@MrCatid @lambdaviking @jowenpetty @Ashish_S_AI Ah, I don't think HGRN2 and GLA have nonlinearity in the recurrence because their recurrence can be parallelized. The gating at the input and output is indeed nonlinear tho
1
0
4
@SonglinYang4
Songlin Yang
7 months
@francoisfleuret @dvruette Hey, just saw the dumb/barrel RNNs – similar to our Gated Linear Attention ()! A few differences in the parameterization, but seems like our insights could apply here too. Interesting stuff!
1
0
4
@SonglinYang4
Songlin Yang
10 months
ah remind me of
@ylecun
Yann LeCun
10 months
Compute is all you need. For a given amount of compute, ViT and ConvNets perform the same. Quote from this DeepMind article: "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs
82
315
2K
0
0
4
@SonglinYang4
Songlin Yang
5 months
@NekosSelina So cute!
0
0
4
@SonglinYang4
Songlin Yang
7 months
@CFGeek Unsure about the arrow's direction; considering reversing it and renaming discretization to de-discretization. Linear RNNs with data-dependent decay and matrix-valued states subsume approaches like selective SSMs and linear attention with a forget gate
1
0
3
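
A rough common form of the family described above, with notation chosen for illustration rather than taken from any one paper: a matrix-valued state with data-dependent diagonal decay covers gated linear attention (per-channel alpha_t), RetNet/TransNormer-style models (constant alpha), and selective SSMs (alpha derived from a discretized step size).

% Illustrative shared recurrence (notation mine); S_t is a d_k-by-d_v matrix-valued state.
\[
  S_t = \operatorname{Diag}(\alpha_t)\, S_{t-1} + k_t v_t^{\top},
  \qquad
  o_t = S_t^{\top} q_t,
\]
% with \alpha_t \in (0,1)^{d_k} and k_t, v_t, q_t all computed from the input x_t;
% a constant \alpha recovers RetNet-style decay, while a data-dependent \alpha_t
% resembles GLA or a selective SSM.
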
@SonglinYang4
Songlin Yang
7 months
@HanGuo97 Congrats! 🥰
1
0
3
@SonglinYang4
Songlin Yang
5 months
@liliang_ren congrats 🎉
0
0
3
@SonglinYang4
Songlin Yang
7 months
@CFGeek The Mamba paper has an ablation study on this: for SSMs w/o data-dependent parameters it doesn't matter, while for selective SSMs, expanding the dimension brings huge gains.
Tweet media one
0
0
3
@SonglinYang4
Songlin Yang
7 months
@CFGeek Mamba has NxD matrix-valued states
1
0
3
@SonglinYang4
Songlin Yang
1 year
@UndefBehavior Haha, let's turn the parsing track into an algorithm track!
0
0
3
@SonglinYang4
Songlin Yang
7 months
@BlancheMinerva @CFGeek v4 is a linear RNN w/o data-dependent decay and has a convolutional view. v6 is a linear RNN w/ data-dependent decay and a 2D matrix-valued state. v5 is like linear attn + data-independent decay, similar to RetNet.
0
0
3
@SonglinYang4
Songlin Yang
8 months
@Kagurazaka_L my sense is that data-dependent decay could help reduce the reliance on dimensionality: both RetNet and TransNormerLLM perform much worse when the expansion rate is no greater than 64, while Mamba only needs an expansion rate of 16, which saves memory for inference!
0
0
3
@SonglinYang4
Songlin Yang
2 years
@yoavgo Inside-Outside and Forward-Backward Algorithms Are Just Backprop (tutorial paper)
0
0
3
@SonglinYang4
Songlin Yang
8 months
For context, TNN is very similar to the Hyena model and belongs to the gated-convolution family. HGRN is a linear RNN with data-dependent decay but without hidden-dimension expansion, while Mamba includes state expansion
1
0
3
@SonglinYang4
Songlin Yang
8 months
TransnormerLLM is very similar to the RetNet model: both are linear attention models with a fixed decay rate. In this regard, they can be considered as linear RNNs with a very large (two-dimensional) hidden state, yet without data-dependent decay
0
0
3
@SonglinYang4
Songlin Yang
7 months
@__peterwolf__ @manifest__ai had analyzed the instability issue of linear xfmrs, suggesting removing the denominator and applying LayerNorm to the output, which is adopted by RetNet and GLA.
1
0
3
@SonglinYang4
Songlin Yang
6 months
@MrCatid @CFGeek Transnormer, RetNet. Simple, efficient, and they work.
1
0
2
@SonglinYang4
Songlin Yang
9 months
@realDanFu @arankomatsuzaki @_akhaliq Great work! How's this compared to flashbutterfly?
1
0
2
@SonglinYang4
Songlin Yang
2 years
0
0
2
@SonglinYang4
Songlin Yang
7 months
@CFGeek right right. Linear attention expands channels via the outer product, while SSMs expand channels via the single-input-multiple-output (SIMO) mechanism
1
0
2
@SonglinYang4
Songlin Yang
1 year
@arankomatsuzaki what are 1) multiquery local attn 2) TXL recurrence?
1
0
2
@SonglinYang4
Songlin Yang
1 year
@franz_nowak @ryandcotterell Interesting algorithm!
0
0
2
@SonglinYang4
Songlin Yang
6 months
interesting
@_akhaliq
AK
6 months
Infini-gram Scaling Unbounded n-gram Language Models to a Trillion Tokens demo: paper page: train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing
Tweet media one
8
74
401
0
0
2
@SonglinYang4
Songlin Yang
7 months
@__peterwolf__ @manifest__ai In addition, there is a decay factor in front of the hidden state, so the hidden state norm will not explode over time. Though I haven't tried larger context sizes yet, I don't think instability will be an issue like in many early linear transformers.
0
0
2
@SonglinYang4
Songlin Yang
1 year
@UndefBehavior Single author paper is super cool
0
0
2
@SonglinYang4
Songlin Yang
9 months
6. Our largest model HGRN-1B trained on the pile for 100B tokens is available at
0
0
2
@SonglinYang4
Songlin Yang
2 years
R.I.P.
@YiMaTweets
Yi Ma
2 years
I was shocked to know that Dr. Jian Sun, my former colleague of the MSRA Visual Computing Group, has passed away. We will miss him dearly. May his soul rest in peace.
20
59
602
0
0
2
@SonglinYang4
Songlin Yang
1 year
Oh really interesting
@albertobietti
Alberto Bietti
1 year
Recent Transformer LMs are very good at using their context for predicting new tokens. How does this capability arise during training? We study this in our paper "Birth of a Transformer: A Memory Viewpoint" 🐣
Tweet media one
4
51
289
0
1
2
@SonglinYang4
Songlin Yang
1 year
Super interesting
@_akhaliq
AK
1 year
Hyena Hierarchy: Towards Larger Convolutional Language Models propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating abs:
Tweet media one
4
20
119
0
0
2
@SonglinYang4
Songlin Yang
1 year
@dpkingma RWKV can be described as a gated diagonal state space model, but token shift is a crucial addition that sets it apart. This feature is similar to the "shift SSM" found in H3 and short local convolutions in Hyena.
1
0
2
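
A minimal sketch of the token-shift operation mentioned above: a learned per-channel interpolation between the current and previous token representations, applied before the usual key/value/receptance projections. The parameter name mu and the zero-padding convention are illustrative assumptions, not RWKV's exact implementation.

import numpy as np

# Hedged sketch of RWKV-style token shift: mix x_t with x_{t-1} channel-wise.
def token_shift(X, mu):
    # X: (T, d) token representations; mu: (d,) learned weights in [0, 1]
    X_prev = np.vstack([np.zeros((1, X.shape[1])), X[:-1]])   # previous token, zero-padded at t=0
    return mu * X + (1.0 - mu) * X_prev
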