Mahan Fathi

@MahanFathi

725
Followers
116
Following
18
Media
63
Statuses

research @nvidia 👁️; ex @mila_quebec , @googledeepmind & @google 🧠.

Montréal, QC
Joined June 2011
@MahanFathi
Mahan Fathi
24 days
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
34
5
425
@MahanFathi
Mahan Fathi
10 months
Why not get the best of both worlds by combining SSMs and Transformers? Excited to share our work at #NeurIPS2023: "Block-State Transformers." BST hits new highs in long-range language modeling and LRA tasks. paper: 1/
Tweet media one
8
64
387
@MahanFathi
Mahan Fathi
8 months
"Course Correcting Koopman Representations" accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 @GoogleDeepMind 1/🧵
Tweet media one
4
19
94
@MahanFathi
Mahan Fathi
1 year
Tweet media one
3
1
47
@MahanFathi
Mahan Fathi
10 months
Think of BST as a sequel to "Block-Recurrent Transformers." () In BRT, the transformer blocks talked to each other in a slow, sequential way, via recurrence. The catch? Blocks down the line had to wait their turn to process info. BST changes that game. 2/
1
0
15
@MahanFathi
Mahan Fathi
10 months
With BST, we keep the core action of the transformer blocks in the vertical direction, but get rid of the slow recurrence by letting SSMs step in to offer a set of 'context states' to the transformers. The context can be maintained in different (cool) ways! 3/
Tweet media one
1
1
11
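A minimal numpy sketch of that idea as I read it (everything here is an illustrative assumption, not the paper's implementation: the toy linear SSM, the single-head attention, and all shapes/names are made up). The SSM runs over the sequence once to produce per-position context states, and a transformer block then attends to those states instead of waiting on a block-level recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_context(x, A, B, C):
    """Toy linear SSM scan: h_t = A h_{t-1} + B x_t, ctx_t = C h_t.
    (Illustrative only; the real model uses structured, parallelizable SSMs.)"""
    T, _ = x.shape
    h = np.zeros(A.shape[0])
    ctx = np.zeros((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        ctx[t] = C @ h
    return ctx

def attend_to_context(x, ctx, Wq, Wk, Wv):
    """Single-head attention where queries come from the block input and
    keys/values come from the SSM context states (hypothetical wiring)."""
    q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)  # causal mask
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

T, d, n = 8, 4, 16
x = rng.normal(size=(T, d))
A = 0.9 * np.eye(n)
B = rng.normal(size=(n, d)) * 0.1
C = rng.normal(size=(d, n)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

ctx = ssm_context(x, A, B, C)                 # (T, d) context states
out = attend_to_context(x, ctx, Wq, Wk, Wv)   # block output, (T, d)
```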
@MahanFathi
Mahan Fathi
10 months
to @iclr_conf reviewers who failed to acknowledge the rebuttals in time, 𝘁𝗵𝗲𝗿𝗲 𝗶𝘀 𝘀𝘁𝗶𝗹𝗹 𝘁𝗶𝗺𝗲. you can revise your review/score and include a post-rebuttal section. i know this is common knowledge but a friendly reminder never hurts. #iclr2024
0
0
11
@MahanFathi
Mahan Fathi
10 months
I won't be attending NeurIPS in person this year, but Jonathan and Ross will hold the fort. So catch our poster presentation on Wed, 13 Dec, 6–8 p.m. EST. P.S. The Hippo mascot comes from: 10/10
0
2
10
@MahanFathi
Mahan Fathi
10 months
@XinyiWang98 same here. received a last-minute response at 5am EST (!) with questions already addressed in the rebuttal. the overall quality and responsiveness of the reviewers have been a huge disappointment.
0
0
8
@MahanFathi
Mahan Fathi
10 months
This was joint work b/w @GoogleDeepMind and @Mila_Quebec , with Jonathan Pilault, @orf_bnw , @chrisjpal , @pierrelux , and Ross Goroshin. 9/
1
0
6
@MahanFathi
Mahan Fathi
10 months
Important ablations: 1. The more SSM layers we integrate into the architecture, the better the model gets. 2. BST generalizes with sequence length! We train on 4k but can go well above it at inference time. 3. It scales well: we tried models with over 1.3B parameters. 7/
Tweet media one
Tweet media two
Tweet media three
1
0
7
@MahanFathi
Mahan Fathi
10 months
With this, we get a 10x speedup over BRT at the layer level, all while improving perplexity! 4/
Tweet media one
1
0
8
@MahanFathi
Mahan Fathi
10 months
In addition to the above arch, Single-Head (SH), we also introduce Multi-Head (MH) and Multi-Filter (MF) variations. In a nutshell, MH plays on how the attention heads interact with the context, while MF focuses on extracting "diverse" sequence context. deets in the paper! 5/
Tweet media one
1
0
5
@MahanFathi
Mahan Fathi
8 months
To extract rich, meaningful representations of a dynamical system, one can constrain the codes to evolve linearly in the latent space (Koopman). This assumption gives rise to a model architecture resembling that of a linear RNN, with only one layer of recurrence. 2/
1
0
5
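Roughly, in code (a hedged sketch: the encoder/decoder here are placeholder linear maps, and the dimensions and names are made up), the model is just encode, one linear recurrence z_{t+1} = K z_t, decode.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 8                              # observation dim d, latent dim n (n > d)

# Placeholder "networks": plain linear maps stand in for neural nets.
W_enc = rng.normal(size=(n, d)) * 0.1    # encoder phi: x -> z
W_dec = rng.normal(size=(d, n)) * 0.1    # decoder psi: z -> x
K = np.eye(n) * 0.98                     # Koopman matrix: linear latent dynamics

phi = lambda x: W_enc @ x
psi = lambda z: W_dec @ z

# The only dynamics live in z-space and are linear: one recurrent "layer".
x0 = rng.normal(size=d)
z_next = K @ phi(x0)                     # latent step
x_next_hat = psi(z_next)                 # predicted next observation
```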
@MahanFathi
Mahan Fathi
10 months
We have the flexibility to interleave SSMs with transformer layers using a blend of the above techniques; however, in our experiments we maintain a unified architecture within each model, e.g. MF-only or SH-only. 6/
Tweet media one
1
0
4
@MahanFathi
Mahan Fathi
10 months
We tested BST in the Long-Range Arena (LRA), and the results were impressive. It often outperformed other baselines, and yes, that sometimes even includes pure SSM models! 8/
Tweet media one
1
0
4
@MahanFathi
Mahan Fathi
8 months
This was joint work between @GoogleDeepMind and @Mila_Quebec . Many thanks to my supervisors @RGoroshin and @pierrelux for their constant support and guidance throughout the project. Also props to @ClementGehring , @J_Pilault and @davidkanaa . See you in Vienna! ❤️ 14/14
1
0
3
@MahanFathi
Mahan Fathi
8 months
Cool. Now that we have a trained model, we should be able to take an initial condition as input (x), encode it to get the first latent (z), keep hitting (z) with (K) to get future (z)'s, and then decode everything back to (x). Let's try that out on a few dynamical systems. 5/
Tweet media one
1
0
3
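As a sketch (using the same placeholder linear phi/psi/K convention as above; all names and dimensions are assumptions): encode once, roll the latent forward with K, decode every step.

```python
import numpy as np

def unroll(x0, phi, psi, K, steps):
    """Encode the initial condition once, then stay in latent space:
    z_{t+1} = K z_t, decoding each latent back to observation space."""
    z = phi(x0)
    xs = []
    for _ in range(steps):
        z = K @ z
        xs.append(psi(z))
    return np.stack(xs)

# Toy usage with placeholder linear encoder/decoder (illustrative only).
rng = np.random.default_rng(0)
d, n = 2, 8
W_enc = rng.normal(size=(n, d)) * 0.1
W_dec = rng.normal(size=(d, n)) * 0.1
K = np.eye(n) * 0.99
traj = unroll(rng.normal(size=d), lambda x: W_enc @ x, lambda z: W_dec @ z, K, steps=100)
print(traj.shape)  # (100, 2)
```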
@MahanFathi
Mahan Fathi
8 months
A simple observation from the above plot is that the trajectory lines *cross*, and this violates the first principles of an autonomous dynamical system. We know that the (z) trajectories are faithful and don't cross. Why, then, do we suddenly get this behavior in (x) space? 7/
Tweet media one
1
0
3
@MahanFathi
Mahan Fathi
8 months
Oftentimes we assume the code has a bounded but very large dimension (n >> d) and hope to approximately linearize the latent space. The encoder (ϕ), the decoder (ψ), and the matrix (K) are then trained with a reconstruction loss and a linearity loss over a given horizon. 4/
Tweet media one
1
0
3
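In numpy-ish pseudocode (a sketch only: the exact loss weighting, parameterization, and names below are assumptions, not the paper's training code), that is a reconstruction term in x-space plus a linearity term in z-space, summed over the horizon.

```python
import numpy as np

def koopman_loss(xs, phi, psi, K, horizon):
    """xs: ground-truth trajectory of shape (T, d).
    Reconstruction: decode K^t phi(x_0) and compare to x_t.
    Linearity: K^t phi(x_0) should match phi(x_t)."""
    z = phi(xs[0])
    recon, linear = 0.0, 0.0
    for t in range(1, horizon + 1):
        z = K @ z                                 # advance latent linearly
        recon += np.sum((psi(z) - xs[t]) ** 2)    # x-space reconstruction
        linear += np.sum((z - phi(xs[t])) ** 2)   # z-space linearity
    return recon + linear                         # equal weighting is an assumption

# Toy call with linear placeholder maps.
rng = np.random.default_rng(0)
d, n, T = 2, 8, 11
W_enc, W_dec = rng.normal(size=(n, d)), rng.normal(size=(d, n))
K = np.eye(n) * 0.99
xs = rng.normal(size=(T, d))
loss = koopman_loss(xs, lambda x: W_enc @ x, lambda z: W_dec @ z, K, horizon=10)
```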
@MahanFathi
Mahan Fathi
8 months
We can form a loop by going from (z) to (x) at every unrolling step, and then back to (z). We call this "reencoding," achieved by calling the decoder and the encoder function over (z): (ϕ◦ψ(z)). 9/
Tweet media one
1
0
3
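A sketch of that loop (same placeholder phi/psi/K convention; illustrative only): at each unrolling step, decode the latent to x-space and immediately re-encode it before taking the next latent step.

```python
import numpy as np

def unroll_with_reencoding(x0, phi, psi, K, steps):
    """At every step: advance with K, decode z to x, then re-encode x back
    to z, i.e. z <- phi(psi(z)) ("reencoding")."""
    z = phi(x0)
    xs = []
    for _ in range(steps):
        z = K @ z
        x = psi(z)
        xs.append(x)
        z = phi(x)          # reencoding: back from x-space to z-space
    return np.stack(xs)
```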
@MahanFathi
Mahan Fathi
23 days
@JesseFarebro @nvidia Thanks Jesse!! 🙌🙌
0
0
3
@MahanFathi
Mahan Fathi
8 months
We have more theory and experiments in the paper, including higher-dim systems like MuJoCo environments (with control inputs!). Periodic reencoding always leads to (big) improvements, only at the cost of introducing one inference-time hyperparam, the reencoding period. 13/
Tweet media one
Tweet media two
1
0
3
@MahanFathi
Mahan Fathi
8 months
The idea of linear representations is related to Koopman theory, where the existence of a codespace satisfying the linearity condition is guaranteed, albeit in an infinite-dimensional latent space. I know that was a mouthful, but the model is pretty simple. 3/
1
0
3
@MahanFathi
Mahan Fathi
8 months
This method produces stable, accurate, long-range future state predictions while being fairly robust to the reencoding period, i.e. the number of steps taken in latent space before reencoding happens.
`reencode @ 0` -> no reencoding
`reencode @ 1` -> every-step reencoding 12/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
24 days
@hugo_larochelle @nvidia Thanks, Hugo!! 🙌🙌
0
0
2
@MahanFathi
Mahan Fathi
8 months
Food for thought: this is a bit weird, because we expect the encoder and the decoder to be inverses of one another, but they're not (why?). Unrolling the model this way, by "reencoding at every step," also results in poor performance, but at least without the crossing behavior. 10/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
8 months
There are two reasons for this. R1. We are modeling a closed system with an open one: the original DS has the form (x' = f(x)), which forms a feedback loop, and that "loop" is missing here. R2. The mapping from (z) to (x), i.e. the decoder, is non-injective, since (n > d)! 8/
1
0
2
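A tiny worked example of R2 (made-up numbers): when the decoder maps a higher-dimensional z down to x, distinct latents can decode to the same observation, so decoded trajectories can cross even though latent ones never do.

```python
import numpy as np

# Hypothetical linear decoder from a 2-D latent to a 1-D observation.
psi = np.array([[1.0, 0.0]])      # keeps only the first latent coordinate

z1 = np.array([0.5, -1.0])
z2 = np.array([0.5,  3.0])        # a different latent state

print(psi @ z1, psi @ z2)         # both decode to [0.5]: non-injective
```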
@MahanFathi
Mahan Fathi
8 months
So far we have found that 1) reencoding is necessary, and 2) it introduces its own error. We have discovered an effective, albeit imperfect, tool, so let's use it in moderation. Enter "Periodic Reencoding!" Here we reencode the representations every so often (every k steps). 11/
Tweet media one
1
0
2
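A minimal sketch of periodic reencoding (placeholder linear phi/psi/K again; the period semantics follow the thread, with k = 0 meaning no reencoding and k = 1 meaning every-step reencoding, but the implementation details are assumptions):

```python
import numpy as np

def unroll_periodic_reencoding(x0, phi, psi, K, steps, k):
    """Unroll in latent space, but decode and re-encode every k steps.
    k = 0 -> never reencode; k = 1 -> reencode at every step."""
    z = phi(x0)
    xs = []
    for t in range(1, steps + 1):
        z = K @ z
        x = psi(z)
        xs.append(x)
        if k > 0 and t % k == 0:
            z = phi(x)              # periodic reencoding
    return np.stack(xs)

# Toy usage with linear placeholders (illustrative only).
rng = np.random.default_rng(0)
d, n = 2, 8
W_enc = rng.normal(size=(n, d)) * 0.1
W_dec = rng.normal(size=(d, n)) * 0.1
K = np.eye(n) * 0.99
traj = unroll_periodic_reencoding(
    rng.normal(size=d), lambda x: W_enc @ x, lambda z: W_dec @ z, K, steps=200, k=10)
```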
@MahanFathi
Mahan Fathi
8 months
Here we train the model on the Duffing Oscillator system and look at the phase plots generated by unrolling the model using the above method. Well, things seem a bit off here. 6/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
23 days
@pcastr @nvidia Thanks, Pablo!! 🙌🙌
0
0
1
@MahanFathi
Mahan Fathi
24 days
@reza_byt_en @nvidia Thanks Reza!! 🤗
0
0
1
@MahanFathi
Mahan Fathi
24 days
@adityapuranik99 @nvidia Absolutely, thank you!
0
0
1
@MahanFathi
Mahan Fathi
24 days
@dythui @nvidia thanks, David! i really appreciate the support!
0
0
1
@MahanFathi
Mahan Fathi
1 year
@MaxMa1987 @arankomatsuzaki Thank you for bringing this to our attention. We'll make sure to read your paper and acknowledge your work if we find it relevant to ours. Frankly, as the practice of chunkifying the outputs of the SSMs is a common approach, we don't consider it to be the primary contribution 1/2
1
0
1
@MahanFathi
Mahan Fathi
23 days
@johanobandoc @nvidia Thanks Johan!! 🤗🤗❤️
0
0
1
@MahanFathi
Mahan Fathi
24 days
@AmrMAlameen @nvidia Thanks, Amr!! 🫂
0
0
1