Mahan Fathi

@MahanFathi

725
Followers
116
Following
18
Media
63
Statuses

research @nvidia 👁️; ex @mila_quebec , @googledeepmind & @google 🧠.

Montréal, QC
Joined June 2011
@MahanFathi
Mahan Fathi
24 days
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
34
5
425
@MahanFathi
Mahan Fathi
10 months
Why not get the best of both worlds by combining SSMs and Transformers? Excited to share our work at #NeurIPS2023: "Block-State Transformers." BST hits new highs in long-range language modeling and LRA tasks. paper: 1/
Tweet media one
8
64
387
@MahanFathi
Mahan Fathi
8 months
"Course Correcting Koopman Representations" accepted at #ICLR2024! We identify problems with unrolling in imagination and propose an unconventional, simple, yet effective solution: periodically "𝒓𝒆𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈" the latent. 📄 @GoogleDeepMind 1/🧵
Tweet media one
4
19
94
@MahanFathi
Mahan Fathi
1 year
Tweet media one
3
1
47
@MahanFathi
Mahan Fathi
10 months
Think of BST as a sequel to "Block-Recurrent Transformers." () In BRT, the transformer blocks talked to each other in a slow, sequential way, via recurrence. The catch? Blocks down the line had to wait their turn to process info. BST changes that game. 2/
1
0
15
@MahanFathi
Mahan Fathi
10 months
With BST, we keep the core action of the transformer blocks in the vertical direction, but get rid of the slow recurrence by letting SSMs step in to offer a set of 'context states' to the transformers. The context can be maintained in different (cool) ways! 3/
Tweet media one
1
1
11
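A minimal numpy sketch of that idea as I read it (everything here is an illustrative assumption, not the paper's implementation: the toy linear SSM, the single-head attention, and all shapes/names are made up). The SSM runs over the sequence once to produce per-position context states, and a transformer block then attends to those states instead of waiting on a block-level recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_context(x, A, B, C):
    """Toy linear SSM scan: h_t = A h_{t-1} + B x_t, ctx_t = C h_t.
    (Illustrative only; the real model uses structured, parallelizable SSMs.)"""
    T, _ = x.shape
    h = np.zeros(A.shape[0])
    ctx = np.zeros((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        ctx[t] = C @ h
    return ctx

def attend_to_context(x, ctx, Wq, Wk, Wv):
    """Single-head attention where queries come from the block input and
    keys/values come from the SSM context states (hypothetical wiring)."""
    q, k, v = x @ Wq, ctx @ Wk, ctx @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)  # causal mask
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

T, d, n = 8, 4, 16
x = rng.normal(size=(T, d))
A = 0.9 * np.eye(n)
B = rng.normal(size=(n, d)) * 0.1
C = rng.normal(size=(d, n)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

ctx = ssm_context(x, A, B, C)                 # (T, d) context states
out = attend_to_context(x, ctx, Wq, Wk, Wv)   # block output, (T, d)
```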
@MahanFathi
Mahan Fathi
10 months
to @iclr_conf reviewers who failed to acknowledge the rebuttals in time, 𝘁𝗵𝗲𝗿𝗲 𝗶𝘀 𝘀𝘁𝗶𝗹𝗹 𝘁𝗶𝗺𝗲. you can revise your review/score and include a post-rebuttal section. i know this is common knowledge but a friendly reminder never hurts. #iclr2024
0
0
11
@MahanFathi
Mahan Fathi
10 months
I won't be attending NeurIPS in person this year, but Jonathan and Ross will hold the fort. So catch our poster presentation on Wed, 13 Dec, 6–8 p.m. EST. P.S. The Hippo mascot comes from: 10/10
0
2
10
@MahanFathi
Mahan Fathi
10 months
@XinyiWang98 same here. received a last-minute response at 5am EST (!) with questions already addressed in the rebuttal. the overall quality and responsiveness of the reviewers have been a huge disappointment.
0
0
8
@MahanFathi
Mahan Fathi
10 months
This was joint work b/w @GoogleDeepMind and @Mila_Quebec , with Jonathan Pilault, @orf_bnw , @chrisjpal , @pierrelux , and Ross Goroshin. 9/
1
0
6
@MahanFathi
Mahan Fathi
10 months
Important ablations: 1. The more SSM layers we integrate into the architecture, the better the model gets. 2. BST generalizes with sequence length! We train on 4k but can go well above it at inference time. 3. It scales well: we tried models with over 1.3B parameters. 7/
Tweet media one
Tweet media two
Tweet media three
1
0
7
@MahanFathi
Mahan Fathi
10 months
With this, we get a 10x speedup over BRT at the layer level, all while improving perplexity! 4/
Tweet media one
1
0
8
@MahanFathi
Mahan Fathi
10 months
In addition to the above arch, Single-Head (SH), we also introduce Multi-Head (MH) and Multi-Filter (MF) variations. In a nutshell, MH plays on how the attention heads interact with the context, while MF focuses on extracting "diverse" sequence context. deets in the paper! 5/
Tweet media one
1
0
5
@MahanFathi
Mahan Fathi
8 months
To extract rich, meaningful representations of a dynamical system, one can constrain the codes to evolve linearly in the latent space (Koopman). This assumption gives rise to a model architecture resembling that of a linear RNN, with only one layer of recurrence. 2/
1
0
5
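Roughly, in code (a hedged sketch: the encoder/decoder here are placeholder linear maps, and the dimensions and names are made up), the model is just encode, one linear recurrence z_{t+1} = K z_t, decode.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 8                              # observation dim d, latent dim n (n > d)

# Placeholder "networks": plain linear maps stand in for neural nets.
W_enc = rng.normal(size=(n, d)) * 0.1    # encoder phi: x -> z
W_dec = rng.normal(size=(d, n)) * 0.1    # decoder psi: z -> x
K = np.eye(n) * 0.98                     # Koopman matrix: linear latent dynamics

phi = lambda x: W_enc @ x
psi = lambda z: W_dec @ z

# The only dynamics live in z-space and are linear: one recurrent "layer".
x0 = rng.normal(size=d)
z_next = K @ phi(x0)                     # latent step
x_next_hat = psi(z_next)                 # predicted next observation
```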
@MahanFathi
Mahan Fathi
10 months
We have the flexibility to interleave SSMs with transformer layers using a blend of the above techniques; however, in our experiments we maintain a unified architecture within each model, e.g. MF-only or SH-only. 6/
Tweet media one
1
0
4
@MahanFathi
Mahan Fathi
10 months
We tested BST in the Long-Range Arena (LRA), and the results were impressive. It often outperformed other baselines, and yes, that sometimes even includes pure SSM models! 8/
Tweet media one
1
0
4
@MahanFathi
Mahan Fathi
8 months
This was joint work between @GoogleDeepMind and @Mila_Quebec . Many thanks to my supervisors @RGoroshin and @pierrelux for their constant support and guidance throughout the project. Also props to @ClementGehring , @J_Pilault and @davidkanaa . See you in Vienna! ❤️ 14/14
1
0
3
@MahanFathi
Mahan Fathi
8 months
Cool. Now that we have a trained model, we should be able to take an initial condition as input (x), encode it to get the first latent (z), keep hitting (z) with (K) to get future (z)'s, and then decode everything back to (x). Let's try that out on a few dynamical systems. 5/
Tweet media one
1
0
3
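As a sketch (using the same placeholder linear phi/psi/K convention as above; all names and dimensions are assumptions): encode once, roll the latent forward with K, decode every step.

```python
import numpy as np

def unroll(x0, phi, psi, K, steps):
    """Encode the initial condition once, then stay in latent space:
    z_{t+1} = K z_t, decoding each latent back to observation space."""
    z = phi(x0)
    xs = []
    for _ in range(steps):
        z = K @ z
        xs.append(psi(z))
    return np.stack(xs)

# Toy usage with placeholder linear encoder/decoder (illustrative only).
rng = np.random.default_rng(0)
d, n = 2, 8
W_enc = rng.normal(size=(n, d)) * 0.1
W_dec = rng.normal(size=(d, n)) * 0.1
K = np.eye(n) * 0.99
traj = unroll(rng.normal(size=d), lambda x: W_enc @ x, lambda z: W_dec @ z, K, steps=100)
print(traj.shape)  # (100, 2)
```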
@MahanFathi
Mahan Fathi
8 months
A simple observation from the above plot is that the trajectory lines *cross*, and this violates the first principles of an autonomous dynamical system. We know that the (z) trajectories are faithful and don't cross. Why, then, do we suddenly get this behavior in (x) space? 7/
Tweet media one
1
0
3
@MahanFathi
Mahan Fathi
8 months
Oftentimes we assume the code has a bounded but very large dimension (n >> d) and hope to approximately linearize the latent space. The encoder (ϕ), the decoder (ψ), and the matrix (K) are then trained with a reconstruction loss and a linearity loss over a given horizon. 4/
Tweet media one
1
0
3
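In numpy-ish pseudocode (a sketch only: the exact loss weighting, parameterization, and names below are assumptions, not the paper's training code), that is a reconstruction term in x-space plus a linearity term in z-space, summed over the horizon.

```python
import numpy as np

def koopman_loss(xs, phi, psi, K, horizon):
    """xs: ground-truth trajectory of shape (T, d).
    Reconstruction: decode K^t phi(x_0) and compare to x_t.
    Linearity: K^t phi(x_0) should match phi(x_t)."""
    z = phi(xs[0])
    recon, linear = 0.0, 0.0
    for t in range(1, horizon + 1):
        z = K @ z                                 # advance latent linearly
        recon += np.sum((psi(z) - xs[t]) ** 2)    # x-space reconstruction
        linear += np.sum((z - phi(xs[t])) ** 2)   # z-space linearity
    return recon + linear                         # equal weighting is an assumption

# Toy call with linear placeholder maps.
rng = np.random.default_rng(0)
d, n, T = 2, 8, 11
W_enc, W_dec = rng.normal(size=(n, d)), rng.normal(size=(d, n))
K = np.eye(n) * 0.99
xs = rng.normal(size=(T, d))
loss = koopman_loss(xs, lambda x: W_enc @ x, lambda z: W_dec @ z, K, horizon=10)
```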
@MahanFathi
Mahan Fathi
8 months
We can form a loop by going from (z) to (x) at every unrolling step, and then back to (z). We call this "reencoding," achieved by calling the decoder and the encoder function over (z): (ϕ◦ψ(z)). 9/
Tweet media one
1
0
3
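A sketch of that loop (same placeholder phi/psi/K convention; illustrative only): at each unrolling step, decode the latent to x-space and immediately re-encode it before taking the next latent step.

```python
import numpy as np

def unroll_with_reencoding(x0, phi, psi, K, steps):
    """At every step: advance with K, decode z to x, then re-encode x back
    to z, i.e. z <- phi(psi(z)) ("reencoding")."""
    z = phi(x0)
    xs = []
    for _ in range(steps):
        z = K @ z
        x = psi(z)
        xs.append(x)
        z = phi(x)          # reencoding: back from x-space to z-space
    return np.stack(xs)
```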
@MahanFathi
Mahan Fathi
23 days
@JesseFarebro @nvidia Thanks Jesse!! 🙌🙌
0
0
3
@MahanFathi
Mahan Fathi
8 months
We have more theory and experiments in the paper, including higher-dim systems like MuJoCo environments (with control inputs!). Periodic reencoding always leads to (big) improvements, only at the cost of introducing one inference-time hyperparam, the reencoding period. 13/
Tweet media one
Tweet media two
1
0
3
@MahanFathi
Mahan Fathi
8 months
The idea of linear representations is related to Koopman theory, where the existence of a codespace satisfying the linearity condition is guaranteed, albeit in an infinite-dimensional latent space. I know that was a mouthful, but the model is pretty simple. 3/
1
0
3
@MahanFathi
Mahan Fathi
8 months
This method produces stable, accurate, long-range future state predictions while being fairly robust to the reencoding period, i.e. the number of steps taken in latent space before reencoding happens.
`reencode @ 0` -> no reencoding
`reencode @ 1` -> every-step reencoding 12/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
24 days
@hugo_larochelle @nvidia Thanks, Hugo!! 🙌🙌
0
0
2
@MahanFathi
Mahan Fathi
8 months
Food for thought: this is a bit weird, because we expect the encoder and the decoder to be inverses of one another, but they're not (why?). Unrolling the model this way, by "reencoding at every step," also results in poor performance, but at least without the crossing behavior. 10/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
8 months
There are two reasons for this. R1. We are modeling a closed system with an open one: the original DS has the form (x' = f(x)), which forms a feedback loop, and that "loop" is missing here. R2. The mapping from (z) to (x), i.e. the decoder, is non-injective, since (n > d)! 8/
1
0
2
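A tiny worked example of R2 (made-up numbers): when the decoder maps a higher-dimensional z down to x, distinct latents can decode to the same observation, so decoded trajectories can cross even though latent ones never do.

```python
import numpy as np

# Hypothetical linear decoder from a 2-D latent to a 1-D observation.
psi = np.array([[1.0, 0.0]])      # keeps only the first latent coordinate

z1 = np.array([0.5, -1.0])
z2 = np.array([0.5,  3.0])        # a different latent state

print(psi @ z1, psi @ z2)         # both decode to [0.5]: non-injective
```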
@MahanFathi
Mahan Fathi
8 months
So far we have found that 1) reencoding is necessary, and 2) it introduces its own error. We have discovered an effective, albeit imperfect, tool, so let's use it in moderation. Enter "Periodic Reencoding!" Here we reencode the representations every so often (every k steps). 11/
Tweet media one
1
0
2
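A minimal sketch of periodic reencoding (placeholder linear phi/psi/K again; the period semantics follow the thread, with k = 0 meaning no reencoding and k = 1 meaning every-step reencoding, but the implementation details are assumptions):

```python
import numpy as np

def unroll_periodic_reencoding(x0, phi, psi, K, steps, k):
    """Unroll in latent space, but decode and re-encode every k steps.
    k = 0 -> never reencode; k = 1 -> reencode at every step."""
    z = phi(x0)
    xs = []
    for t in range(1, steps + 1):
        z = K @ z
        x = psi(z)
        xs.append(x)
        if k > 0 and t % k == 0:
            z = phi(x)              # periodic reencoding
    return np.stack(xs)

# Toy usage with linear placeholders (illustrative only).
rng = np.random.default_rng(0)
d, n = 2, 8
W_enc = rng.normal(size=(n, d)) * 0.1
W_dec = rng.normal(size=(d, n)) * 0.1
K = np.eye(n) * 0.99
traj = unroll_periodic_reencoding(
    rng.normal(size=d), lambda x: W_enc @ x, lambda z: W_dec @ z, K, steps=200, k=10)
```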
@MahanFathi
Mahan Fathi
8 months
Here we train the model on the Duffing Oscillator system and look at the phase plots generated by unrolling the model using the above method. Well, things seem a bit off here. 6/
Tweet media one
1
0
2
@MahanFathi
Mahan Fathi
23 days
@pcastr @nvidia Thanks, Pablo!! 🙌🙌
0
0
1
@MahanFathi
Mahan Fathi
24 days
@reza_byt_en @nvidia Thanks Reza!! 🤗
0
0
1
@MahanFathi
Mahan Fathi
24 days
@adityapuranik99 @nvidia Absolutely, thank you!
0
0
1
@MahanFathi
Mahan Fathi
24 days
@dythui @nvidia thanks, David! i really appreciate the support!
0
0
1
@MahanFathi
Mahan Fathi
1 year
@MaxMa1987 @arankomatsuzaki Thank you for bringing this to our attention. We'll make sure to read your paper and acknowledge your work if we find it relevant to ours. Frankly, as the practice of chunkifying the outputs of the SSMs is a common approach, we don't consider it to be the primary contribution 1/2
1
0
1
@MahanFathi
Mahan Fathi
23 days
@johanobandoc @nvidia Thanks Johan!! 🤗🤗❤️
0
0
1
@MahanFathi
Mahan Fathi
24 days
@AmrMAlameen @nvidia Thanks, Amr!! 🫂
0
0
1