We combine these two functionalities into a burst-sampling scheme with rejection.
It generates all the remaining tokens in one shot, and keeps only the ones that are self-consistent. This allows sublinear sequence generation similar to diffusion.
5/6
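The burst-sampling loop can be sketched like this (a toy illustration with stand-in `propose` and `accept_prob` functions, not the actual σ-GPT code — the real scheme uses the model's own probabilities for both):

```python
import random

def burst_sample_step(propose, accept_prob, positions, rng):
    """One burst step: propose tokens for every remaining position at once,
    then keep only the leading run that passes the acceptance test.
    `propose(pos)` and `accept_prob(pos, tok)` are hypothetical stand-ins
    for the model's parallel proposal and its self-consistency check."""
    accepted = {}
    for pos in positions:          # positions given in the evaluation order
        tok = propose(pos)
        if rng.random() <= accept_prob(pos, tok):
            accepted[pos] = tok    # self-consistent: keep it
        else:
            break                  # first rejection invalidates the rest
    return accepted
```

A rejected token invalidates everything after it in the chosen order, which is why picking a good evaluation order matters.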
GPTs generate sequences in a left-to-right order. Is there another way?
With
@francoisfleuret
and
@evanncourdier
, in partnership with
@SkysoftATM
, we developed σ-GPT, capable of generating sequences in any order chosen dynamically at inference time.
1/6
Diffusion models are surprisingly good at solving algorithmic tasks.
With
@francoisfleuret
and
@evanncourdier
, we use discrete diffusion to find shortest paths in mazes represented as images.
1/5
The code for σ-GPT is out!
Feel free to experiment with it here:
It extends the
@francoisfleuret
picoclvr (non-nlp tasks) and
@karpathy
nanoGPT (text) codebases, with an additional KV-Cache from IRIS by
@micheli_vincent
and
@EloiAlonso1
.
(links below)
This opens up a multitude of possibilities, such as: (A) conditioning on any subset of tokens from a sequence, or (B) estimating the probability of all the remaining tokens at once.
3/6
We only need to shuffle the input sequence and use a double positional encoding, so that the model is informed of which token to generate at each position of the output sequence.
2/6
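The training inputs can be sketched as follows (a toy version with plain lookup tables instead of learned embedding layers; names are illustrative, not from the σ-GPT codebase):

```python
import numpy as np

def shuffled_training_views(seq, order, tok_emb, pos_in_emb, pos_out_emb):
    """At step t the model reads the token at order[t] and must predict the
    token at order[t+1], so each input sums three embeddings:
    token + its own position + the position to generate next."""
    inputs = (tok_emb[[seq[i] for i in order[:-1]]]
              + pos_in_emb[order[:-1]]     # where this token sits
              + pos_out_emb[order[1:]])    # where the next prediction goes
    targets = [seq[i] for i in order[1:]]
    return inputs, targets
```

With the second positional encoding, the model always knows which position it is being asked to fill, whatever the shuffle.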
(A) is done by putting the conditioning tokens “first”, and (B) is done by putting all the remaining tokens as “the next”. The latter requires carving the attention matrix.
4/6
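The attention carving for case (B) can be sketched as a boolean mask (an assumed layout for illustration, not the paper's exact implementation):

```python
import numpy as np

def burst_attention_mask(n_prefix, n_burst):
    """The first n_prefix positions are ordinary causal tokens; the n_burst
    query positions each see the whole prefix but not each other, so all
    remaining tokens can be scored in a single forward pass."""
    n = n_prefix + n_burst
    mask = np.tril(np.ones((n, n), dtype=bool))       # standard causal mask
    burst = slice(n_prefix, n)
    mask[burst, burst] = np.eye(n_burst, dtype=bool)  # carve: no cross-attention among burst queries
    return mask
```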
We use discrete diffusion with a U-net. The maze with the start and goal is encoded in one channel, and the model denoises the maze with a solution in another channel.
Unet:
Discrete diffusion:
3/5
To estimate the denoising step p(x_{t-1} | x_t), the algorithm estimates p(x_0 | x_t). Visualizing this estimate during the process (bottom row) shows the "current hypotheses", which eventually concentrate on the result.
5/5
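The forward corruption that the reverse model has to undo can be sketched in a simplified D3PM-style form (assumed details; the paper's exact noise schedule may differ):

```python
import numpy as np

def uniform_corrupt(x0, keep_prob, num_classes, rng):
    """Forward step of uniform discrete diffusion: each token keeps its
    value with probability keep_prob and is otherwise resampled uniformly.
    The reverse model is trained to recover x0 from such corrupted inputs,
    which is the p(x_0 | x_t) estimate visualized in the bottom row."""
    keep = rng.random(x0.shape) < keep_prob
    noise = rng.integers(0, num_classes, size=x0.shape)
    return np.where(keep, x0, noise)
```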
Each maze is generated by iteratively adding horizontal and vertical walls. A starting location and a goal location are picked at random, and one path is sampled uniformly among the shortest paths from the starting point to the goal, computed with an exact algorithm.
2/5
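The exact solver part can be sketched with a plain BFS (a stand-in for the exact algorithm mentioned above; note the authors sample uniformly among *all* shortest paths, while this sketch returns just one of them):

```python
from collections import deque

def shortest_path(walls, start, goal):
    """Exact BFS shortest path on a grid maze; `walls` is a set of blocked
    (row, col) cells. Returns the list of cells from start to goal."""
    h = max(r for r, _ in walls | {start, goal}) + 1
    w = max(c for _, c in walls | {start, goal}) + 1
    prev, frontier = {start: None}, deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:                      # reconstruct by walking back
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < h and 0 <= nxt[1] < w and nxt not in walls and nxt not in prev:
                prev[nxt] = cell
                frontier.append(nxt)
    return None
```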
@samsja19
@francoisfleuret
@EvannCourdier
Thanks a lot!
Surprisingly, in our experiments, LLMs were also able to solve the problem with near-perfect accuracy, even though we had to flatten inputs in a raster-scan order for the transformers. The only drawback is that this does not scale easily to larger sizes due to their quadratic cost.
@TmlrPub
Transformers are great at extrapolation from sparse data.
In our last TMLR paper, in partnership with
@SkysoftATM
,
@francoisfleuret
, Kyle Matoba and I modified a ViT-like model to encode sparse context data and to generate reliable conditional predictions anywhere in space.
1/5
@jstock37
@francoisfleuret
@EvannCourdier
@SkysoftATM
I agree that it can be limiting in a way, but it can also be useful for the model: fixing some tokens might give the overall structure of the text it wants to generate, and then it only has to fill in between.
Really proud to announce that our paper on High-Altitude Windnowcasting with
@francoisfleuret
and R. Picatoste was accepted for publication at
#SIAMSDM22
!
Many thanks to the reviewers for their valuable remarks!
Pre-print:
@Pierre653878019
@francoisfleuret
I don't really know about the brain, but it was interesting to see that the model, before even starting the autoregression, already has a good idea of the solution it wants to generate.
@cavemanloverboy
@francoisfleuret
@EvannCourdier
@SkysoftATM
The ordering is indeed random. It's hard to directly learn an order, as we don't have a "ground truth" for it.
But actually, in the rejection sampling scheme, there is something along these lines:
@ChrSzegedy
@francoisfleuret
@EvannCourdier
@SkysoftATM
The objective is similar: they both model shuffled sequences. The implementation differs: XLNet uses masking, whereas we use a double positional encoding to give the model the position of the next token. The burst-sampling scheme is new. (More details in the related works)
@cavemanloverboy
@francoisfleuret
@EvannCourdier
@SkysoftATM
We generate all the remaining tokens at the same time, then pick N different orders (N is a hyperparameter) and keep the order that validates the most tokens. So if there is a preferred order, it might be selected there.
@TmlrPub
@SkysoftATM
@francoisfleuret
The goal of this project is to exploit live wind measurements from airplanes to generate reliable short term forecasts.
As airplanes are moving along sparse flight routes, we need an architecture that can handle unstructured sets of data.
2/5
@ffaebi
@francoisfleuret
@EvannCourdier
@SkysoftATM
Here, we don't resample tokens once they are generated. But it is absolutely possible to remove any token you want during generation.
It should also be possible to use the model to highlight tokens that are unlikely given the partial generation but we did not investigate it yet.
@_jasonliu_
@francoisfleuret
@EvannCourdier
Probably. I guess, as for other deep learning tasks, the model will perform better on samples that are close to what it has seen in its training set.
@lemergenz
@samsja19
@francoisfleuret
@EvannCourdier
Even with Flash Attention, large mazes are 128x192 ≃ 24k tokens, which is too much for our GPUs. But it could work with standard techniques (AR in the latent space, for example).
@Walley7777
@francoisfleuret
@EvannCourdier
They are slower and can sometimes even output the wrong path. But comparing with traditional algorithms is not the point as they are exact and use particular data structures. Here, the model must understand the problem only by looking at pairs of images (empty maze/solved maze).
@TmlrPub
@SkysoftATM
@francoisfleuret
These problems usually use encoder-decoder models. The encoder processes the context measurements and the decoder is then queried with positions and outputs conditional forecasts.
We simplified that and used a single encoder stack, leading to a simple and robust model.
3/5
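The token assembly for that single encoder stack can be sketched like this (assumed layout for illustration; the actual model details are in the paper):

```python
import numpy as np

def build_encoder_input(ctx_pos, ctx_val, query_pos, placeholder):
    """Context measurements become (position, value) tokens, query locations
    become (position, placeholder) tokens, and one encoder stack attends
    over the concatenation so every query sees the full context directly."""
    ctx = np.concatenate([ctx_pos, ctx_val], axis=-1)
    pad = np.broadcast_to(placeholder, (len(query_pos), len(placeholder)))
    qry = np.concatenate([query_pos, pad], axis=-1)
    return np.concatenate([ctx, qry], axis=0)  # one joint token sequence
```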
@aristot_3rd
@francoisfleuret
@EvannCourdier
@SkysoftATM
It can generate sequences in any order, so if you want to use it left-to-right, that is still possible. Any trick that works for GPT should work here.
But even without left-to-right generation, it is still possible to prompt with anything you want (CoT included).
@JoshPurtell
Thanks for the reference, we'll look into it. Here we did it in a self-supervised manner rather than with RL, but there seem to be similarities with the way models can correct partially incorrect representations.
@TmlrPub
@SkysoftATM
@francoisfleuret
The reason why this architecture works well is that the attention mechanism allows each query to see the whole context without bottleneck. We showed that this leads to a better gradient flow, which helps during training.
4/5
@nudelbrot
@francoisfleuret
@karpathy
@micheli_vincent
@EloiAlonso1
If you want to gradually mask from random cells to whole columns, I would go for a BERT-like model with a custom masking scheme.
That being said, you could still use σ-GPT with a curriculum scheme that starts left-to-right, then switches to columns, and then goes fully random.
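One way such a curriculum could look (a toy schedule, not the paper's exact one): shuffle a growing fraction of positions as training progresses.

```python
import random

def curriculum_order(n, progress, rng):
    """Start from the identity (left-to-right) order and shuffle a growing
    fraction of positions; progress=0 is fully ordered, progress=1 is a
    fully random permutation."""
    order = list(range(n))
    k = round(progress * n)          # how many positions take part
    idx = rng.sample(range(n), k)    # which positions to shuffle
    vals = [order[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        order[i] = v
    return order
```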
@IoannisKakogeo1
@francoisfleuret
@EvannCourdier
@SkysoftATM
@SpyrosGidaris
Nice! We didn't look at the strength of the signal back-propagated to the model (your fig.4 is cool).
I see that in your set of different permutations you used variants of raster-scan order and spirals.
Did you try completely random orders at some point?
@jonas_eschmann
@francoisfleuret
@EvannCourdier
@SkysoftATM
We actually don't for now. Adding tokens is indeed possible (you only have to add one to the positional encoding of the rest of the sequence). We didn't study this in this work, but it's an interesting line of future work.
@st01014
@francoisfleuret
On text, we have generations of similar quality if you have enough diffusion steps.
The main advantage over diffusion is that the rejection sampling is dynamic: it will do fewer steps if the generation is simple.
More details there
@roviro_
@francoisfleuret
@EvannCourdier
Not directly, but models will be slower and can sometimes even output the wrong path. Comparing with traditional algorithms is not the point as they are conceived to be exact. Here, the model must understand the problem only by looking at pairs of images (empty maze/solved maze).
@giffmana
@ECMLPKDD
Exactly, it is the motivating task of this project: we were trying to model ascent/descent of airplanes under control in a low-data regime. We had a repetition problem where the model sometimes outputs plateauing sequences at the wrong altitude, and the random order helps.
@TuanPham672604
@francoisfleuret
We did not specifically test this. But I think it should be directly possible by specifying an <end-of-sequence> token at the position you want and letting the model complete in between.
@pr0me
@francoisfleuret
For the text, we had to introduce a curriculum learning scheme where we start by giving the model sequences in order and then progressively shuffle them. But for the other tasks no adjustments were needed, just shuffling.
@NickestOfNicks
We are working on an arXiv version of this thread.
The full codebase for the discrete diffusion process will be released at the same time.
The Unet model we used is the one linked from lucidrains.
The maze generation is done with:
@yar_vol
@francoisfleuret
In this paper, we do not replace tokens. However, it is straightforward to manually remove tokens during sampling. And you can also reevaluate the partially generated sequence under another order and remove tokens as you want (but we don't study this in the paper).
@basavasagar18
In that specific case, we only have 5 "types" of pixels (empty, wall, start, end, path), so the nature of the data is discrete.
@nudelbrot
@francoisfleuret
@karpathy
@micheli_vincent
@EloiAlonso1
Yeah, changing it will let you use any kind of ordering. But if you want to be able to change on the fly at generation time, you need to keep a random order during training.
Note that σ-GPT does not use masking like BERT; it is causal like a standard GPT.
@arnicas
We are working on an arXiv version of this thread. We'll release a project site with more examples, along with the full codebase for the discrete diffusion process at the same time.
BTW, we tried bigger mazes. After two epochs it already holds up very well.
Note that it does not check that the length is optimal, only that it is a proper path (continuous, no spurious path pixels).
@cgarciae88
@francoisfleuret
Hey! We are working on an arXiv version of this thread. The full codebase for the discrete diffusion process will be released at the same time. The Unet model we used is the one linked from lucidrains. The maze generation is done with:
@dr_mike_hammes
@francoisfleuret
@EvannCourdier
@SkysoftATM
We did not specifically test that, but indeed we can probably prompt the model at inference time with <sos> and <eos> at the desired positions and let the model complete the rest.
And for the first part: prompt with <eos> and see where in the sequence <eos> is most likely.
@joshua_saxe
@francoisfleuret
@EvannCourdier
@SkysoftATM
We initially thought something along these lines: that it would serve as a kind of data augmentation. But from what we've seen in the experiments, it seems that it's the reverse: in a low-data setting, as learning in a shuffled order is harder, the model memorises the data more.
@fouriergalois
@francoisfleuret
@EvannCourdier
@SkysoftATM
We didn't. We thought about it, but as training in a random order needs more compute than left-to-right, it would probably still require a good amount of compute, which was too much for us. And we would still need to add the double positional encoding without breaking the base model.