Arnaud Pannatier Profile Banner
Arnaud Pannatier Profile
Arnaud Pannatier

@ArnaudPannatier

1,178
Followers
272
Following
19
Media
123
Statuses

PhD Student in @francoisfleuret 's Machine Learning Group, Idiap Research Institute - @Idiap_ch | EDIC, EPFL - @EPFL_en

Valais
Joined August 2010
Pinned Tweet
@ArnaudPannatier
Arnaud Pannatier
5 months
We combine these two functionalities into a burst-sampling scheme with rejection. It generates all the remaining tokens in one shot, and keeps only the ones that are self-consistent. This allows sublinear sequence generation similar to diffusion. 5/6
3
11
158
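A minimal sketch of how such a burst-sampling-with-rejection loop could look, assuming a hypothetical `model_logits(accepted, positions)` callable that returns one categorical distribution per queried position, conditioned on the already-accepted tokens. This is only a schematic reading of the tweet, not the released σ-GPT implementation, and the exact accept/reject criterion used in the paper may differ.

```python
import torch

def burst_sample(model_logits, positions, max_rounds=100):
    """Schematic burst sampling with rejection (not the released sigma-GPT code).

    model_logits(accepted, query_positions) is a hypothetical callable returning a
    (len(query_positions), vocab) tensor of logits, conditioned on `accepted`
    (a dict mapping position -> token)."""
    accepted = {}
    for _ in range(max_rounds):
        remaining = [p for p in positions if p not in accepted]
        if not remaining:
            break
        # 1) Draft: sample every remaining token in one shot.
        draft_logits = model_logits(accepted, remaining)
        draft_probs = torch.softmax(draft_logits, dim=-1)
        draft = torch.multinomial(draft_probs, 1).squeeze(-1)
        # 2) Verify: walk the draft in order and keep a self-consistent prefix,
        #    re-scoring each token conditioned on the tokens accepted so far
        #    (speculative-decoding-style accept test; an assumption, the paper's
        #    exact criterion may differ).
        kept = 0
        for i, (pos, tok) in enumerate(zip(remaining, draft.tolist())):
            verify_probs = torch.softmax(model_logits(accepted, [pos])[0], dim=-1)
            ratio = verify_probs[tok] / draft_probs[i, tok].clamp_min(1e-12)
            if torch.rand(()) < ratio.clamp(max=1.0):
                accepted[pos] = tok
                kept += 1
            else:
                break
        if kept == 0:
            # Always make progress: sample the first remaining token from the
            # verified distribution so each round accepts at least one token.
            pos = remaining[0]
            probs = torch.softmax(model_logits(accepted, [pos])[0], dim=-1)
            accepted[pos] = torch.multinomial(probs, 1).item()
    return accepted
```

The point of the scheme is that a whole prefix of the draft can be accepted per round, so simple sequences complete in a handful of passes rather than one pass per token.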
@ArnaudPannatier
Arnaud Pannatier
5 months
GPTs generate sequences in a left-to-right order. Is there another way? With @francoisfleuret and @evanncourdier , in partnership with @SkysoftATM , we developed σ-GPT, capable of generating sequences in any order chosen dynamically at inference time. 1/6
54
247
2K
@ArnaudPannatier
Arnaud Pannatier
8 months
Diffusion models are surprisingly good at solving algorithmic tasks. With @francoisfleuret and @evanncourdier , we use discrete diffusion to find shortest paths in mazes represented as images. 1/5
Tweet media one
23
127
741
@ArnaudPannatier
Arnaud Pannatier
4 months
The code for σ-GPT is out! Feel free to experiment with it here: It extends the @francoisfleuret picoclvr (non-NLP tasks) and @karpathy nanoGPT (text) codebases, with an additional KV cache from IRIS by @micheli_vincent and @EloiAlonso1 . (links below)
3
20
131
@ArnaudPannatier
Arnaud Pannatier
5 months
This opens up a multitude of possibilities, such as: (A) conditioning on any subset of tokens from a sequence, or (B) estimating the probability of all the remaining tokens at once. 3/6
Tweet media one
1
6
116
@ArnaudPannatier
Arnaud Pannatier
5 months
We only need to shuffle the input sequence and use a double positional encoding, so that the model knows which token to generate at each position of the output sequence. 2/6
Tweet media one
2
6
101
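A minimal sketch of the double positional encoding described above, under my reading of the tweet (the released code may differ): each shuffled input token is embedded together with its own original position and with the position of the token the model must produce next.

```python
import torch
import torch.nn as nn

class DoublePositionalEmbedding(nn.Module):
    """Sketch: token embedding + position of the current token + position to generate next."""

    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos_current = nn.Embedding(max_len, d_model)  # where this token sits
        self.pos_target = nn.Embedding(max_len, d_model)   # which position to generate next

    def forward(self, tokens, order):
        # tokens: (B, T) token ids already shuffled according to `order`
        # order:  (B, T) original positions of those tokens
        current = order
        target = torch.roll(order, shifts=-1, dims=1)  # position the model must fill next;
        # the wrapped-around value at the last slot is a placeholder (no prediction needed there)
        return self.tok(tokens) + self.pos_current(current) + self.pos_target(target)
```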
@ArnaudPannatier
Arnaud Pannatier
5 months
With the advantage that simple tasks can be fully completed in a few steps. 6/6 (with a diffusion comparison)
1
2
95
@ArnaudPannatier
Arnaud Pannatier
5 months
(A) is done by putting the conditioning tokens “first”, and (B) is done by putting all the remaining tokens as “the next”. The latter requires carving the attention matrix. 4/6
Tweet media one
1
3
71
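One possible way to "carve" the attention matrix for case (B), assuming the conditioning tokens come first, followed by one query slot per remaining position: queries see all conditioning tokens and themselves, but not each other. This is an illustration of the idea, not the paper's exact mask.

```python
import torch

def carve_mask(n_cond, n_query):
    """Boolean attention mask (True = attention allowed) for scoring all remaining
    tokens in a single pass, given a conditioning prefix."""
    n = n_cond + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)
    # conditioning tokens attend causally among themselves
    mask[:n_cond, :n_cond] = torch.tril(torch.ones(n_cond, n_cond)).bool()
    # each query slot sees every conditioning token...
    mask[n_cond:, :n_cond] = True
    # ...and only itself among the query slots
    mask[n_cond:, n_cond:] = torch.eye(n_query, dtype=torch.bool)
    return mask
```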
@ArnaudPannatier
Arnaud Pannatier
8 months
We use discrete diffusion with a U-Net. The maze with the start and goal is encoded in one channel, and the model denoises the maze with a solution in another channel. U-Net: Discrete diffusion: 3/5
2
5
61
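A sketch of the two-channel encoding the tweet describes, under assumed conventions for the cell values (the actual codebase may encode the grids differently).

```python
import torch

def encode_sample(maze, path):
    # maze: (H, W) integer grid, assumed convention: 0=empty, 1=wall, 2=start, 3=goal
    # path: list of (row, col) cells forming the solution
    maze = torch.as_tensor(maze, dtype=torch.long)
    h, w = maze.shape
    x = torch.zeros(2, h, w, dtype=torch.long)
    x[0] = maze                    # channel 0: the maze with start and goal
    for r, c in path:
        x[1, r, c] = 1             # channel 1: the solution path to be denoised
    return x                       # input/target pair layout for the discrete diffusion model
```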
@ArnaudPannatier
Arnaud Pannatier
8 months
It works beautifully with much larger mazes. 4/5
1
0
56
@ArnaudPannatier
Arnaud Pannatier
8 months
To estimate the denoising step p(x_{t-1} | x_t), the algorithm estimates p(x_0 | x_t). Visualizing this estimate during the process (bottom row) shows the "current hypotheses", which eventually concentrate on the result. 5/5
6
0
53
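For reference, this is the standard discrete-diffusion (D3PM-style) parameterization, which is presumably what the tweet refers to: the learned estimate of x_0 is combined with the known forward-process posterior.

```latex
% D3PM-style reverse step (assumed to be the parameterization meant in the tweet):
p_\theta(x_{t-1} \mid x_t) \;=\; \sum_{x_0} q(x_{t-1} \mid x_t, x_0)\, p_\theta(x_0 \mid x_t)
```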
@ArnaudPannatier
Arnaud Pannatier
5 months
@mrsiipa @francoisfleuret @EvannCourdier @SkysoftATM It's in preparation and should be out by the end of next week (we are preparing it for the camera-ready version of the paper). I'll ping you here once it's out.
2
0
40
@ArnaudPannatier
Arnaud Pannatier
8 months
Each maze is generated by iteratively adding horizontal and vertical walls. A starting location and a goal location are picked at random, and one path is sampled uniformly among the shortest paths computed with an exact algorithm from the starting point to the goal. 2/5
Tweet media one
1
0
35
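A rough sketch of the path-sampling step (not the picoclvr code): breadth-first search gives exact shortest distances, and walking back from the goal with random tie-breaking picks one shortest path. Note that per-step uniform tie-breaking is not exactly uniform over all shortest paths; a properly uniform sampler would weight predecessors by path counts.

```python
import random
from collections import deque

def sample_shortest_path(free, start, goal):
    # free: set of (row, col) cells that are not walls
    dist = {start: 0}
    q = deque([start])
    while q:                                           # BFS: exact shortest distances
        r, c = q.popleft()
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in free and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                q.append(nxt)
    if goal not in dist:
        return None                                    # goal unreachable
    path, cur = [goal], goal
    while cur != start:                                # walk back along decreasing distance
        r, c = cur
        preds = [p for p in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                 if dist.get(p, -1) == dist[cur] - 1]
        cur = random.choice(preds)                     # random tie-break among predecessors
        path.append(cur)
    return path[::-1]
```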
@ArnaudPannatier
Arnaud Pannatier
8 months
@samsja19 @francoisfleuret @EvannCourdier Thanks a lot! Surprisingly, in our experiments LLMs were also able to solve the problem with near-perfect accuracy, even though we had to flatten the inputs in raster-scan order for the transformers. The only drawback is that they do not scale easily to larger sizes due to their quadratic cost.
2
0
28
@ArnaudPannatier
Arnaud Pannatier
6 months
@TmlrPub Transformers are great at extrapolating from sparse data. In our latest TMLR paper, in partnership with @SkysoftATM , @francoisfleuret , Kyle Matoba and I modified a ViT-like model to encode sparse context data and to generate reliable conditional predictions anywhere in space. 1/5
Tweet media one
1
2
24
@ArnaudPannatier
Arnaud Pannatier
5 months
@jstock37 @francoisfleuret @EvannCourdier @SkysoftATM I agree that it can be limiting in a way, but it can also be useful for the model: fixing some tokens might give the overall structure of the text it wants to generate, and then it only has to fill in between.
3
0
19
@ArnaudPannatier
Arnaud Pannatier
3 years
Really proud to announce that our paper on high-altitude wind nowcasting with @francoisfleuret and R. Picatoste was accepted for publication at #SIAMSDM22 ! Many thanks to the reviewers for their valuable remarks! Pre-print:
1
5
18
@ArnaudPannatier
Arnaud Pannatier
5 months
@Pierre653878019 @francoisfleuret I don't really know about the brain, but it was interesting to see that the model, before even starting the autoregression, already has a good idea of the solution it wants to generate.
Tweet media one
1
1
18
@ArnaudPannatier
Arnaud Pannatier
5 months
@cavemanloverboy @francoisfleuret @EvannCourdier @SkysoftATM The ordering is indeed random. It's hard to learn an order directly, as we don't have a "ground truth" for the order. But actually, in the rejection sampling scheme there is something along these lines:
7
0
17
@ArnaudPannatier
Arnaud Pannatier
5 months
@ChrSzegedy @francoisfleuret @EvannCourdier @SkysoftATM The objective is similar: they both model shuffled sequences. The implementation differs: XLNet uses masking, while we use a double positional encoding to give the model the position of the next token. The burst-sampling scheme is new. (More details in the related works.)
1
0
7
@ArnaudPannatier
Arnaud Pannatier
5 months
@Xidong_Feng @francoisfleuret @EvannCourdier @SkysoftATM The maze environment is already out there: see maze (and plenty of other cool experiments)
1
1
7
@ArnaudPannatier
Arnaud Pannatier
5 months
@cavemanloverboy @francoisfleuret @EvannCourdier @SkysoftATM We generate all the remaining tokens at the same time, then pick N different orders (N is a hyperparameter) and keep the order that validates the most tokens. So if there is a preferred order, it might be selected there.
0
0
7
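A hedged sketch of the order-selection step described in the reply above: the draft is verified under N candidate orders and the order validating the longest prefix is kept. `accept(order)` is a hypothetical helper returning how many leading drafted tokens pass the rejection test under that order.

```python
import random

def pick_best_order(positions, accept, n_orders=8):
    """positions: list of remaining positions; accept: callable order -> number of
    drafted tokens validated under that order."""
    best_order, best_kept = None, -1
    for _ in range(n_orders):
        order = random.sample(positions, len(positions))  # one candidate ordering
        kept = accept(order)                              # tokens validated under this order
        if kept > best_kept:
            best_order, best_kept = order, kept
    return best_order, best_kept
```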
@ArnaudPannatier
Arnaud Pannatier
5 months
@oameyus @francoisfleuret @EvannCourdier @SkysoftATM If you need infilling, for example. And the burst sampling with rejection allows generating the whole sequence in a few steps instead of one token at a time.
0
0
7
@ArnaudPannatier
Arnaud Pannatier
6 months
@TmlrPub @SkysoftATM @francoisfleuret The goal of this project is to exploit live wind measurements from airplanes to generate reliable short-term forecasts. As airplanes move along sparse flight routes, we need an architecture that can handle unstructured sets of data. 2/5
Tweet media one
1
0
6
@ArnaudPannatier
Arnaud Pannatier
5 months
@LogicBombe @francoisfleuret @EvannCourdier @SkysoftATM In all applications, it converged to the same performance as a standard GPT, but at a higher compute cost.
1
0
6
@ArnaudPannatier
Arnaud Pannatier
5 months
@ffaebi @francoisfleuret @EvannCourdier @SkysoftATM Here, we don't resample tokens once they are generated. But it is absolutely possible to remove any token you want during generation. It should also be possible to use the model to highlight tokens that are unlikely given the partial generation but we did not investigate it yet.
1
0
6
@ArnaudPannatier
Arnaud Pannatier
8 months
@_jasonliu_ @francoisfleuret @EvannCourdier Probably, I guess as for other deep learning tasks, the model will perform better on samples that are close to what it has seen in its training set.
1
1
5
@ArnaudPannatier
Arnaud Pannatier
8 months
@lemergenz @samsja19 @francoisfleuret @EvannCourdier Even with Flash Attention, large mazes are 128x192 ≃ 24k tokens, which is too much for our GPUs. But it could work with standard techniques (AR in the latent space, for example).
0
0
5
@ArnaudPannatier
Arnaud Pannatier
8 months
@Walley7777 @francoisfleuret @EvannCourdier They are slower and can sometimes even output the wrong path. But comparing with traditional algorithms is not the point, as they are exact and use particular data structures. Here, the model must understand the problem only by looking at pairs of images (empty maze / solved maze).
0
0
5
@ArnaudPannatier
Arnaud Pannatier
5 months
1
0
4
@ArnaudPannatier
Arnaud Pannatier
6 months
@TmlrPub @SkysoftATM @francoisfleuret These problems usually use encoder-decoder models: the encoder processes the context measurements, and the decoder is then queried with positions and outputs conditional forecasts. We simplified that and used a single encoder stack, leading to a simple and robust model. 3/5
Tweet media one
1
0
4
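A schematic of the single-encoder-stack idea described in this thread (my reading of the tweets, not the paper's released code): context measurements and query positions are embedded into one token set, passed through a plain Transformer encoder, and the outputs at the query tokens are read off as conditional forecasts.

```python
import torch
import torch.nn as nn

class SparseForecaster(nn.Module):
    """Sketch: one encoder stack over context measurements plus query positions."""

    def __init__(self, d_in, d_query, d_model=128, n_layers=4, n_heads=4, d_out=2):
        super().__init__()
        self.embed_ctx = nn.Linear(d_in, d_model)      # position + measured value
        self.embed_query = nn.Linear(d_query, d_model)  # position only
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, d_out)

    def forward(self, context, queries):
        # context: (B, N_ctx, d_in), queries: (B, N_q, d_query)
        tokens = torch.cat([self.embed_ctx(context), self.embed_query(queries)], dim=1)
        h = self.encoder(tokens)                        # every token attends to the full context
        return self.head(h[:, context.shape[1]:])       # forecasts at the query positions
```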
@ArnaudPannatier
Arnaud Pannatier
8 years
Worth a look: a 100-body simulation in JavaScript based on the Barnes-Hut algorithm
0
2
4
@ArnaudPannatier
Arnaud Pannatier
5 months
@aristot_3rd @francoisfleuret @EvannCourdier @SkysoftATM It can generate sequences in any order, so if you want to use it left-to-right, that's still possible. Any trick that works for GPT should work here. But even without left-to-right generation, it is still possible to prompt with anything you want (CoT included).
0
0
3
@ArnaudPannatier
Arnaud Pannatier
8 months
@JoshPurtell Thanks for the reference, we'll look into it. Here we did it in a self-supervised manner rather than with RL, but there seem to be similarities with the way models can correct partially incorrect representations.
0
0
2
@ArnaudPannatier
Arnaud Pannatier
7 years
@AbsolvoCom Nice work! Hats off!
0
0
3
@ArnaudPannatier
Arnaud Pannatier
3 years
@ogrisel @francoisfleuret @_florianmai @FabioFehr @JamieBHenderson Not yet, but we compared both models in the appendix and we will add a comparison in the next version of the paper
0
0
3
@ArnaudPannatier
Arnaud Pannatier
6 months
@TmlrPub @SkysoftATM @francoisfleuret The reason this architecture works well is that the attention mechanism lets each query see the whole context without a bottleneck. We showed that this leads to better gradient flow, which helps during training. 4/5
Tweet media one
1
0
3
@ArnaudPannatier
Arnaud Pannatier
5 months
@notnotrishi @ECMLPKDD Sadly no demo yet. The code should be out by the end of next week, I'll keep you posted.
0
0
2
@ArnaudPannatier
Arnaud Pannatier
3 months
@nudelbrot @francoisfleuret @karpathy @micheli_vincent @EloiAlonso1 If you want to gradually move from masking random cells to masking whole columns, I would go for a BERT-like model with a custom masking scheme. That being said, you could still use σ-GPT with a curriculum scheme that starts left-to-right, then switches to columns, and then goes fully random.
1
0
2
@ArnaudPannatier
Arnaud Pannatier
5 months
@IoannisKakogeo1 @francoisfleuret @EvannCourdier @SkysoftATM @SpyrosGidaris Nice! We didn't look at the strength of the signal back-propagated to the model (your fig. 4 is cool). I see that in your set of permutations you used variants of raster-scan order and spirals. Did you try completely random orders at some point?
1
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@PrayagBhakar It's part of @francoisfleuret 's awesome picoclvr suite: see maze, and many other cool experiments there. Descriptions:
0
0
2
@ArnaudPannatier
Arnaud Pannatier
5 months
@jonas_eschmann @francoisfleuret @EvannCourdier @SkysoftATM We actually don't for now. Adding tokens is indeed possible (you only have to add one to the positional encoding of the rest of the sequence). We didn't study this in this work, but it's an interesting future line of work.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@st01014 @francoisfleuret On text, we get generations of similar quality if you have enough diffusion steps. The main advantage over diffusion is that the rejection sampling is dynamic: it will take fewer steps if the generation is simple. More details there
0
0
2
@ArnaudPannatier
Arnaud Pannatier
8 months
@roviro_ @francoisfleuret @EvannCourdier Not directly, but models will be slower and can sometimes even output the wrong path. Comparing with traditional algorithms is not the point, as they are designed to be exact. Here, the model must understand the problem only by looking at pairs of images (empty maze / solved maze).
1
0
2
@ArnaudPannatier
Arnaud Pannatier
3 years
@breuderink @_florianmai @francoisfleuret @FabioFehr @JamieBHenderson Not at the moment but it will be released soon (I will add the link here when it’s done)
1
0
2
@ArnaudPannatier
Arnaud Pannatier
5 months
@giffmana @ECMLPKDD Exactly, it is the motivating task of this project: we were trying to model the ascent/descent of airplanes under control in a low-data regime. We had a repetition problem: the models sometimes output plateauing sequences at the wrong altitude, and the random order helps.
1
0
2
@ArnaudPannatier
Arnaud Pannatier
5 months
@TuanPham672604 @francoisfleuret We did not specifically test this, but I think it should be directly possible by specifying an <end-of-sequence> token at the position you want and letting the model complete in between.
1
0
2
@ArnaudPannatier
Arnaud Pannatier
5 months
@giffmana @ECMLPKDD Since it is a bit niche, we then applied it to more general cases.
1
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@pr0me @francoisfleuret For text, we had to introduce a curriculum learning scheme where we start by giving the model sequences in order and then progressively shuffle. But for the other tasks, no adjustments were needed, just shuffling.
0
0
1
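One possible implementation of that curriculum (an assumption, not the released training code): early in training the order stays left-to-right, and an increasing fraction of positions is shuffled as training progresses.

```python
import torch

def curriculum_order(seq_len, progress):
    """progress in [0, 1]: 0 = fully left-to-right, 1 = fully shuffled.

    Returns a permutation tensor: order[i] is the original position of the token
    fed at slot i; the sequence is reindexed with it before training."""
    order = torch.arange(seq_len)
    n_shuffled = int(progress * seq_len)
    if n_shuffled > 1:
        idx = torch.randperm(seq_len)[:n_shuffled]          # positions allowed to move
        order[idx.sort().values] = idx[torch.randperm(n_shuffled)]  # permute only those
    return order
```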
@ArnaudPannatier
Arnaud Pannatier
8 months
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@NickestOfNicks We are working on an arXiv version of this thread. The full codebase for the discrete diffusion process will be released at the same time. The Unet model we used is the one linked from lucidrains. The maze generation is done with:
1
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@yar_vol @francoisfleuret In this paper, we do not replace tokens. However, it is straightforward to manually remove tokens during sampling. You can also re-evaluate the partially generated sequence under another order and remove tokens as you want (but we don't study this in the paper).
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 years
@MacPaw 23 GB cleaned, 2 GB free... really? A small design problem... to be fixed! #CleanMyMac3
Tweet media one
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@EmilevanKrieken Exactly, we might try with the absorbing state at some point and see how it affects the results
1
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@albfresco @francoisfleuret The rejection sampling scheme allows generating sequences in a few steps instead of generating tokens one by one.
1
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@basavasagar18 In that specific case, we only have 5 "types" of pixels (empty, wall, start, end, path), so the nature of the data is discrete.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
3 months
@nudelbrot @francoisfleuret @karpathy @micheli_vincent @EloiAlonso1 Yeah, changing the training order will let you use any kind of ordering. But if you want to be able to change it on the fly at generation time, you need to keep a random order during training. Note that σ-GPT does not use masking like BERT; it is causal like a standard GPT.
1
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@arnicas We are working on an arXiv version of this thread. We'll release a project site with more examples, along with the full codebase for the discrete diffusion process at the same time.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@qiqi123_yu @francoisfleuret @EvannCourdier @SkysoftATM You can use it in a fully autoregressive manner, or use our burst-sampling scheme, which generates multiple tokens at every step.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@samsja19 @francoisfleuret @EvannCourdier For reference here are some results with GPT-like models trained on the same task :
@francoisfleuret
François Fleuret
1 year
BTW tried bigger mazes. After two epochs it already holds very well. Note that it does not check the length is optimal, only that it is a proper path (continuous, no spurious path pixel).
Tweet media one
2
1
15
1
0
1
@ArnaudPannatier
Arnaud Pannatier
3 years
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 months
@cgarciae88 @francoisfleuret Hey! We are working on an arXiv version of this thread. The full codebase for the discrete diffusion process will be released at the same time. The Unet model we used is the one linked from lucidrains. The maze generation is done with:
0
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@dr_mike_hammes @francoisfleuret @EvannCourdier @SkysoftATM We did not specifically test that, but indeed we can probably prompt the model at inference time with <sos> and <eos> at the desired positions and let the model complete the rest. And for the first part: prompt with <eos> and see where in the sequence <eos> is most likely.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
8 years
@AbsolvoCom When is the Android version coming?! #multiplatforme
1
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@joshua_saxe @francoisfleuret @EvannCourdier @SkysoftATM We initially thought something along these lines: that it would serve as a kind of data augmentation. But from what we've seen in the experiments, it seems to be the reverse: in a low-data setting, since learning in a shuffled order is harder, the model memorises the data more.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@Cesarfalante @francoisfleuret @EvannCourdier @SkysoftATM We did not explore it but it is indeed a very interesting idea.
@ArnaudPannatier
Arnaud Pannatier
5 months
@ffaebi @francoisfleuret @EvannCourdier @SkysoftATM Here, we don't resample tokens once they are generated. But it is absolutely possible to remove any token you want during generation. It should also be possible to use the model to highlight tokens that are unlikely given the partial generation but we did not investigate it yet.
1
0
6
0
0
1
@ArnaudPannatier
Arnaud Pannatier
5 months
@fouriergalois @francoisfleuret @EvannCourdier @SkysoftATM We didn't. We thought about it, but since training in a random order needs more compute than training left-to-right, it would probably still require a good amount of compute, which was too much for us. And we would still need to add the double positional encoding without breaking the base model.
0
0
1
@ArnaudPannatier
Arnaud Pannatier
4 months
@OfirPress Better late than never!
@ArnaudPannatier
Arnaud Pannatier
5 months
@PrayagBhakar It's part of @francoisfleuret 's awesome picoclvr suite: see maze, and many other cool experiments there. Descriptions:
0
0
2
0
0
1