We combine these two functionalities into a burst-sampling scheme with rejection.
It generates all the remaining tokens in one shot, and keeps only the ones that are self-consistent. This allows sublinear sequence generation similar to diffusion.
5/6
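The burst-sampling loop can be sketched like this (a toy illustration with stand-in `propose` and `accept_prob` functions, not the actual σ-GPT code — the real scheme uses the model's own probabilities for both):

```python
import random

def burst_sample_step(propose, accept_prob, positions, rng):
    """One burst step: propose tokens for every remaining position at once,
    then keep only the leading run that passes the acceptance test.
    `propose(pos)` and `accept_prob(pos, tok)` are hypothetical stand-ins
    for the model's parallel proposal and its self-consistency check."""
    accepted = {}
    for pos in positions:          # positions given in the evaluation order
        tok = propose(pos)
        if rng.random() <= accept_prob(pos, tok):
            accepted[pos] = tok    # self-consistent: keep it
        else:
            break                  # first rejection invalidates the rest
    return accepted
```

A rejected token invalidates everything after it in the chosen order, which is why picking a good evaluation order matters.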
GPTs generate sequences in a left-to-right order. Is there another way?
With
@francoisfleuret
and
@evanncourdier
, in partnership with
@SkysoftATM
, we developed σ-GPT, capable of generating sequences in any order chosen dynamically at inference time.
1/6
Diffusion models are surprisingly good at solving algorithmic tasks.
With
@francoisfleuret
and
@evanncourdier
, we use discrete diffusion to find shortest paths in mazes represented as images.
1/5
The code for σ-GPT is out!
Feel free to experiment with it here:
It extends the
@francoisfleuret
picoclvr (non-nlp tasks) and
@karpathy
nanoGPT (text) codebases, with an additional KV-Cache from IRIS by
@micheli_vincent
and
@EloiAlonso1
.
(links below)
This opens up a multitude of possibilities, such as: (A) conditioning on any subset of tokens from a sequence, or (B) estimating the probability of all the remaining tokens at once.
3/6
We only need to shuffle the input sequence and use a double positional encoding, so that the model is informed of which token to generate at each position of the output sequence.
2/6
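The training inputs can be sketched as follows (a toy version with plain lookup tables instead of learned embedding layers; names are illustrative, not from the σ-GPT codebase):

```python
import numpy as np

def shuffled_training_views(seq, order, tok_emb, pos_in_emb, pos_out_emb):
    """At step t the model reads the token at order[t] and must predict the
    token at order[t+1], so each input sums three embeddings:
    token + its own position + the position to generate next."""
    inputs = (tok_emb[[seq[i] for i in order[:-1]]]
              + pos_in_emb[order[:-1]]     # where this token sits
              + pos_out_emb[order[1:]])    # where the next prediction goes
    targets = [seq[i] for i in order[1:]]
    return inputs, targets
```

With the second positional encoding, the model always knows which position it is being asked to fill, whatever the shuffle.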
(A) is done by putting the conditioning tokens “first”, and (B) is done by putting all the remaining tokens as “the next”. The latter requires carving the attention matrix.
4/6
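The attention carving for case (B) can be sketched as a boolean mask (an assumed layout for illustration, not the paper's exact implementation):

```python
import numpy as np

def burst_attention_mask(n_prefix, n_burst):
    """The first n_prefix positions are ordinary causal tokens; the n_burst
    query positions each see the whole prefix but not each other, so all
    remaining tokens can be scored in a single forward pass."""
    n = n_prefix + n_burst
    mask = np.tril(np.ones((n, n), dtype=bool))       # standard causal mask
    burst = slice(n_prefix, n)
    mask[burst, burst] = np.eye(n_burst, dtype=bool)  # carve: no cross-attention among burst queries
    return mask
```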
We use discrete diffusion with a U-net. The maze with the start and goal is encoded in one channel, and the model denoises the maze with a solution in another channel.
Unet:
Discrete diffusion:
3/5
To estimate the denoising step p(x_{t-1} | x_t), the algorithm estimates p(x_0 | x_t). Visualizing this estimate during the process (bottom row) shows the "current hypotheses", which eventually concentrate on the result.
5/5
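The forward corruption that the reverse model has to undo can be sketched in a simplified D3PM-style form (assumed details; the paper's exact noise schedule may differ):

```python
import numpy as np

def uniform_corrupt(x0, keep_prob, num_classes, rng):
    """Forward step of uniform discrete diffusion: each token keeps its
    value with probability keep_prob and is otherwise resampled uniformly.
    The reverse model is trained to recover x0 from such corrupted inputs,
    which is the p(x_0 | x_t) estimate visualized in the bottom row."""
    keep = rng.random(x0.shape) < keep_prob
    noise = rng.integers(0, num_classes, size=x0.shape)
    return np.where(keep, x0, noise)
```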
Each maze is generated by iteratively adding horizontal and vertical walls. A starting location and a goal location are picked at random, and one path is sampled uniformly among the shortest paths from the starting point to the goal, computed with an exact algorithm.
2/5
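The exact solver part can be sketched with a plain BFS (a stand-in for the exact algorithm mentioned above; note the authors sample uniformly among *all* shortest paths, while this sketch returns just one of them):

```python
from collections import deque

def shortest_path(walls, start, goal):
    """Exact BFS shortest path on a grid maze; `walls` is a set of blocked
    (row, col) cells. Returns the list of cells from start to goal."""
    h = max(r for r, _ in walls | {start, goal}) + 1
    w = max(c for _, c in walls | {start, goal}) + 1
    prev, frontier = {start: None}, deque([start])
    while frontier:
        cell = frontier.popleft()
        if cell == goal:                      # reconstruct by walking back
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < h and 0 <= nxt[1] < w and nxt not in walls and nxt not in prev:
                prev[nxt] = cell
                frontier.append(nxt)
    return None
```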
@samsja19
@francoisfleuret
@EvannCourdier
Thanks a lot!
Surprisingly, in our experiments, LLMs were also able to solve the problem with near-perfect accuracy, even though we had to flatten inputs in a raster-scan order for the transformers. The only drawback is that this does not scale easily to larger sizes due to their quadratic cost.
@TmlrPub
Transformers are great at extrapolation from sparse data.
In our last TMLR paper, in partnership with
@SkysoftATM
,
@francoisfleuret
, Kyle Matoba and I modified a ViT-like model to encode sparse context data and to generate reliable conditional predictions anywhere in space.
1/5
@jstock37
@francoisfleuret
@EvannCourdier
@SkysoftATM
I agree that it can be limiting in a way, but it can also be useful for the model: fixing some tokens might give the overall structure of the text it wants to generate, and then it only has to fill in between.
Really proud to announce that our paper on High-Altitude Windnowcasting with
@francoisfleuret
and R. Picatoste was accepted for publication at
#SIAMSDM22
!
Many thanks to the reviewers for their valuable remarks!
Pre-print:
@Pierre653878019
@francoisfleuret
I don't really know about the brain, but it was interesting to see that the model, before even starting the autoregression, already has a good idea of the solution it wants to generate.
@cavemanloverboy
@francoisfleuret
@EvannCourdier
@SkysoftATM
The ordering is indeed random. It's hard to directly learn an order, as we don't have a "ground truth" for it.
But actually, in the rejection sampling scheme, there is something along these lines:
@ChrSzegedy
@francoisfleuret
@EvannCourdier
@SkysoftATM
The objective is similar: they both model shuffled sequences. The implementation differs: XLNet uses masking, whereas we use a double positional encoding to give the model the position of the next token. The burst-sampling scheme is new. (More details in the related works)
@cavemanloverboy
@francoisfleuret
@EvannCourdier
@SkysoftATM
We generate all the remaining tokens at the same time, then pick N different orders (N is a hyperparameter) and keep the order that validates the most tokens. So if there is a preferred order, it might be selected there.
@TmlrPub
@SkysoftATM
@francoisfleuret
The goal of this project is to exploit live wind measurements from airplanes to generate reliable short term forecasts.
As airplanes are moving along sparse flight routes, we need an architecture that can handle unstructured sets of data.
2/5
@ffaebi
@francoisfleuret
@EvannCourdier
@SkysoftATM
Here, we don't resample tokens once they are generated. But it is absolutely possible to remove any token you want during generation.
It should also be possible to use the model to highlight tokens that are unlikely given the partial generation but we did not investigate it yet.
@_jasonliu_
@francoisfleuret
@EvannCourdier
Probably. I guess, as for other deep learning tasks, the model will perform better on samples that are close to what it has seen in its training set.
@lemergenz
@samsja19
@francoisfleuret
@EvannCourdier
Even with Flash Attention, large mazes are 128x192 ≃ 24k tokens, which is too much for our GPUs. But it could work with standard techniques (AR in the latent space, for example).
@Walley7777
@francoisfleuret
@EvannCourdier
They are slower and can sometimes even output the wrong path. But comparing with traditional algorithms is not the point as they are exact and use particular data structures. Here, the model must understand the problem only by looking at pairs of images (empty maze/solved maze).
@TmlrPub
@SkysoftATM
@francoisfleuret
These problems usually use encoder-decoder models. The encoder processes the context measurements and the decoder is then queried with positions and outputs conditional forecasts.
We simplified that and used a single encoder stack, leading to a simple and robust model.
3/5
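The token assembly for that single encoder stack can be sketched like this (assumed layout for illustration; the actual model details are in the paper):

```python
import numpy as np

def build_encoder_input(ctx_pos, ctx_val, query_pos, placeholder):
    """Context measurements become (position, value) tokens, query locations
    become (position, placeholder) tokens, and one encoder stack attends
    over the concatenation so every query sees the full context directly."""
    ctx = np.concatenate([ctx_pos, ctx_val], axis=-1)
    pad = np.broadcast_to(placeholder, (len(query_pos), len(placeholder)))
    qry = np.concatenate([query_pos, pad], axis=-1)
    return np.concatenate([ctx, qry], axis=0)  # one joint token sequence
```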
@aristot_3rd
@francoisfleuret
@EvannCourdier
@SkysoftATM
It can generate sequences in any order, so if you want to use it left-to-right, that is still possible. Any trick that works for GPT should work here.
But even without left-to-right generation, it is still possible to prompt with anything you want (CoT included).
@JoshPurtell
Thanks for the reference, we'll look into it. Here we did it in a self-supervised manner rather than with RL, but there seem to be similarities with the way models can correct partially incorrect representations.
@TmlrPub
@SkysoftATM
@francoisfleuret
The reason why this architecture works well is that the attention mechanism allows each query to see the whole context without bottleneck. We showed that this leads to a better gradient flow, which helps during training.
4/5
@nudelbrot
@francoisfleuret
@karpathy
@micheli_vincent
@EloiAlonso1
If you want to gradually mask from random cells to whole columns, I would go for a BERT-like model with a custom masking scheme.
That being said, you could still use σ-GPT with a curriculum scheme that starts left-to-right, then switches to columns, and then goes fully random.
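One way such a curriculum could look (a toy schedule, not the paper's exact one): shuffle a growing fraction of positions as training progresses.

```python
import random

def curriculum_order(n, progress, rng):
    """Start from the identity (left-to-right) order and shuffle a growing
    fraction of positions; progress=0 is fully ordered, progress=1 is a
    fully random permutation."""
    order = list(range(n))
    k = round(progress * n)          # how many positions take part
    idx = rng.sample(range(n), k)    # which positions to shuffle
    vals = [order[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        order[i] = v
    return order
```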
@IoannisKakogeo1
@francoisfleuret
@EvannCourdier
@SkysoftATM
@SpyrosGidaris
Nice! We didn't look at the strength of the signal back-propagated to the model (your fig.4 is cool).
I see that in your set of different permutations you used variants of raster-scan order and spirals.
Did you try completely random orders at some point?
@jonas_eschmann
@francoisfleuret
@EvannCourdier
@SkysoftATM
We actually don't for now. Adding tokens is indeed possible (you only have to add one to the positional encoding of the rest of the sequence). We didn't study this in this work, but it's an interesting line of future work.
@st01014
@francoisfleuret
On text, we have generations of similar quality if you have enough diffusion steps.
The main advantage over diffusion is that the rejection sampling is dynamic: it will do fewer steps if the generation is simple.
More details there
@roviro_
@francoisfleuret
@EvannCourdier
Not directly, but models will be slower and can sometimes even output the wrong path. Comparing with traditional algorithms is not the point as they are conceived to be exact. Here, the model must understand the problem only by looking at pairs of images (empty maze/solved maze).
@giffmana
@ECMLPKDD
Exactly, it is the motivating task of this project: we were trying to model ascent/descent of airplanes under control in a low-data regime. We had a repetition problem where the model sometimes outputs plateauing sequences at the wrong altitude, and the random order helps.
@TuanPham672604
@francoisfleuret
We did not specifically test this. But I think it should be directly possible by specifying an <end-of-sequence> token at the position you want and letting the model complete in between.
@pr0me
@francoisfleuret
For the text, we had to introduce a curriculum learning scheme where we start by giving the model sequences in order and then progressively shuffle them. But for the other tasks no adjustments were needed, just shuffling.
@NickestOfNicks
We are working on an arXiv version of this thread.
The full codebase for the discrete diffusion process will be released at the same time.
The Unet model we used is the one linked from lucidrains.
The maze generation is done with:
@yar_vol
@francoisfleuret
In this paper, we do not replace tokens. However, it is straightforward to manually remove tokens during sampling. And you can also reevaluate the partially generated sequence under another order and remove tokens as you want (but we don't study this in the paper).
@basavasagar18
In that specific case, we only have 5 "types" of pixels (empty, wall, start, end, path), so the nature of the data is discrete.
@nudelbrot
@francoisfleuret
@karpathy
@micheli_vincent
@EloiAlonso1
Yeah, changing it will let you use any kind of ordering. But if you want to be able to change on the fly at generation time, you need to keep a random order during training.
Note that σ-GPT does not use masking like BERT; it is causal like a standard GPT.
@arnicas
We are working on an arXiv version of this thread. We'll release a project site with more examples, along with the full codebase for the discrete diffusion process at the same time.
BTW, we tried bigger mazes. After two epochs it already holds up very well.
Note that it does not check that the length is optimal, only that it is a proper path (continuous, no spurious path pixels).
@cgarciae88
@francoisfleuret
Hey! We are working on an arXiv version of this thread. The full codebase for the discrete diffusion process will be released at the same time. The Unet model we used is the one linked from lucidrains. The maze generation is done with:
@dr_mike_hammes
@francoisfleuret
@EvannCourdier
@SkysoftATM
We did not specifically test that, but indeed we can probably prompt the model at inference time with <sos> and <eos> at the desired positions and let the model complete the rest.
And for the first part: prompt with <eos> and see where in the sequence <eos> is most likely.
@joshua_saxe
@francoisfleuret
@EvannCourdier
@SkysoftATM
We initially thought something along these lines: that it would serve as a kind of data augmentation. But from what we've seen in the experiments, it seems that it's the reverse: in a low-data setting, as learning in a shuffled order is harder, the model memorises the data more.
@fouriergalois
@francoisfleuret
@EvannCourdier
@SkysoftATM
We didn't. We thought about it, but as training in a random order needs more compute than left-to-right, it would probably still require a good amount of compute, which was too much for us. And we would still need to add the double positional encoding without breaking the base model.