Olivier Hénaff

@olivierhenaff

2,005 Followers
242 Following
43 Media
151 Statuses

Senior Staff Research Scientist @GoogleDeepMind, interested in active, multimodal, and memory-augmented learning. Formerly @NYU_CNS and @Polytechnique

London, UK
Joined May 2019
Pinned Tweet
@olivierhenaff
Olivier Hénaff
3 months
Thrilled to announce our latest work on active data curation: joint example selection (JEST) drastically accelerates large-scale multimodal pretraining, surpassing the previous SoTA (SigLIP) with 10x fewer iterations and FLOPs:
5
49
290
@olivierhenaff
Olivier Hénaff
10 months
So excited to announce what we've been working on for the past ~year: Active Learning Accelerates Large-Scale Visual Understanding. We show that model-based data selection efficiently and effectively speeds up classification and multimodal pretraining by up to 50%
9
76
384
@olivierhenaff
Olivier Hénaff
2 years
Self-supervised representation learning is greatly facilitated by the knowledge of objects and their layouts in real-world scenes. Rather than hard-coding these priors, with our new method Odin we found that objects can be discovered from the learned representations themselves.
3
50
327
@olivierhenaff
Olivier Hénaff
1 year
In-context learning has revolutionized NLP, yet computer vision still relies on the cumbersome pre-train + fine-tune protocol. Introducing Hummingbird, a model capable of in-context scene understanding with a remarkably simple mechanism: memory retrieval.
8
57
271
@olivierhenaff
Olivier Hénaff
3 years
What makes self-supervised learning from uncurated data so challenging? We identify the heavy-tailed distribution of image content as a limiting factor and address it with Divide and Contrast, leading to big gains in data-efficient and transfer learning.
2
60
217
@olivierhenaff
Olivier Hénaff
3 years
How does the brain make predictions about its environment? We found primary visual cortex to straighten the temporal trajectories of natural videos, facilitating their extrapolation.
2
52
178
@olivierhenaff
Olivier Hénaff
3 years
Self-supervised learning has made tremendous gains for simple visual tasks like recognizing objects, but interacting with the environment requires detecting objects, parsing them, and understanding their geometry. Contrastive detection greatly facilitates these tasks.
1
39
175
@olivierhenaff
Olivier Hénaff
4 years
How does the brain keep track of its uncertainty? We found neural gain variability to represent stimulus uncertainty across the visual hierarchy, and explain these findings with a stochastic variant of divisive normalization:
1
53
161
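To make the model concrete, here is a minimal numpy sketch of a divisive-normalization response with a stochastic multiplicative gain, under which trial-to-trial gain variability can track stimulus uncertainty. The lognormal gain and all parameter values are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_divisive_normalization(drive, sigma=1.0, n=2.0, gain_std=0.2):
    """Divide each neuron's rectified drive by the pooled population activity,
    then apply a stochastic multiplicative gain (illustrative parameters)."""
    pooled = sigma**n + np.sum(np.abs(drive)**n)      # normalization pool
    gain = rng.lognormal(mean=0.0, sigma=gain_std)    # trial-to-trial gain
    return gain * np.abs(drive)**n / pooled

# Responses of 10 model neurons to one stimulus, across repeated trials:
drive = rng.normal(size=10)
trials = np.stack([stochastic_divisive_normalization(drive) for _ in range(100)])
print(trials.mean(axis=0), trials.var(axis=0))
```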
@olivierhenaff
Olivier Hénaff
8 months
Humans and animals reason about events spanning days, weeks, and years, yet current CV systems live largely in the present. Introducing Memory-Consolidated ViT, whose context extends far into the past and sets a new SOTA in long-video understanding with a 10x smaller model
5
27
172
@olivierhenaff
Olivier Hénaff
1 year
With the advent of AI assistants, humans depend on their outputs at an unprecedented rate. It is vital that these models be aligned with human abilities, judgements, and preferences. Self-supervised video pretraining yields a big step towards human-aligned visual representations
2
31
163
@olivierhenaff
Olivier Hénaff
10 months
Thrilled to announce that we have an opening for a Student Researcher to come work with us at @GoogleDeepMind! If you’re interested in multimodal learning, in-context adaptation, memory-augmented perception, or active learning, do consider applying:
3
37
149
@olivierhenaff
Olivier Hénaff
5 years
Very happy to share our latest unsupervised representation learning work! In addition to SOTA linear classification, we beat supervised networks on ImageNet with 2-5x fewer labels and transfer to PASCAL detection better than supervised pre-training.
1
24
106
@olivierhenaff
Olivier Hénaff
2 years
Understanding the world by watching it evolve over time might be the holy grail of self-supervised learning. Yet learning from videos has always lagged behind learning from images. Until now!
1
13
102
@olivierhenaff
Olivier Hénaff
3 years
It's a great pleasure to announce that the art installation I've been working on will be exhibited at the Florence Trust in London. If you're interested in neural representations, memory/reality, or an interactive audio-visual experience, come check it out!
2
3
48
@olivierhenaff
Olivier Hénaff
4 years
Are we done with ImageNet? Yes: we found the original labels to no longer be the best predictors of human preferences. But no: with our new label-set, we have removed much of the bias of the original, creating a better benchmark for future research.
1
6
39
@olivierhenaff
Olivier Hénaff
8 months
There's been lots of great work in long-context video understanding, but most of it comes at the cost of greater conceptual & computational complexity. We instead opted for an ultra-minimalist approach, letting the model attend to past activations consolidated into a memory bank
1
3
35
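A reply further down this timeline notes that consolidation can be as simple as k-means or coreset selection. Here is a minimal numpy sketch along those lines: compress past activations into a small memory bank, then cross-attend to it from the current chunk. The plain k-means, projection-free attention, and shapes are illustrative assumptions, not the exact MC-ViT architecture.

```python
import numpy as np

def consolidate_memory(past_tokens, k=64, iters=10, seed=0):
    """Compress past activations (n_tokens, dim) into k memories via k-means."""
    rng = np.random.default_rng(seed)
    centroids = past_tokens[rng.choice(len(past_tokens), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(past_tokens[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)              # nearest centroid per token
        for j in range(k):
            members = past_tokens[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def attend_to_memory(queries, memory):
    """Dot-product cross-attention from current-chunk queries to the memory
    bank (no learned projections, for brevity)."""
    logits = queries @ memory.T / np.sqrt(queries.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ memory

rng = np.random.default_rng(1)
past = rng.normal(size=(1024, 32))     # activations from earlier video chunks
memory = consolidate_memory(past)      # 1024 tokens -> 64 consolidated memories
out = attend_to_memory(rng.normal(size=(16, 32)), memory)
print(memory.shape, out.shape)         # (64, 32) (16, 32)
```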
@olivierhenaff
Olivier Hénaff
10 months
To do this we developed a framework that allows a model to continuously filter the data it will learn from. In it, actors consume raw data, score it, and send it to the learner using prioritized replay. Of the criteria we tried, example difficulty and learnability worked well...
1
1
34
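A minimal sketch of this loop, assuming the two criteria are computed as difficulty = learner loss and learnability = learner loss minus a reference model's loss (the definition given later in the multimodal thread). The losses are random placeholders, and a bounded heap stands in for prioritized replay.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)

def difficulty(learner_loss):
    """Prefer hard examples: the learner's own loss."""
    return learner_loss

def learnability(learner_loss, reference_loss):
    """Prefer learnable examples: high learner loss, low reference-model loss."""
    return learner_loss - reference_loss

# Actors stream raw data, score it, and keep a bounded priority queue that
# the learner samples from (a stand-in for prioritized replay).
replay, capacity = [], 256
for example_id in range(10_000):
    learner_loss = rng.exponential(1.0)        # placeholder losses
    reference_loss = rng.exponential(0.5)
    score = learnability(learner_loss, reference_loss)
    if len(replay) < capacity:
        heapq.heappush(replay, (score, example_id))
    else:
        heapq.heappushpop(replay, (score, example_id))  # evict lowest-scoring

batch = heapq.nlargest(32, replay)  # the learner trains on top-scoring examples
print(batch[0])
```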
@olivierhenaff
Olivier Hénaff
10 months
Wondering what self-supervised learning from videos can do for your visual representations? Come find out at our #NeurIPS2023 poster #920, from 5-7pm tonight!
@nikparth1
Nikhil Parthasarathy
1 year
Excited to share that our work on learning more human-aligned, robust visual representations via video pretraining has been accepted to #NeurIPS2023! Thanks so much to my collaborators @olivierhenaff @arkitus and @joaocarreira! Stay tuned for the updated camera-ready version!
0
7
56
0
5
32
@olivierhenaff
Olivier Hénaff
10 months
This means that our method accelerates learning not only in terms of learner iterations, but also in terms of the total computation of the actors + learner. This is, to our knowledge, the first example of such "compute-positive" active learning
1
2
27
@olivierhenaff
Olivier Hénaff
2 years
Odin learns object-level features with a contrastive objective and approximate image masks. It discovers objects by clustering those features, and feeds the resulting masks back into the learning objective, engaging in a virtuous cycle of representation and segmentation quality.
1
2
21
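A minimal sketch of the discovery step, assuming plain k-means over a dense feature map (sklearn's KMeans here; Odin's actual clustering and feature details may differ). The mask-pooled vectors are what would feed back into the contrastive objective.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_masks(feature_map, n_objects=8):
    """Cluster dense features (H, W, D) into approximate object masks."""
    h, w, d = feature_map.shape
    labels = KMeans(n_clusters=n_objects, n_init=10).fit_predict(
        feature_map.reshape(-1, d))
    return labels.reshape(h, w)            # integer mask id per location

def pool_by_mask(feature_map, masks, n_objects=8):
    """Average features inside each discovered mask: object-level vectors."""
    d = feature_map.shape[-1]
    flat, ids = feature_map.reshape(-1, d), masks.reshape(-1)
    return np.stack([flat[ids == j].mean(axis=0) for j in range(n_objects)])

rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 64))      # dense features from one view
masks = discover_masks(feats)              # (14, 14) discovered segmentation
objects = pool_by_mask(feats, masks)       # (8, 64) object-level features
print(masks.shape, objects.shape)
```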
@olivierhenaff
Olivier Hénaff
10 months
However, scoring examples can be ~as expensive as learning from them! We therefore drastically scale down the size of the actors relative to the learner (up to 1000x), making them very cheap to run. Example learnability (but not difficulty) was very robust to this actor downscaling
1
2
21
@olivierhenaff
Olivier Hénaff
2 years
Finally, Odin requires no prior knowledge about the structure of objects in real-world scenes, raising the possibility of discovering the structure of arbitrary sensory modalities and their combinations. Paper here:
2
1
20
@olivierhenaff
Olivier Hénaff
1 year
Super excited for the @ICCVConference SSL Tutorial!! I'll be speaking about scene understanding, human alignment, and memory retrieval at 4:15pm Paris time; the full line-up is here: , starting soon in P02
0
3
19
@olivierhenaff
Olivier Hénaff
10 months
Hey Ibrahim, thanks! We did a scaling study that showed the effect to be very robust w.r.t. training duration (see below). Also, in the multimodal regime we show nice gains over IID CLIP with decent compute budgets (e.g. ActiveCLIP with 8B examples >> OpenCLIP with 34B examples)
@ibomohsin
Ibrahim Alabdulmohsin | إبراهيم العبدالمحسن
10 months
@olivierhenaff Cool work! Have you plotted the gap vs # of seen examples for very long training durations (>>900M ex)? I'm wondering if active learning is beneficial for short training durations only, but the gap would shrink/vanish eventually (since you rely on small models).
0
0
1
0
1
14
@olivierhenaff
Olivier Hénaff
10 months
All in all this makes me optimistic that large-scale pretraining might become much cheaper, fairer, and more accessible with data-selection criteria that generalize across models and tasks
1
0
13
@olivierhenaff
Olivier Hénaff
10 months
In the multimodal setting, we find our method is highly "steerable": by computing example learnability (the difference in loss between the learner and a "reference" model) with reference models trained on small curated datasets (LTIP), we can leverage much larger, noisier datasets (ALIGN)
1
0
12
@olivierhenaff
Olivier Hénaff
3 years
@fhuszar If training speed is a requirement, I would shamelessly recommend contrastive detection. It converges much faster than SimCLR and BYOL, and leads to SOTA transfer learning. Paper: Code:
0
0
11
@olivierhenaff
Olivier Hénaff
1 year
It was very important to get a few details right, particularly the distribution of videos we learn from. Most video datasets have very imbalanced content. We fixed this with VideoNet, a new dataset with a uniform distribution across a diverse set of classes, similar to ImageNet
1
0
10
@olivierhenaff
Olivier Hénaff
3 years
Contrastive detection works by carving images into pieces in an unsupervised manner, then learning from each simultaneously. This amplifies the learning signal per image, greatly accelerating convergence (up to 10x faster than SimCLR and BYOL).
1
0
9
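A minimal numpy sketch of a per-mask contrastive (InfoNCE) loss of this kind: mask-pooled features from two augmented views, one positive pair per mask, so each image contributes many loss terms. This illustrates the idea, not the exact DetCon objective.

```python
import numpy as np

def infonce_per_mask(obj1, obj2, temperature=0.1):
    """Each mask's pooled feature in view 1 must match the same mask in
    view 2 against all other masks (one positive per mask)."""
    a = obj1 / np.linalg.norm(obj1, axis=1, keepdims=True)
    b = obj2 / np.linalg.norm(obj2, axis=1, keepdims=True)
    logits = a @ b.T / temperature                    # (n_masks, n_masks)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
view1 = rng.normal(size=(16, 128))   # 16 mask-pooled features, view 1
view2 = rng.normal(size=(16, 128))   # same masks under another augmentation
print(infonce_per_mask(view1, view2))
```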
@olivierhenaff
Olivier Hénaff
1 year
Finally, VITO learns to attend to the same parts of images as humans do, in an entirely self-supervised manner, displaying an emergent alignment that is stronger than that of models explicitly trained for this
1
3
9
@olivierhenaff
Olivier Hénaff
9 months
Welcome back Nikhil!!
@nikparth1
Nikhil Parthasarathy
9 months
A bit belated, but happy to share that I defended my PhD thesis! Thanks to my advisor @EeroSimoncelli for all the support along the way! This week I'm excited to start a new chapter as a Research Scientist @GoogleDeepMind building models of vision with @olivierhenaff and others!
4
5
82
0
0
9
@olivierhenaff
Olivier Hénaff
3 years
We train individual self-supervised experts on each of those subsets, then distill them back into a single model. Although this was designed for handling uncurated data, it also leads to big improvements in data-efficient ImageNet classification!
1
0
9
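A minimal PyTorch sketch of the distillation step, assuming a single student regresses the features of whichever expert "owns" each example's subset. The architectures, MSE loss, and random cluster assignments are illustrative placeholders, not the paper's exact setup.

```python
import torch
import torch.nn as nn

dim, n_experts = 64, 4
# Pretrained per-subset experts (frozen in practice; random here).
experts = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
           for _ in range(n_experts)]
student = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, dim)                   # batch of examples
    cluster = torch.randint(n_experts, (32,))  # which subset each came from
    with torch.no_grad():                      # targets from the owning expert
        targets = torch.stack([experts[int(c)](xi) for c, xi in zip(cluster, x)])
    loss = nn.functional.mse_loss(student(x), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```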
@olivierhenaff
Olivier Hénaff
7 months
Super excited to chat with @misssaxbys about brains, machines, vision, memory, and art, and what they can all learn from each other, tomorrow at the @MindstoneHQ AI meetup. Do stop by if you're in London!
0
4
10
@olivierhenaff
Olivier Hénaff
7 months
This was such a good time, and a nice opportunity to reflect on why I think AI research can be used to deepen our understanding of the human experience. Thanks so much @misssaxbys for inviting me and @JoshuaWohle for hosting!
@misssaxbys
Eleanor Warnock
7 months
Discussing memory, art + what AI can teach us about human perception and the mind in a church ⛪️ was certainly a career highlight. A huge thank you to @olivierhenaff for sharing your insights and @JoshuaWohle for having us
1
0
5
0
3
9
@olivierhenaff
Olivier Hénaff
8 months
Finally, when evaluating on long-video understanding benchmarks like EgoSchema and Perception Test, we find MC-ViT to outperform billion-scale VLMs like Flamingo and SeViLa despite using 10x fewer parameters
1
0
8
@olivierhenaff
Olivier Hénaff
3 years
All in all this points to the promise of a much more active form of self-supervised learning. This project was a thoroughly enjoyable collaboration with @YonglongT and @avdnoord!
0
0
8
@olivierhenaff
Olivier Hénaff
8 months
All in all, this raises the possibility of simple architectures freely associating across time and modalities, realizing their potential as multimodal assistants and enriching our everyday experience
1
1
8
@olivierhenaff
Olivier Hénaff
3 years
Contrastive methods benefit a lot from training on ImageNet, which contains many fine-grained classes that are almost perfectly balanced. Uncurated datasets lack this property, but we can approximately recover it by clustering self-supervised representations.
1
1
8
@olivierhenaff
Olivier Hénaff
2 years
Thanks @skandakoppula Evan Shelhamer @DanielZoran_ @drew_jaegle Andrew Zisserman @joaocarreira and @relja_work for the great collaboration. See you all at @eccvconf!
0
0
7
@olivierhenaff
Olivier Hénaff
1 year
In summary, the amazing progress we’ve seen with LLMs is a reminder that, like humans, the most impactful ML systems are ones that are able to generalize across many tasks, rather than specializing to any given one. Hopefully computer vision can follow in this direction
1
0
7
@olivierhenaff
Olivier Hénaff
8 months
Thanks to this, MC-ViT displays excellent scaling behavior when learning from long videos, outperforming joint space-time attention and efficient approximations with a 10x smaller memory footprint
2
0
7
@olivierhenaff
Olivier Hénaff
1 year
Memory retrieval allows a single Hummingbird to perform multiple scene understanding tasks without modification. What else? Hummingbird retrieval is also much more data-efficient and faster to adapt than fine-tuning or linear probing
2
0
7
@olivierhenaff
Olivier Hénaff
3 years
Big thanks to @skandakoppula @jalayrac @avdnoord @OriolVinyalsML @joaocarreira for a thoroughly enjoyable collaboration! :)
1
0
7
@olivierhenaff
Olivier Hénaff
1 year
This was all possible thanks to a fantastic collaboration with @ibalazevic, David Steiner, @nikparth1, and @relja_work, and represents our first foray into combining vision and memory research. Looking forward to more to come!
1
0
7
@olivierhenaff
Olivier Hénaff
10 months
We'll be presenting our #NeurIPS2023 work on in-context scene understanding from 10:45 to 12:45 at poster #810 in Hall B. Come by and say hello!
@ibalazevic
Ivana Balazevic
1 year
✨Excited to announce that Hummingbird, our model for in-context scene understanding, has been accepted as a spotlight at #NeurIPS2023!✨ Work done with fantastic colleagues @GoogleDeepMind: David Steiner, @nikparth1, @relja_work and @olivierhenaff.
2
9
81
0
1
6
@olivierhenaff
Olivier Hénaff
1 year
We quantify alignment with three signatures of human intelligence: generalization across tasks, robustness to perturbations, and consistency with human judgements. Across tasks, our self-supervised video model, VITO, matches specialist models designed specifically for each task
1
1
6
@olivierhenaff
Olivier Hénaff
1 year
Given a trained representation, making inferences about new images is very simple: just provide a prompt of annotated examples + retrieve the labels that are closest to each feature. This makes few assumptions about the labels and works for tasks like segmentation and depth prediction
1
1
6
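A minimal numpy sketch of this non-parametric decoding, assuming cosine similarity and soft voting over the k nearest prompt features; the sizes and voting scheme are illustrative, not Hummingbird's exact mechanism.

```python
import numpy as np

def retrieve_labels(query_feats, prompt_feats, prompt_labels, k=10, n_classes=21):
    """For each query patch feature, average the one-hot labels of its k
    nearest prompt features and return the winning class."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sims = q @ p.T                               # (n_query, n_prompt)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # top-k neighbors per query
    one_hot = np.eye(n_classes)[prompt_labels]   # (n_prompt, n_classes)
    votes = one_hot[nn_idx].mean(axis=1)         # soft label per query patch
    return votes.argmax(axis=1)

rng = np.random.default_rng(0)
prompt_feats = rng.normal(size=(5000, 256))      # annotated prompt patches
prompt_labels = rng.integers(0, 21, size=5000)   # e.g. segmentation classes
query_feats = rng.normal(size=(196, 256))        # patches of a new image
print(retrieve_labels(query_feats, prompt_feats, prompt_labels).shape)
```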
@olivierhenaff
Olivier Hénaff
1 year
You can find our updated tech report here: . Huge congratulations to @nikparth1 for rounding out an amazing internship project!
0
1
6
@olivierhenaff
Olivier Hénaff
3 years
Together, these results raise the possibility of a new generation of learning algorithms that are more widely accessible, alleviating the need for large amounts of human annotation and computation. I’m thrilled that this work was accepted to #ICCV21 as an oral presentation!
1
1
5
@olivierhenaff
Olivier Hénaff
3 years
@DeepMind So excited to see the open-source release of our self-supervised learning algorithm, DetCon. If you're interested in solving hard vision tasks in less time, check it out at
0
1
5
@olivierhenaff
Olivier Hénaff
1 year
How does VITO do all of this? With a very simple self-supervised recipe: learning to track stable and distinctive content in videos with lightweight attention and contrastive learning. Crucially, however...
1
0
5
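A minimal numpy sketch of lightweight attention pooling of the kind mentioned here: a single query vector attends over a frame's spatial features, and the pooled vector would then enter a standard contrastive loss. The single-query, single-head, projection-free form is an illustrative assumption, not VITO's exact pooling head.

```python
import numpy as np

def attention_pool(feature_map, query):
    """Weighted average of spatial features, with weights given by a softmax
    over their dot products with one learned query vector."""
    h, w, d = feature_map.shape
    flat = feature_map.reshape(-1, d)            # (h*w, d)
    logits = flat @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ flat                        # (d,)

rng = np.random.default_rng(0)
feats = rng.normal(size=(14, 14, 256))           # features of one video frame
pooled = attention_pool(feats, rng.normal(size=256))
print(pooled.shape)
```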
@olivierhenaff
Olivier Hénaff
3 years
Previously, we found human observers to also straighten natural videos. Interestingly, this perceptual straightening was about 3x bigger than neural straightening in V1. Might further neural processing be required to match perceptual straightening?
1
0
5
@olivierhenaff
Olivier Hénaff
1 year
The issue is that most pretraining methods (e.g. supervised or MoCo) don’t work well when decoded in this manner (without finetuning). In contrast, Hummingbird uses cross-attention during training to prepare for non-parametric evaluation, yielding large gains over the rest
1
0
5
@olivierhenaff
Olivier Hénaff
1 year
Together, the diversity of VITO's benefits (task generality, robustness, human consistency) and the simplicity of its training recipe suggest that there is a great untapped potential in video pretraining as a paradigm for learning general, human-aligned visual representations
1
0
4
@olivierhenaff
Olivier Hénaff
3 years
We quantified the predictability of natural videos using the curvature of the sequence of frames: in the pixel domain, or as represented by a population of jointly-recorded V1 neurons. Comparing the two, we found curvature to be much smaller in V1!
1
0
4
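The curvature measure can be written in a few lines: the average angle between successive difference vectors of the (flattened) frame sequence, computable either in the pixel domain or on neural population responses. A minimal numpy sketch with random placeholder data:

```python
import numpy as np

def mean_curvature_deg(sequence):
    """Average angle (degrees) between successive difference vectors of a
    trajectory; lower curvature means straighter, more extrapolable."""
    x = sequence.reshape(len(sequence), -1)       # flatten frames or responses
    diffs = np.diff(x, axis=0)
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)
    cos = np.clip(np.sum(diffs[:-1] * diffs[1:], axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

rng = np.random.default_rng(0)
video = rng.normal(size=(11, 32, 32))  # 11 frames -> 10 steps -> 9 angles
print(mean_curvature_deg(video))
```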
@olivierhenaff
Olivier Hénaff
3 years
Finally, we fit a two-stage, linear-nonlinear model to neural responses, and found that it accounted for these effects *provided* we included the nonlinearities. V1 neurons could therefore be using their nonlinear computations to enable prediction in the natural environment.
1
0
4
@olivierhenaff
Olivier Hénaff
10 months
@jimwinkens @arkitus Hey Jim! It does, because the online actor is only ever compared to a reference model of the same (very small) size, not the (much larger) learner. See also () for our scaling study
@olivierhenaff
Olivier Hénaff
10 months
Hey Ibrahim, thanks! We did a scaling study that showed the effect to be very robust w.r.t. training duration (see below). Also, in the multimodal regime we show nice gains over IID CLIP with decent compute budgets (e.g. ActiveCLIP with 8B examples >> OpenCLIP with 34B examples)
0
1
14
0
0
4
@olivierhenaff
Olivier Hénaff
3 years
Huge thanks to the dream team of collaborators Yoon Bai, @julie_charlton_ , Ian Nauhaus, Eero Simoncelli, and @GorisLab for making this possible!!
1
0
4
@olivierhenaff
Olivier Hénaff
1 year
@VictorButoi Hi Victor, thanks! The prompts are the same as the training data used for transfer learning. Since we don't know ahead of time what classes an image will contain, it needs to attend to a prompt containing all classes
1
0
3
@olivierhenaff
Olivier Hénaff
1 year
VITO is also much more robust to image deformations that are likely to occur in the real world, surpassing supervised, self-supervised, and adversarial image pretraining
1
0
3
@olivierhenaff
Olivier Hénaff
10 months
@Le_Zhang7 @GoogleDeepMind Hi Le, thanks for reaching out! The opening is for working with us in the London office, so you should apply for a "BS/MS/PhD placement in EMEA"
0
0
3
@olivierhenaff
Olivier Hénaff
2 years
Congratulations to @nikparth1 on a very successful internship project, and big thanks to @joaocarreira and @arkitus for another enjoyable collaboration!
3
0
2
@olivierhenaff
Olivier Hénaff
2 years
Interestingly, VITO seems to learn to parse real-world scenes by binding together content that co-occurs in time, building semantic correspondences between frames in a video and discovering high-level concepts.
1
1
2
@olivierhenaff
Olivier Hénaff
8 months
@mayfer Yes exactly, with consolidation performed by a very simple non-parametric reconstructive process like k-means or coreset selection. Lots more to explore here!
0
0
2
@olivierhenaff
Olivier Hénaff
3 years
Neural straightening appeared to be specific to natural videos: when recording the responses to artificial videos, we found them to be *more* curved than in the pixel domain.
1
0
2
@olivierhenaff
Olivier Hénaff
2 years
I had a great time speaking at UvA's Deep Vision Seminar, check out the recording if you're curious about self-supervised theories of biological and artificial intelligence, object discovery, and leveraging videos for scene understanding. Thanks @y_m_asano for the invitation!!
@y_m_asano
Yuki
2 years
Happy to share that the QUvA Deep Vision Seminar talk from @olivierhenaff of @DeepMind on "The virtuous cycle of object discovery and representation" is now online: Enjoy😊:
0
2
32
1
1
2
@olivierhenaff
Olivier Hénaff
2 years
Surprisingly, VITO achieves this with a few simple changes to the standard contrastive paradigm: data curation, better augmentations, and attention pooling. This suggests there is plenty of room for video pretraining to become the new default for learning image representations.
1
1
2
@olivierhenaff
Olivier Hénaff
2 years
We present VITO, a self-supervised method that learns from the dynamic evolution of video frames. For the first time, VITO closes the gap with ImageNet pretraining on a range of scene understanding tasks (COCO, LVIS, PASCAL, and ADE20K).
1
3
1
@olivierhenaff
Olivier Hénaff
8 months
@NielsRogge @ibalazevic @YugeTen @pinelopip3 @Rahma_Chaa @skandakoppula For us, yes, but that is still quite a bit longer than what most VLMs are able to attend to! I agree, however, that we're in dire need of more long-video understanding benchmarks. Perception Test and EgoSchema were essential steps in this direction, but we need to go further
1
0
1