Thrilled to announce our latest work on active data curation: joint example selection (JEST) drastically accelerates large-scale multimodal pretraining, surpassing previous SoTA (SigLIP) with 10x fewer iterations and FLOPs:
So excited to announce what we've been working on for the past ~year or so:
Active Learning Accelerates Large-Scale Visual Understanding
We show that model-based data selection efficiently and effectively speeds up classification and multimodal pretraining by up to 50%
Self-supervised representation learning is greatly facilitated by knowledge of objects and their layouts in real-world scenes. Rather than hard-coding these priors, our new method Odin shows that objects can be discovered from the learned representations themselves.
In-context learning has revolutionized NLP, yet computer vision still relies on the cumbersome pre-train + fine-tune protocol. Introducing Hummingbird, a model capable of in-context scene understanding with a remarkably simple mechanism: memory retrieval.
What makes self-supervised learning from uncurated data so challenging? We identified the heavy-tailed distribution of image content as a limiting factor, and address it with Divide and Contrast. Leads to big gains in data-efficient and transfer learning.
How does the brain make predictions about its environment? We found primary visual cortex to straighten the temporal trajectories of natural videos, facilitating their extrapolation.
Self-supervised learning has made tremendous gains for simple visual tasks like recognizing objects, but interacting with the environment requires detecting and parsing objects and understanding their geometry. Contrastive detection greatly facilitates these tasks.
How does the brain keep track of its uncertainty? We found neural gain variability to represent stimulus uncertainty across the visual hierarchy, and explain these findings with a stochastic variant of divisive normalization:
Humans and animals reason about events spanning days, weeks, and years, yet current CV systems live largely in the present.
Introducing Memory-Consolidated ViT, whose context extends far into the past and sets a new SOTA in long-video understanding with a 10x smaller model
With the advent of AI assistants, humans depend on their outputs at an unprecedented rate. It is vital that these models be aligned with human abilities, judgements, and preferences.
Self-supervised video pretraining yields a big step towards human-aligned visual representations
Thrilled to announce that we have an opening for a Student Researcher to come work with us at
@GoogleDeepMind
!
If you’re interested in multimodal learning, in-context adaptation, memory-augmented perception, or active learning, do consider applying:
Very happy to share our latest unsupervised representation learning work! In addition to SOTA linear classification, we beat supervised networks on ImageNet with 2-5x fewer labels and transfer to PASCAL detection better than supervised pre-training.
Understanding the world by watching it evolve over time might be the holy grail of self-supervised learning. Yet learning from videos has always lagged behind learning from images.
Until now!
It's a great pleasure to announce that the art installation I've been working on will be exhibited at the Florence Trust in London. If you're interested in neural representations, memory/reality, or an interactive audio-visual experience, come check it out!
Are we done with ImageNet? Yes: we found the original labels to no longer be the best predictors of human preferences. But no: with our new label-set, we have removed much of the biases of the original, creating a better benchmark for future research.
There's been lots of great work in long-context video understanding, but most of it comes at the cost of greater conceptual & computational complexity
We instead opted for an ultra-minimalist approach, letting the model attend to past activations consolidated into a memory bank
To do this we developed a framework that allows a model to continuously filter the data it will learn from. In it, actors consume raw data, score it, and send it to the learner using prioritized replay.
Of the criteria we tried, example difficulty and learnability worked well...
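For intuition, here's a minimal NumPy sketch of those two criteria (all names and numbers are illustrative, not our actual implementation): difficulty scores examples by the learner's loss, while learnability contrasts it against a reference model's loss.

```python
import numpy as np

def example_scores(learner_losses, reference_losses):
    """Two model-based selection criteria (illustrative sketch).

    difficulty: examples the learner currently finds hard (high loss).
    learnability: examples that are hard for the learner but easy for a
    reference model, i.e. learnable but not yet learned.
    """
    difficulty = learner_losses
    learnability = learner_losses - reference_losses
    return difficulty, learnability

def select_top_k(scores, k):
    """Prioritized selection: keep the k highest-scoring examples."""
    return np.argsort(scores)[::-1][:k]

# hypothetical per-example losses from the learner and a small reference model
learner = np.array([2.0, 0.1, 1.5, 3.0])
reference = np.array([0.2, 0.0, 1.4, 2.7])
difficulty, learnability = example_scores(learner, reference)
chosen = select_top_k(learnability, k=2)  # indices the learner trains on next
```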
Wondering what self-supervised learning from videos can do for your visual representations? Come find out at our
#NeurIPS2023
poster
#920
, from 5-7pm tonight!
Excited to share that our work on learning more human-aligned, robust visual representations via video pretraining has been accepted to
#Neurips2023
! Thanks so much to my collaborators
@olivierhenaff
@arkitus
and
@joaocarreira
! Stay tuned for the updated camera-ready version!
This means that our method not only accelerates learning in terms of learner iterations, but also total computation of the actors + learner. This is to our knowledge the first example of such "compute-positive" active learning
Odin learns object-level features with a contrastive objective and approximate image masks. It discovers objects by clustering those features, and feeds the resulting masks back into the learning objective, engaging in a virtuous cycle of representation and segmentation quality.
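One iteration of that cycle can be sketched in toy NumPy (illustrative shapes and a tiny k-means, not the actual Odin implementation):

```python
import numpy as np

def discover_objects(features, n_objects, n_iters=10):
    """Cluster per-pixel features (h, w, d) into n_objects masks via k-means."""
    h, w, d = features.shape
    x = features.reshape(-1, d)
    # deterministic init: spread initial centers across the pixel grid
    centers = x[np.linspace(0, len(x) - 1, n_objects).astype(int)].copy()
    for _ in range(n_iters):
        dists = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)  # nearest center per pixel
        for k in range(n_objects):
            if (assign == k).any():
                centers[k] = x[assign == k].mean(0)
    return assign.reshape(h, w)  # discovered object masks

def masked_features(features, masks, n_objects):
    """Pool features within each mask; these object-level vectors would feed
    the contrastive objective, closing the representation/segmentation loop."""
    return np.stack([features[masks == k].mean(0) for k in range(n_objects)])
```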
However, scoring examples can be ~as expensive as learning from them! We therefore drastically scale down the size of the actors relative to the learner (up to 1000x), making them very cheap to run.
Example learnability (but not difficulty) was very robust to this actor downscaling
Finally, Odin requires no prior knowledge about the structure of objects in real-world scenes, raising the possibility of discovering the structure of arbitrary sensory modalities and their combinations.
Paper here:
Super excited for the
@ICCVConference
SSL Tutorial!! I'll be speaking about scene understanding, human alignment, and memory retrieval at 4:15pm Paris time, the full line-up is here: , starting soon in P02
Hey Ibrahim, thanks! We did a scaling study that showed that the effect seemed to be very robust wrt training duration (see below). Also in the multimodal regime we show nice gains over IID CLIP with decent compute budgets (eg ActiveCLIP w 8B examples >> OpenCLIP w 34B examples)
@olivierhenaff
Cool work! Have you plotted the gap vs # of seen examples for very long training durations (>>900M ex)? I'm wondering if active learning is beneficial for short training durations only, but the gap would shrink/vanish eventually (since you rely on small models).
All in all this makes me optimistic that large-scale pretraining might become much cheaper, fairer, and more accessible with data-selection criteria that generalize across models and tasks
In the multimodal setting, we find our method is highly "steerable": by computing example learnability (the difference in loss of the learner and a "reference" model) with ref. models trained on small curated datasets (LTIP), we can leverage much larger, noisier datasets (ALIGN)
@fhuszar
If training speed is a requirement, I would shamelessly recommend contrastive detection. It converges much faster than SimCLR and BYOL, and leads to SOTA transfer learning.
Paper:
Code:
It was very important to get a few details right, particularly the distribution of videos we learn from. Most video datasets have very imbalanced content. We fixed this with VideoNet, a new dataset with a uniform distribution across a diverse set of classes, similar to ImageNet
Contrastive detection works by carving images into pieces in an unsupervised manner, then learning from each simultaneously. This amplifies the learning signal per image, greatly accelerating convergence (up to 10x faster than SimCLR and BYOL).
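A toy sketch of the objective (illustrative, not the exact DetCon recipe): pooled features for the same mask should agree across two augmented views of an image, and contrast against every other mask.

```python
import numpy as np

def contrastive_detection_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (n_masks, d) mask-pooled features from two augmented
    views of the same image. Each mask in view A should match the same mask
    in view B (diagonal positives) and contrast against all other masks."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # pairwise similarities
    logits = logits - logits.max(1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # InfoNCE over masks
```

Because the loss is summed over many masks per image, each image delivers a much denser learning signal than a single image-level contrast.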
Finally, VITO learns to attend to the same parts of images as humans do, in an entirely self-supervised manner, displaying an emergent alignment that is stronger than that of models explicitly trained for this
A bit belated, but happy to share that I defended my PhD thesis! Thanks to my advisor
@EeroSimoncelli
for all the support along the way! This week I'm excited to start a new chapter as a Research Scientist
@GoogleDeepMind
building models of vision with
@olivierhenaff
and others!
We train individual self-supervised experts on each of those subsets, then distill them back into a single model. Although this was designed for handling uncurated data, it also leads to big improvements in data-efficient ImageNet classification!
Super excited to chat with
@misssaxbys
about brains, machines, vision, memory, and art and what they can all learn from each other tomorrow at the
@MindstoneHQ
AI meetup. Do stop by if you're in London!
This was such a good time, and a nice opportunity to reflect on why I think AI research can be used to deepen our understanding of the human experience.
Thanks so much
@misssaxbys
for inviting me and
@JoshuaWohle
for hosting!
Discussing memory, art + what AI can teach us about human perception and the mind in a church ⛪️
was certainly a career highlight.
A huge thank you to
@olivierhenaff
for sharing your insights and
@JoshuaWohle
for having us
Finally, when evaluating on long-video understanding benchmarks like EgoSchema and Perception Test, we find MC-ViT to outperform billion-scale VLMs like Flamingo and SeViLa despite using 10x fewer parameters
All in all this points to the promise of a much more active form of self-supervised learning.
This project was a thoroughly enjoyable collaboration with
@YonglongT
and
@avdnoord
!
All in all, this raises the possibility of simple architectures freely associating across time and modalities, realizing their potential as multimodal assistants and enriching our everyday experience
Contrastive methods benefit a lot from training on ImageNet, which contains many fine-grained classes that are almost perfectly balanced. Uncurated datasets lack this property, but we can approximately recover it by clustering self-supervised representations.
In summary, the amazing progress we’ve seen with LLMs is a reminder that, like humans, the most impactful ML systems are ones that are able to generalize across many tasks, rather than specializing to any given one. Hopefully computer vision can follow in this direction
Thanks to this, MC-ViT displays excellent scaling behavior when learning from long videos, outperforming joint space-time attention and efficient approximations with a 10x smaller memory footprint
Memory retrieval allows a single Hummingbird to perform multiple scene understanding tasks without modification. What else? Hummingbird retrieval is also much more data-efficient and faster to adapt than fine-tuning or linear probing
This was all possible thanks to a fantastic collaboration with
@ibalazevic
, david steiner,
@nikparth1
, and
@relja_work
and represents our first foray into combining vision and memory research. Looking forward to more to come!
We quantify alignment with three signatures of human intelligence: generalization across tasks, robustness to perturbations, and consistency with human judgements. Across tasks, our self-supervised video model, VITO, matches specialist models designed specifically for each task
Given a trained representation, making inferences about new images is very simple: just provide a prompt of annotated examples + retrieve the labels that are closest to each feature. This makes few assumptions about the labels and works for tasks like segmentation and depth prediction
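The retrieval step is simple enough to sketch in a few lines of NumPy (names illustrative; the real model retrieves over dense spatial features):

```python
import numpy as np

def retrieve_labels(query_feats, prompt_feats, prompt_labels, k=1):
    """For each query feature, find its nearest neighbours among the prompt
    features and vote over their (integer) labels. Task-agnostic: works for
    anything whose labels live alongside features, e.g. segmentation classes."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sims = q @ p.T                               # cosine similarity to memory
    nn = np.argsort(-sims, axis=1)[:, :k]        # top-k neighbours per query
    votes = prompt_labels[nn]
    return np.array([np.bincount(v).argmax() for v in votes])  # majority vote
```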
Together, these results raise the possibility of a new generation of learning algorithms that are more widely accessible, alleviating the need for large amounts of human annotation and computation.
I’m thrilled that this work was accepted to
#ICCV21
as an oral presentation!
@DeepMind
So excited to see the open-source release of our self-supervised learning algorithm, DetCon. If you're interested in solving hard vision tasks in less time, check it out at
How does VITO do all of this? With a very simple self-supervised recipe: learning to track stable and distinctive content in videos with lightweight attention and contrastive learning. Crucially, however...
Previously, we found human observers to also straighten natural videos. Interestingly, this perceptual straightening was about 3x bigger than neural straightening in V1. Might further neural processing be required to match perceptual straightening?
The issue is that most pretraining methods (e.g. supervised or MoCo) don’t work well when decoded in this manner (without finetuning). In contrast Hummingbird uses cross-attention during training to prepare it for non-parametric eval, yielding large gains over the rest
Together, the diversity of VITO's benefits (task generality, robustness, human consistency) and the simplicity of its training recipe suggest that there is a great untapped potential in video pretraining as a paradigm for learning general, human-aligned visual representations
We quantified the predictability of natural videos using the curvature of the sequence of frames: in the pixel-domain, or as represented by a population of jointly-recorded V1 neurons. Comparing the two curvatures, we found they were much smaller in V1!
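Curvature here is the average angle between successive difference vectors of the trajectory; a minimal sketch of that metric:

```python
import numpy as np

def trajectory_curvature(frames):
    """Mean angle (degrees) between successive difference vectors of a
    frame sequence; straight trajectories score ~0, sharp bends score high."""
    x = frames.reshape(len(frames), -1).astype(float)
    diffs = np.diff(x, axis=0)                    # steps between frames
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cosines = np.clip((diffs[:-1] * diffs[1:]).sum(1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cosines)).mean())
```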
Finally, we fit a two-stage, linear-nonlinear model to neural responses, and found that it accounted for these effects *provided* we included the nonlinearities. V1 neurons could therefore be using their nonlinear computations to enable prediction in the natural environment.
@jimwinkens
@arkitus
Hey Jim! It does, because the online actor is only ever compared to a reference model of the same (very small) size, not the (much larger) learner. See also () for our scaling study
@VictorButoi
Hi Victor, thanks! The prompts are the same as the training data used for transfer learning. Since we don't know ahead of time what classes an image will contain, it needs to attend to a prompt containing all classes
VITO is also much more robust to image deformations that are likely to occur in the real world, surpassing supervised, self-supervised, and adversarial image pretraining
@Le_Zhang7
@GoogleDeepMind
Hi Le, thanks for reaching out! The opening is for working with us in the London office, so you should apply for a "BS/MS/PhD placement in EMEA"
Congratulations to
@nikparth1
on a very successful internship project, and big thanks to
@joaocarreira
and
@arkitus
for another enjoyable collaboration!
Interestingly, VITO seems to learn to parse real world scenes by binding together content that co-occurs in time, building semantic correspondences between frames in a video and discovering high-level concepts.
@mayfer
Yes exactly, with consolidation happening with a very simple non-parametric reconstructive process like k-means or coreset selection. Lots more to explore here!
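Roughly, k-means consolidation could look like this (an illustrative sketch, not the actual MC-ViT code): compress a long stream of past activations into a small bank of centroids that later tokens attend to.

```python
import numpy as np

def consolidate_memory(activations, budget, n_iters=10):
    """Compress a stream of past activations (n, d) into `budget` k-means
    centroids; later tokens would cross-attend to this memory bank."""
    idx = np.linspace(0, len(activations) - 1, budget).astype(int)
    centers = activations[idx].astype(float)    # deterministic init
    for _ in range(n_iters):
        dists = ((activations[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)                # nearest centroid per token
        for k in range(budget):
            if (assign == k).any():
                centers[k] = activations[assign == k].mean(0)
    return centers                              # the consolidated memory
```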
Neural straightening appeared to be specific to natural videos: when recording the responses to artificial videos, we found them to be *more* curved than in the pixel domain.
I had a great time speaking at UvA's Deep Vision Seminar, check out the recording if you're curious about self-supervised theories of biological and artificial intelligence, object discovery, and leveraging videos for scene understanding. Thanks
@y_m_asano
for the invitation!!
Happy to share with you the QUvA Deep Vision Seminar talk from
@olivierhenaff
from
@DeepMind
on
"The virtuous cycle of object discovery and representation" is now online:
Enjoy😊:
Surprisingly, VITO achieves this with a few simple changes to the standard contrastive paradigm: data curation, better augmentations, and attention pooling.
This suggests there is plenty of room for video pretraining to become the new default for learning image representations.
We present VITO, a self-supervised method that learns from the dynamic evolution of video frames.
For the first time, VITO closes the gap with ImageNet pretraining on a range of scene understanding tasks (COCO, LVIS, PASCAL, and ADE20K).
@NielsRogge
@ibalazevic
@YugeTen
@pinelopip3
@Rahma_Chaa
@skandakoppula
For us, yes, but that is still quite a bit longer than what most VLMs are able to attend to!
I agree however that we're in dire need of more long-video understanding benchmarks. Perception Test and EgoSchema were essential steps in this direction but we need to go further