Thrilled to announce our latest work on active data curation: joint example selection (JEST) drastically accelerates large-scale multimodal pretraining, surpassing previous SoTA (SigLIP) with 10x fewer iterations and FLOPs:
So excited to announce what we've been working on for the past ~year or so:
Active Learning Accelerates Large-Scale Visual Understanding
We show that model-based data selection efficiently and effectively speeds up classification and multimodal pretraining by up to 50%
Self-supervised representation learning is greatly facilitated by knowledge of objects and their layouts in real-world scenes. Rather than hard-coding these priors, our new method Odin shows that objects can be discovered from the learned representations themselves.
In-context learning has revolutionized NLP, yet computer vision still relies on the cumbersome pre-train + fine-tune protocol. Introducing Hummingbird, a model capable of in-context scene understanding with a remarkably simple mechanism: memory retrieval.
What makes self-supervised learning from uncurated data so challenging? We identified the heavy-tailed distribution of image content as a limiting factor, and address it with Divide and Contrast. Leads to big gains in data-efficient and transfer learning.
How does the brain make predictions about its environment? We found primary visual cortex to straighten the temporal trajectories of natural videos, facilitating their extrapolation.
Self-supervised learning has made tremendous gains for simple visual tasks like recognizing objects, but interacting with the environment requires detecting and parsing objects and understanding their geometry. Contrastive detection greatly facilitates these tasks.
How does the brain keep track of its uncertainty? We found neural gain variability to represent stimulus uncertainty across the visual hierarchy, and explain these findings with a stochastic variant of divisive normalization:
Humans and animals reason about events spanning days, weeks, and years, yet current CV systems live largely in the present.
Introducing Memory-Consolidated ViT, whose context extends far into the past and sets a new SOTA in long-video understanding with a 10x smaller model
With the advent of AI assistants, humans depend on their outputs at an unprecedented rate. It is vital that these models be aligned with human abilities, judgements, and preferences.
Self-supervised video pretraining yields a big step towards human-aligned visual representations
Thrilled to announce that we have an opening for a Student Researcher to come work with us at
@GoogleDeepMind
!
If you’re interested in multimodal learning, in-context adaptation, memory-augmented perception, or active learning, do consider applying:
Very happy to share our latest unsupervised representation learning work! In addition to SOTA linear classification, we beat supervised networks on ImageNet with 2-5x fewer labels and transfer to PASCAL detection better than supervised pre-training.
Understanding the world by watching it evolve over time might be the holy grail of self-supervised learning. Yet learning from videos has always lagged behind learning from images.
Until now!
It's a great pleasure to announce that the art installation I've been working on will be exhibited at the Florence Trust in London. If you're interested in neural representations, memory/reality, or an interactive audio-visual experience, come check it out!
Are we done with ImageNet? Yes: we found the original labels to no longer be the best predictors of human preferences. But no: with our new label-set, we have removed much of the biases of the original, creating a better benchmark for future research.
There's been lots of great work in long-context video understanding, but most of it comes at the cost of greater conceptual & computational complexity
We instead opted for an ultra-minimalist approach, letting the model attend to past activations consolidated into a memory bank
To do this we developed a framework that allows a model to continuously filter the data it will learn from. In it, actors consume raw data, score it, and send it to the learner using prioritized replay.
Of the criteria we tried, example difficulty and learnability worked well...
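For intuition, here's a minimal NumPy sketch of those two criteria (all names and numbers are illustrative, not our actual implementation): difficulty scores examples by the learner's loss, while learnability contrasts it against a reference model's loss.

```python
import numpy as np

def example_scores(learner_losses, reference_losses):
    """Two model-based selection criteria (illustrative sketch).

    difficulty: examples the learner currently finds hard (high loss).
    learnability: examples that are hard for the learner but easy for a
    reference model, i.e. learnable but not yet learned.
    """
    difficulty = learner_losses
    learnability = learner_losses - reference_losses
    return difficulty, learnability

def select_top_k(scores, k):
    """Prioritized selection: keep the k highest-scoring examples."""
    return np.argsort(scores)[::-1][:k]

# hypothetical per-example losses from the learner and a small reference model
learner = np.array([2.0, 0.1, 1.5, 3.0])
reference = np.array([0.2, 0.0, 1.4, 2.7])
difficulty, learnability = example_scores(learner, reference)
chosen = select_top_k(learnability, k=2)  # indices the learner trains on next
```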
Wondering what self-supervised learning from videos can do for your visual representations? Come find out at our
#NeurIPS2023
poster
#920
, from 5-7pm tonight!
Excited to share that our work on learning more human-aligned, robust visual representations via video pretraining has been accepted to
#Neurips2023
! Thanks so much to my collaborators
@olivierhenaff
@arkitus
and
@joaocarreira
! Stay tuned for the updated camera-ready version!
This means that our method not only accelerates learning in terms of learner iterations, but also total computation of the actors + learner. This is to our knowledge the first example of such "compute-positive" active learning
Odin learns object-level features with a contrastive objective and approximate image masks. It discovers objects by clustering those features, and feeds the resulting masks back into the learning objective, engaging in a virtuous cycle of representation and segmentation quality.
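One iteration of that cycle can be sketched in toy NumPy (illustrative shapes and a tiny k-means, not the actual Odin implementation):

```python
import numpy as np

def discover_objects(features, n_objects, n_iters=10):
    """Cluster per-pixel features (h, w, d) into n_objects masks via k-means."""
    h, w, d = features.shape
    x = features.reshape(-1, d)
    # deterministic init: spread initial centers across the pixel grid
    centers = x[np.linspace(0, len(x) - 1, n_objects).astype(int)].copy()
    for _ in range(n_iters):
        dists = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)  # nearest center per pixel
        for k in range(n_objects):
            if (assign == k).any():
                centers[k] = x[assign == k].mean(0)
    return assign.reshape(h, w)  # discovered object masks

def masked_features(features, masks, n_objects):
    """Pool features within each mask; these object-level vectors would feed
    the contrastive objective, closing the representation/segmentation loop."""
    return np.stack([features[masks == k].mean(0) for k in range(n_objects)])
```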
However, scoring examples can be ~as expensive as learning from them! We therefore drastically scale down the size of the actors relative to the learner (up to 1000x), making them very cheap to run.
Example learnability (but not difficulty) was very robust to this actor downscaling
Finally, Odin requires no prior knowledge about the structure of objects in real-world scenes, raising the possibility of discovering the structure of arbitrary sensory modalities and their combinations.
Paper here:
Super excited for the
@ICCVConference
SSL Tutorial!! I'll be speaking about scene understanding, human alignment, and memory retrieval at 4:15pm Paris time, the full line-up is here: , starting soon in P02
Hey Ibrahim, thanks! We did a scaling study that showed that the effect seemed to be very robust wrt training duration (see below). Also in the multimodal regime we show nice gains over IID CLIP with decent compute budgets (eg ActiveCLIP w 8B examples >> OpenCLIP w 34B examples)
@olivierhenaff
Cool work! Have you plotted the gap vs # of seen examples for very long training durations (>>900M ex)? I'm wondering if active learning is beneficial for short training durations only, but the gap would shrink/vanish eventually (since you rely on small models).
All in all this makes me optimistic that large-scale pretraining might become much cheaper, fairer, and more accessible with data-selection criteria that generalize across models and tasks
In the multimodal setting, we find our method is highly "steerable": by computing example learnability (the difference in loss of the learner and a "reference" model) with ref. models trained on small curated datasets (LTIP), we can leverage much larger, noisier datasets (ALIGN)
@fhuszar
If training speed is a requirement, I would shamelessly recommend contrastive detection. It converges much faster than SimCLR and BYOL, and leads to SOTA transfer learning.
Paper:
Code:
It was very important to get a few details right, particularly the distribution of videos we learn from. Most video datasets have very imbalanced content. We fixed this with VideoNet, a new dataset with a uniform distribution across a diverse set of classes, similar to ImageNet
Contrastive detection works by carving images into pieces in an unsupervised manner, then learning from each simultaneously. This amplifies the learning signal per image, greatly accelerating convergence (up to 10x faster than SimCLR and BYOL).
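A toy sketch of the objective (illustrative, not the exact DetCon recipe): pooled features for the same mask should agree across two augmented views of an image, and contrast against every other mask.

```python
import numpy as np

def contrastive_detection_loss(feats_a, feats_b, temperature=0.1):
    """feats_a, feats_b: (n_masks, d) mask-pooled features from two augmented
    views of the same image. Each mask in view A should match the same mask
    in view B (diagonal positives) and contrast against all other masks."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # pairwise similarities
    logits = logits - logits.max(1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # InfoNCE over masks
```

Because the loss is summed over many masks per image, each image delivers a much denser learning signal than a single image-level contrast.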
Finally, VITO learns to attend to the same parts of images as humans do, in an entirely self-supervised manner, displaying an emergent alignment that is stronger than that of models explicitly trained for this
A bit belated, but happy to share that I defended my PhD thesis! Thanks to my advisor
@EeroSimoncelli
for all the support along the way! This week I'm excited to start a new chapter as a Research Scientist
@GoogleDeepMind
building models of vision with
@olivierhenaff
and others!
We train individual self-supervised experts on each of those subsets, then distill them back into a single model. Although this was designed for handling uncurated data, it also leads to big improvements in data-efficient ImageNet classification!
Super excited to chat with
@misssaxbys
about brains, machines, vision, memory, and art and what they can all learn from each other tomorrow at the
@MindstoneHQ
AI meetup. Do stop by if you're in London!
This was such a good time, and a nice opportunity to reflect on why I think AI research can be used to deepen our understanding of the human experience.
Thanks so much
@misssaxbys
for inviting me and
@JoshuaWohle
for hosting!
Discussing memory, art + what AI can teach us about human perception and the mind in a church ⛪️
was certainly a career highlight.
A huge thank you to
@olivierhenaff
for sharing your insights and
@JoshuaWohle
for having us
Finally, when evaluating on long-video understanding benchmarks like EgoSchema and Perception Test, we find MC-ViT to outperform billion-scale VLMs like Flamingo and SeViLa despite using 10x fewer parameters
All in all this points to the promise of a much more active form of self-supervised learning.
This project was a thoroughly enjoyable collaboration with
@YonglongT
and
@avdnoord
!
All in all, this raises the possibility of simple architectures freely associating across time and modalities, realizing their potential as multimodal assistants and enriching our everyday experience
Contrastive methods benefit a lot from training on ImageNet, which contains many fine-grained classes that are almost perfectly balanced. Uncurated datasets lack this property, but we can approximately recover it by clustering self-supervised representations.
In summary, the amazing progress we’ve seen with LLMs is a reminder that, like humans, the most impactful ML systems are ones that are able to generalize across many tasks, rather than specializing to any given one. Hopefully computer vision can follow in this direction
Thanks to this, MC-ViT displays excellent scaling behavior when learning from long videos, outperforming joint space-time attention and efficient approximations with a 10x smaller memory footprint
Memory retrieval allows a single Hummingbird to perform multiple scene understanding tasks without modification. What else? Hummingbird retrieval is also much more data-efficient and faster to adapt than fine-tuning or linear probing
This was all possible thanks to a fantastic collaboration with
@ibalazevic
, david steiner,
@nikparth1
, and
@relja_work
and represents our first foray into combining vision and memory research. Looking forward to more to come!
We quantify alignment with three signatures of human intelligence: generalization across tasks, robustness to perturbations, and consistency with human judgements. Across tasks, our self-supervised video model, VITO, matches specialist models designed specifically for each task
Given a trained representation, making inferences about new images is very simple: just provide a prompt of annotated examples + retrieve the labels that are closest to each feature. This makes few assumptions about the labels and works for tasks like segmentation and depth prediction
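The retrieval step is simple enough to sketch in a few lines of NumPy (names illustrative; the real model retrieves over dense spatial features):

```python
import numpy as np

def retrieve_labels(query_feats, prompt_feats, prompt_labels, k=1):
    """For each query feature, find its nearest neighbours among the prompt
    features and vote over their (integer) labels. Task-agnostic: works for
    anything whose labels live alongside features, e.g. segmentation classes."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sims = q @ p.T                               # cosine similarity to memory
    nn = np.argsort(-sims, axis=1)[:, :k]        # top-k neighbours per query
    votes = prompt_labels[nn]
    return np.array([np.bincount(v).argmax() for v in votes])  # majority vote
```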
Together, these results raise the possibility of a new generation of learning algorithms that are more widely accessible, alleviating the need for large amounts of human annotation and computation.
I’m thrilled that this work was accepted to
#ICCV21
as an oral presentation!
@DeepMind
So excited to see the open-source release of our self-supervised learning algorithm, DetCon. If you're interested in solving hard vision tasks in less time, check it out at
How does VITO do all of this? With a very simple self-supervised recipe: learning to track stable and distinctive content in videos with lightweight attention and contrastive learning. Crucially, however...
Previously, we found human observers to also straighten natural videos. Interestingly, this perceptual straightening was about 3x bigger than neural straightening in V1. Might further neural processing be required to match perceptual straightening?
The issue is that most pretraining methods (e.g. supervised or MoCo) don’t work well when decoded in this manner (without finetuning). In contrast Hummingbird uses cross-attention during training to prepare it for non-parametric eval, yielding large gains over the rest
Together, the diversity of VITO's benefits (task generality, robustness, human consistency) and the simplicity of its training recipe suggest that there is a great untapped potential in video pretraining as a paradigm for learning general, human-aligned visual representations
We quantified the predictability of natural videos using the curvature of the sequence of frames: in the pixel-domain, or as represented by a population of jointly-recorded V1 neurons. Comparing the two curvatures, we found they were much smaller in V1!
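Curvature here is the average angle between successive difference vectors of the trajectory; a minimal sketch of that metric:

```python
import numpy as np

def trajectory_curvature(frames):
    """Mean angle (degrees) between successive difference vectors of a
    frame sequence; straight trajectories score ~0, sharp bends score high."""
    x = frames.reshape(len(frames), -1).astype(float)
    diffs = np.diff(x, axis=0)                    # steps between frames
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cosines = np.clip((diffs[:-1] * diffs[1:]).sum(1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cosines)).mean())
```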
Finally, we fit a two-stage, linear-nonlinear model to neural responses, and found that it accounted for these effects *provided* we included the nonlinearities. V1 neurons could therefore be using their nonlinear computations to enable prediction in the natural environment.
@jimwinkens
@arkitus
Hey Jim! It does, because the online actor is only ever compared to a reference model of the same (very small) size, not the (much larger) learner. See also () for our scaling study
@VictorButoi
Hi Victor, thanks! The prompts are the same as the training data used for transfer learning. Since we don't know ahead of time what classes an image will contain, it needs to attend to a prompt containing all classes
VITO is also much more robust to image deformations that are likely to occur in the real world, surpassing supervised, self-supervised, and adversarial image pretraining
@Le_Zhang7
@GoogleDeepMind
Hi Le, thanks for reaching out! The opening is for working with us in the London office, so you should apply for a "BS/MS/PhD placement in EMEA"
Congratulations to
@nikparth1
on a very successful internship project, and big thanks to
@joaocarreira
and
@arkitus
for another enjoyable collaboration!
Interestingly, VITO seems to learn to parse real world scenes by binding together content that co-occurs in time, building semantic correspondences between frames in a video and discovering high-level concepts.
@mayfer
Yes exactly, with consolidation happening with a very simple non-parametric reconstructive process like k-means or coreset selection. Lots more to explore here!
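Roughly, k-means consolidation could look like this (an illustrative sketch, not the actual MC-ViT code): compress a long stream of past activations into a small bank of centroids that later tokens attend to.

```python
import numpy as np

def consolidate_memory(activations, budget, n_iters=10):
    """Compress a stream of past activations (n, d) into `budget` k-means
    centroids; later tokens would cross-attend to this memory bank."""
    idx = np.linspace(0, len(activations) - 1, budget).astype(int)
    centers = activations[idx].astype(float)    # deterministic init
    for _ in range(n_iters):
        dists = ((activations[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(1)                # nearest centroid per token
        for k in range(budget):
            if (assign == k).any():
                centers[k] = activations[assign == k].mean(0)
    return centers                              # the consolidated memory
```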
Neural straightening appeared to be specific to natural videos: when recording the responses to artificial videos, we found them to be *more* curved than in the pixel domain.
I had a great time speaking at UvA's Deep Vision Seminar, check out the recording if you're curious about self-supervised theories of biological and artificial intelligence, object discovery, and leveraging videos for scene understanding. Thanks
@y_m_asano
for the invitation!!
Happy to share with you the QUvA Deep Vision Seminar talk from
@olivierhenaff
from
@DeepMind
on
"The virtuous cycle of object discovery and representation" is now online:
Enjoy😊:
Surprisingly, VITO achieves this with a few simple changes to the standard contrastive paradigm: data curation, better augmentations, and attention pooling.
This suggests there is plenty of room for video pretraining to become the new default for learning image representations.
We present VITO, a self-supervised method that learns from the dynamic evolution of video frames.
For the first time, VITO closes the gap with ImageNet pretraining on a range of scene understanding tasks (COCO, LVIS, PASCAL, and ADE20K).
@NielsRogge
@ibalazevic
@YugeTen
@pinelopip3
@Rahma_Chaa
@skandakoppula
For us, yes, but that is still quite a bit longer than what most VLMs are able to attend to!
I agree however that we're in dire need of more long-video understanding benchmarks. Perception Test and EgoSchema were essential steps in this direction but we need to go further