We are releasing 4M-21 with a permissive license, including its source code and trained models. It's an effective multimodal model that solves tens of tasks & modalities. See the demo code, sample results, and the tokenizers of diverse modalities on the website.
We are releasing the 1st version of 4M, a framework for training multimodal foundation models across tens of modalities & tasks, based on scalable masked modeling. Joint effort by @EPFL_en & @Apple.
4M: Massively Multimodal Masked Modeling
🌐
🧵1/n
Salaries of PhD students, postdocs, and professors in Europe and Switzerland. Numbers are in Euros. Of special interest to future students and faculty 😜
Is it a good idea to train RL policies from raw pixels? Could visual priors about the world help RL? We just released the code of our Mid-Level Vision paper addressing these questions. Spoiler: using raw pixels doesn’t generalize! Play with the results at
We present MultiMAE at #ECCV2022 on Wed. MultiMAE is a general multi-modal & multi-task pre-training strategy based on masked autoencoders. It shows notable results in cross-modal representation learning and transfer learning.
1/5
Progress on #consistency & #multitask learning. Existing methods give inconsistent results across tasks, even when jointly trained. We developed a general method for learning with Cross-Task Consistency. It gave notable gains on everything we tried. Live #demo:
Happy to share that CLIPasso will receive one of the best paper awards at #SIGGRAPH 2022. Congrats to the entire team! CLIP turned out to be a powerful perceptual loss.
We released OMNIDATA: a pipeline for creating steerable vision datasets. It gives the user control over generating the desired dataset using real-world 3D scans. It bridges vision datasets (pre-recorded data) and simulators (online data generation). Demo:
Vision datasets (e.g. ImageNet) are usually collected once for a fixed task. But how do we know the choice of camera intrinsics, tasks, etc. is a good one?
Our ICCV paper on “steerable datasets” addresses this problem and gets 'human-level' surface normal predictions along the way (1/3)
Is it possible to adapt a neural network on the fly at test time to cope with distribution shifts? RNA does precisely that by creating a closed-loop feedback system. We will present it on Wed afternoon at @ICCVConference.
1/n
Next time someone tells you reaching "human-level" at task X is the holy grail in AI, show them this video. All it takes is making the task narrow enough and there is a way to brutally outperform humans already. Being as *broad* as humans/animals is the challenge.
I will hire again from the Summer@EPFL program this year. Several great projects came out of S@E interns in the past, eg CLIPasso (SIGGRAPH22 best paper), Omnidata (ICCV21). Apply if our interests align.
(this is for BS/MS interns. PhD visitors have another program)
The Summer@EPFL 2023 application site is now open! 🎊 To apply, please visit the Summer@EPFL website: . The application deadline for all students is the Sunday closest to the 1st of December (anywhere on earth).
OpenBot is a step in the right direction. Massively scalable robotic platforms are great. I dream of an army of little (harmless!) robots running around visually exploring and making sense of the world.
We'll present at NeurIPS, today at 5pm CST. Spotlight #1022.
Effectively bringing sensory modalities to large models is one way to make them more grounded, and ultimately have a more complete World Model. This is a step in that direction hopefully, and more will come.
4M exhibits a solidly learned cross-modal representation. We can use the various modalities to probe how 4M reconciles unusual inputs by manipulating one part of the input while keeping the remainder fixed.
(8/n)
Gibson Database of Spaces includes 572 buildings, 1447 floors, and >2 million ft². All real buildings, scanned and #3D reconstructed. Worth a few years of human visual experience. Browse the spaces via videos & 3D cuts:
#perception #robotics #dataset #vision
The point isn’t making big $$$ as a student/postdoc, but living comfortably enough to focus on research rather than financial preoccupation. Especially if supporting a family. I think the general picture of the table remains true after considering living expenses and variance.
Exactly 3 years ago we proposed to #CVPR with @ozansener. Today glad to see the @nature article on the importance of negative results. “one of the worst aspects of science today: its toxic definitions of success”.
Visual odometry is a basic function for embodied AI. At #CVPR23 we will present a multi-modal & modality-invariant visual odometry framework called Visual Odometry Transformer (VOT). Also, I'll give a talk on multi-modal learning across several projects at the Multiearth w/ on Mon.
🧵
Tomorrow at @CVPR, I'll give a talk about recent works on multi-modal and multi-task masked modeling for creating vision foundation models.
1:45 PM @ West 109 - 110
This gem never gets old. Great for a break from arxiv. It’s remarkable how much jargon education, and how little critical-thinking training, we receive in AI today. Watch the first minute and you’ll be sold. Science wisdom by @ProfFeynman.
We will present Task Discovery at #NeurIPS on Thur. Large NNs are known to fit any *training* labels. But learning from what labels would lead to *generalization*? Can we find such labels/tasks for an unlabeled dataset automatically? What would they mean?
What are the tasks that a neural net generalizes on? In our #NeurIPS2022 paper, we introduce a Task Discovery 🔎 framework to approach this question and automatically find such tasks. We show how such tasks look and what they reveal about NNs.
🌐
🧵1/9
Classical sampling-based planning algorithms in robotics (e.g. RRT, PRM) are efficient, performant & interpretable. Are they useful in learning-based frameworks?
PALMER (#NeurIPS22, #CoRL22 w) shows they can be effectively repurposed for learning-based frameworks & representations.
🧵
Gibson environment's ~600 building meshes rendered directly in the PyBullet physics engine! FPS >5000! Great work by @erwincoumans. Check here if you want to visit inside these buildings: . Erwin's PyBullet rendering:
There have been demos of “multimodal foundation model” results – but one with a demonstrable deep & broad understanding of the input like 4M’s is unprecedented. It’s not an image+text conversational model, but one that extracts a deeper understanding of the scene.
(2/n)
If you want to see more than #turtlebot and two-finger gripper arms, Jamie Paik @robotician gave a keynote talk with fun videos at #CoRL 2019 on soft robotics and intuitive interactions.
Tiny Images dataset (>1700 citations) was permanently taken down, due to the (unintended) inclusion of inappropriate language and images, found by Prabhu & Birhane. Clearly, everything we do (and did) in computer vision is now under much greater scrutiny!
We introduce a general approach for enforcing diversity in ensembles. It leads to notable improvements in #robustness on a wide range of tasks and datasets for #adversarial and non-adversarial shifts.
Joint work with @oguzhanthefatih and @zamir_ar
Website:
Via this objective, MultiMAE learns cross-modal predictive coding. The video showcases an example where we input only depth & two RGB patches. The hue of one patch is being changed. The model propagates the colors semantically and according to depth. More examples on the webpage. 3/5
New York Times @nytimes article on home robotics, failures of the past, and (not-so-low-hanging) potentials for the future. Covered our Gibson environment too. "What Comes After the Roomba?"
No interaction with the world yet, but clearly some nontrivial muscle control and behavior is present. Always interesting to contemplate how much cognitive and control bias we are born with, before any learning occurs.
Crowning a successful Nature of Robotics exhibition, EPFL Pavilions would like to invite you to a guided virtual tour with the exhibition's curator, Giulia Bini.
Join us today at 6 PM on Instagram: #virtualtour #natureofrobotics #epflpavilions
4M trains a single Transformer jointly on many diverse modalities. The key to making it scalable was relying on tokenization to remove modality-specific intricacies, then masking tokens from both the inputs and targets to encourage multimodal fusion & improve efficiency.
(3/n)
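The input/target masking above can be sketched as follows. This is a toy illustration of the idea, not the released 4M code; the function name, modality names, and budget parameters are all illustrative:

```python
import random

def sample_multimodal_mask(tokens_per_modality, input_budget, target_budget, seed=0):
    """Pool tokens from all modalities, then sample a small visible input set
    and a disjoint masked target set for the model to predict."""
    rng = random.Random(seed)
    pool = [(m, i) for m, n in tokens_per_modality.items() for i in range(n)]
    rng.shuffle(pool)
    inputs = pool[:input_budget]                            # visible to the model
    targets = pool[input_budget:input_budget + target_budget]  # to be predicted
    return inputs, targets

# Example: 3 modalities, keep 4 tokens as input, predict 6 as targets.
inputs, targets = sample_multimodal_mask({"rgb": 16, "depth": 16, "caption": 8},
                                         input_budget=4, target_budget=6)
```

Keeping both budgets small is what makes the objective cheap: the Transformer only ever sees and predicts a fraction of the full token set.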
4M can perform compositional generation by weighting different conditions by different amounts, even negatively. This allows the user to control precisely how strongly or weakly a generated output should follow each condition.
(9/n)
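One plausible reading of this weighting, in the spirit of classifier-free guidance (an assumption about the mechanism, not the paper's exact formulation): combine the unconditional prediction with per-condition deltas, each scaled by a user-chosen weight that may be negative to push away from a condition.

```python
def compose_logits(uncond, cond_logits, weights):
    """Guidance-style compositional weighting over a list of conditions."""
    combined = list(uncond)
    for logits, w in zip(cond_logits, weights):
        for i, (u, c) in enumerate(zip(uncond, logits)):
            combined[i] += w * (c - u)  # scaled delta toward (or away from) the condition
    return combined

# Two conditions: follow the first strongly, push away from the second.
out = compose_logits([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], weights=[2.0, -0.5])
# out[0] = 0 + 2*(1-0) + (-0.5)*0 = 2.0 ; out[1] = 0 + 2*0 + (-0.5)*(1-0) = -0.5
```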
@andrey_kurenkov ImageNet pretraining doesn't work well if the task isn't based on object semantics (e.g. monocular 3D) or the images aren't from internet users (i.e. Flickr, Instagram, etc. style). See the Taskonomy analysis & the works that apply ImageNet models to images coming from robot onboard cameras.
@docmilanfar I empathize. Such itemized recipes exist because they’re tempting (to both the speaker and the audience). We like them because following them would provide a tangible path to greatness. We don’t want to believe that often there isn’t any; and the lists are usually overgeneralizations.
Through controlled ablations, we found that increasing the number of pre-training tasks generally improves transfer performance, got insights into the masking strategy, and observed promising scaling trends in terms of dataset and model size.
(13/n)
MultiMAE has a simple and efficient pre-training objective: mask out a large number of patches from multiple input modalities, and learn to reconstruct them from the remaining information. 2/5
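A minimal sketch of such a masked reconstruction objective (illustrative, not the paper's implementation): the loss is computed only over the masked-out patches, since the visible ones carry no learning signal.

```python
def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error restricted to masked-out patches."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors)

# Patch 0 is visible (excluded); patches 1 and 2 were masked out.
loss = masked_reconstruction_loss([1.0, 2.0, 3.0], [1.0, 0.0, 5.0],
                                  mask=[False, True, True])
# ((2-0)^2 + (3-5)^2) / 2 = 4.0
```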
@colinraffel ImageNet performance is not a full representation of “learning from limited labeled data”, though. The trends on other tasks (eg single-image 3D) don’t quite hold up. There seems to be some ImageNet/object-classification overfitting in methodologies.
By adding global embeddings of models like DINOv2 or ImageBind to the set of 4M modalities, 4M gains multimodal retrieval capabilities that were not possible with the original networks. 4M effectively distilled the contrastive models using a more generative objective.
(11/n)
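Given such global embeddings, retrieval itself reduces to nearest-neighbor search. A toy sketch (the 2-D vectors and names are made up for illustration; real embeddings would come from the model):

```python
import math

def retrieve(query, gallery, top_k=2):
    """Rank gallery items by cosine similarity of their global embeddings."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = sorted(gallery.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy 2-D "global embeddings".
names = retrieve([1.0, 0.0], {"cat": [0.9, 0.1], "dog": [0.0, 1.0], "car": [1.0, 0.05]})
# → ["car", "cat"]
```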
We trained 4M on different kinds of image, semantic, and geometric metadata extracted from the pseudo labels, enabling a high degree of control over the generation process and strong potential for steerable data generation.
(10/n)
What?? According to the Supreme Court of the United States “Using copyrighted material in a dataset that is used to train a discriminative machine-learning algorithm is perfectly legal”
Besides the out-of-the-box capabilities, a 4M model can also be directly used as a ViT backbone. It exhibits strong transfer performance by outperforming MAE and MultiMAE on various standard vision benchmarks.
(12/n)
4M models can output any of the modalities conditioned on any other(s). To do that, we iteratively predict and sample tokens then add them back to the input. Once all tokens from a modality are predicted, we move on to the next modality.
(5/n)
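The chained decoding loop above can be sketched roughly like this; all names are illustrative, and the dummy sampler stands in for the actual Transformer prediction step:

```python
import random

def generate_modalities(input_tokens, modality_lengths, predict_token, seed=0):
    """Decode one modality at a time, feeding each sampled token back in."""
    rng = random.Random(seed)
    context = list(input_tokens)
    outputs = {}
    for modality, length in modality_lengths.items():
        generated = []
        for position in range(length):
            # predict & sample the next token, then add it back to the input
            token = predict_token(context, modality, position, rng)
            generated.append(token)
            context.append((modality, position, token))
        outputs[modality] = generated  # modality fully decoded; move on
    return outputs

# A dummy sampler drawing from a 1024-token vocabulary.
dummy_sampler = lambda ctx, m, p, rng: rng.randrange(1024)
out = generate_modalities([("rgb", 0, 7)], {"depth": 4, "semseg": 4}, dummy_sampler)
```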
We would need a large and diverse multimodal dataset to train such a model. Existing datasets are either too small or not diverse enough, so we instead start from image & text pairs then use off-the-shelf pseudo-labeling networks to generate the remaining modalities.
(4/n)
Though the unicorn of robotics might well be at a supermarket, construction site, or a warehouse, rather than a home. Related to the recent @nytimes article by @markoff, the piece on "The Hunt for Robot Unicorns" by @IEEESpectrum was a good read too.
@Michael_J_Black @docmilanfar Agreed. I often tell students we have letters because some critical information is lost in common metrics and standardized tests (GPA, paper/citation count, school name, etc). That’s their purpose and they should serve it however it makes sense. Good for avoiding survivorship bias.
4M’s any-to-any generation and in-painting capabilities enable fine-grained multimodal generation and editing tasks, such as performing semantic edits or grounding the generation in extracted intermediate modalities.
(7/n)
This approach makes it convenient to add new modalities from diverse formats (e.g. images, sequences, neural network feature maps, etc). We already trained models that can jointly operate on 20+ modalities/tasks and are adding more.
(6/n)
The method basically augments standard supervised learning objective w/ explicit cross-task consistency constraints. The constraints are learned from data; no need for differentiable or apriori known constraints. We start with a consistent "triangle" and extend to larger graphs.
MultiMAE is trained *entirely using pseudo labels*, making it applicable to any RGB dataset without any annotations. It can be flexibly transferred to tasks where more than just one modality is (optionally and arbitrarily) available, with notable performance benefits. 4/5
"I was not in that moment as a journalist or a woman going to put a headscarf on and somehow bind myself." CNN's
@amanpour
on refusing to wear a headscarf for her interview with Iran's president Ebrahim Raisi
@tsimonite @SergeBelongie @nisselson The conclusion that the simulation-to-reality gap is about to disappear is shortsighted, IMO. The biggest obstacle #sim2real faces is not photorealistic rendering, but matching the semantic complexity of the real world in simulation. Good luck creating a full messy bedroom in simulation.
Cross-Task Consistency is quite useful for standard single-task learning too, not just multitask. Simple conclusion: instead of training your network to do X→Y1, train it to do X→Y1→Y2. It will fit the data better with improved Y1 predictions. We extend this to larger configs.
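The X→Y1→Y2 idea above can be sketched as a loss (a toy illustration, not the paper's implementation): the usual supervised term on Y1, plus a penalty requiring the Y1→Y2 mapping of the prediction to agree with that of the ground truth.

```python
def cross_task_loss(y1_pred, y1_true, y2_from_pred, y2_from_true, alpha=1.0):
    """Supervised X→Y1 loss plus a cross-task consistency penalty."""
    # standard supervised term on Y1
    supervised = sum((a - b) ** 2 for a, b in zip(y1_pred, y1_true))
    # consistency term: X→Y1→Y2 should match the Y2 derived from ground truth
    consistency = sum((a - b) ** 2 for a, b in zip(y2_from_pred, y2_from_true))
    return supervised + alpha * consistency

loss = cross_task_loss([1.0, 2.0], [1.0, 1.0], [0.5], [0.0], alpha=2.0)
# 1.0 + 2 * 0.25 = 1.5
```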
@MattNiessner I see. Unsurprising. There is a disproportionate focus on fixing the diversity issue close to the end of the pipeline (PhD student level, postdoc level, faculty level). That's way too late and mostly fixes only the cosmetics. We need to start much earlier.
@AjdDavison Well, just like with many other things, scaling up is one big issue 🙂 In terms of both scene size and the required density of images. I won’t be surprised if scaling brings in some of the classic mechanisms that are written off now. But things are moving fast in this space, so...
@abigail_e_see @skynet_today 2. Clickbait titles/pictures: they are probably the fastest way to get traffic, but fast doesn’t mean good. Concise and descriptive > catchy and inaccurate. Be a responsible journalist/blogger/presenter, even if it costs you some attention in the short run.
Using a closed-loop formulation is common in control theory and robotics for solving (hard) problems. RNA uses a side controller network (h) to interpret a feedback signal and adapt a given pre-trained network (f). It is implemented by inserting FiLM layers into f.
2/n
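A FiLM layer itself is just a per-channel affine transform. A minimal sketch (illustrative, not the paper's code), where the scale and shift would be predicted by the side network h:

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta, per channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

# gamma/beta would come from h; here they are hard-coded for illustration.
adapted = film([1.0, -2.0, 0.5], gamma=[2.0, 1.0, 0.0], beta=[0.0, 1.0, 3.0])
# → [2.0, -1.0, 3.0]
```

Because h emits gamma and beta in a single forward pass, adapting f needs no gradient steps at test time, which is where the speedup over test-time optimization comes from.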
We experimented with a set of signals that are practical for real-world use. However, those signals are also imperfect, so in the paper we also perform controlled experiments using ideal signals to isolate the actual performance of RNA.
5/n
The experiments are on several tasks, eg depth, semantic segmentation, 3D reconstruction, ImageNet, & on a range of distribution shifts. We also provide a discussion on the landscape of related formulations.
Joint w/ @aseretys, @oguzhanthefatih, Zahra
n/n
@zacharylipton @IBM What’s the “AI” in there? I read multiple articles (by @IBM & others) and this seems mostly like a database integration. The fact that they keep shoving the word “AI” into it to get attention and turn it into a PR campaign is extra alarming if this really benefits the less fortunate.
@igubins It was just a random 0.25% sample of the full training dataset. The goal was to evaluate whether the trends hold under a low data regime too. We didn't think about putting the sample indexes on Github. We could. I believe any iid random sample would do.
@colinraffel Talk titles are even more amazing!! "Learning Internal Reps From Multiple Tasks", "Identifying Relevant Tasks", "Where is Multitask Learning Useful?", "Combining supervised and unsupervised learning, where do we go from here?", "Continual Learning"
The side network h has ~5-20% of the number of parameters of f. It is trained to predict how f should be updated -- so it amortizes the optimization (takes only a feedforward pass), making it much (~30x) faster than performing test-time optimization using SGD (TTO).
3/n
@andrey_kurenkov @Bschulz5 @elonmusk @skynet_today Scaling is easier than inventing. If we knew how to make AGI, likely 2xAGI or 10xAGI would be quick, so @elonmusk might be right on that. But the missing piece rn is the G in AGI. And I suspect we're inconceivably far from it. Otherwise nX human-level would already be here for narrow tasks.
Those who found inaccuracies in the table according to your experience, consider directly reporting the error to the source to update their stats: administration@informatics-europe.org. I sent them an email inviting them to look at the reported inaccuracies in this thread.
@hardmaru @erwincoumans A (quantitative) answer to the generalization question through a study is brewing. Sneak peek: perception and dynamics aspects should be viewed and analyzed separately wrt generalization. Their generalization traits don't appear to correlate strongly. (opportunity or threat?)
@LauTor83 @ArnoutDevos brought that up, and I reported the error to the source a bit ago. My PhD students don't quite get that total amount either, but it seems the reported numbers for all countries are higher (eg comments about Germany). Some tax/employment-rate adjustment might be in play?
Introducing – Paragraphica! 📡📷
A camera that takes photos using location data. It describes the place you are at and then converts it into an AI-generated "photo".
See more here:
or try to take your own photo here: