The future is hard to anticipate! In our latest
#CVPR2021
paper, we introduce a framework for learning *what* is predictable in the future.
Rather than committing up front to categories to predict, our approach learns how to hedge its bets.
Learning unsupervised machine translation is easier if you open your eyes!
Image distributions create transitive relations between languages, providing incidental supervision for learning multilingual representations across 50 unpaired languages.
@Surisdi
What causes adversarial examples? Latest
#ECCV2020
paper from
@ChengzhiM
and Amogh shows that deep networks are vulnerable partly because they are trained on too few tasks. Just by increasing the number of tasks, we strengthen robustness on each task individually.
Oops! Dave+Bo introduce a dataset of unconstrained videos showing unintentional action. We study self-supervised approaches for learning video representations of intentionality.
#CVPR2020
Poster 93, Tue 10am PST
Website:
Paper:
Our predictive model is hyperbolic, which naturally encodes hierarchical structure.
When the model is most confident, it will predict at a concrete level of the hierarchy. But when not confident, the *mean* solution automatically selects a higher level!
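A minimal numpy sketch of this effect in the Poincaré ball (the coordinates and hierarchy labels here are hypothetical, and a plain Euclidean average stands in for the Fréchet mean): averaging two confident but conflicting predictions near the boundary yields a point closer to the origin, i.e., a more abstract node.

```python
import numpy as np

# Two hypothetical predictions in the Poincare ball (norm < 1).
# Both are confident (near the boundary) but point at different leaves,
# e.g. "dog" vs "cat".
p1 = np.array([0.90, 0.10])
p2 = np.array([0.10, 0.90])

# Euclidean average as a simple stand-in for the Frechet mean.
mean = (p1 + p2) / 2

# Distance from the origin acts as a proxy for depth in the hierarchy:
# near the boundary = specific leaf, near the origin = abstract ancestor.
print(np.linalg.norm(p1))    # ~0.906, deep / specific
print(np.linalg.norm(mean))  # ~0.707, shallower / more abstract ("animal")
```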
Learning from Unlabeled Video Workshop -- starting now!
First up: Andrea Vedaldi (Oxford) on Learning Representations and Geometry from Unlabelled Videos.
Got many replies. I don't believe the problem has to do with neural nets. The problem is the paradigm of supervised classification and closed datasets. We need models that learn from an open world with self-supervision, never stop learning, and transfer between tasks.
With just a few hours of experimentation in the physical world, a robot can learn on its own to design and throw paper airplanes farther than a person, and even learn to build robot grippers out of cheap paper.
No foundation models. No simulation. No language.
Humans can design tools to solve various real-world tasks, and so should embodied agents. We introduce PaperBot, a framework for learning to create and utilize paper-based tools directly in the real world.
Our new paper (w/
@Surisdi
,Dave) shows Transformers can meta-learn a process for language acquisition from vision. At inference, the policy adapts to new words and generalizes better.
#CVPR2020
Paper:
Talk: Mon 11:40am PST
I am so excited to be part of this dream team. We will be investigating the next generation of ML and predictive models for truly planetary scale problems. If you are passionate about cutting-edge ML coupled with societal impact, please apply to Columbia for various positions!
Hurricane Ida made one thing clear: we are not prepared for the extreme weather caused by
#climatechange
. A new climate modeling center is designed to improve climate projections and encourage societies to plan for the inevitable disruptions ahead.
@NSF
Sssshhh!! There is so much noise in cities today. Ruilin and Rundi introduce a new approach that removes ambient noise from audio, letting the speech come through loud and clear. Let's have a listen... 🔊Turn on your speakers!
🔗
Learn about Learning from Unlabeled Videos at
#CVPR2019
, Sunday in Room E, 9:00am
Fresh posters and keynotes: Antonio Torralba, Noah Snavely, Andrew Zisserman, Bill Freeman, Abhinav Gupta, Kristen Grauman
Announcing the Workshop on Learning from Unlabeled Video at CVPR 2019. Come for dynamite speakers, and stay for the abstracts! Abstract deadline is March 4. Topics include self-supervised learning, sound and vision, visual anticipation, active vision, and more.
@jmhessel
“Self-supervised” is a rebranding for “unsupervised” to avoid confusing people who ask Qs like “how can LMs be unsupervised if you give them the next token to predict”? I dislike rebranding, but I dislike even more arguing about whether LMs are unsupervised. So,🤷♂️?
How can we tell "what is where" inside a container, after dropping something into it? Can we generate visual scenes from sound?
Excited to share our latest work: The Boombox: Visual Reconstruction from Acoustic Vibrations. ()
Excellent piece, but I disagree we should give up our datasets. To get commonsense and generalization, we should create rich & diverse multi-modal datasets that span a huge number of tasks. We probably need new data collection means, e.g. interaction and self-supervision (not MTurk).
Congratulations to
@Surisdi
and Ruoshi Liu on their
#CVPR2021
paper!! Check out the video below for an hour-long talk with all the details and results!
Most predictive models operate in Euclidean space. However, when there is uncertainty or multiple modes, the optimal solution is to regress the mean, which often lacks any interpretation.
Our idea: Let’s make the mean mean something!
Predictive models on physical robots learn rich features about their surroundings -- they learn about obstacles and even the policy of other robots. Latest paper with
@BoyuanChen1
and
@hodlipson
, out today!
Can a robot be empathetic?
@MechCU
Prof
@hodlipson
thinks so: his lab has created a robot that learns to visually predict how its partner robot will behave.
@Columbia
Hyperbolic geometry for machine learning and computer vision is a young and rapidly growing area. We are not the first to work with this geometry, and we will not be the last!
Code, models, data, visuals, and links to tutorials are on our project website:
@haldaume3
You can silently and randomly add questions that have a single, well-defined answer that you also know. Then discard all workers who fail those "quiz" questions.
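A minimal sketch of that gold-question filter (the question IDs, answers, and field names below are all made up for illustration):

```python
# Hypothetical "gold" quiz questions that were silently mixed into the
# real annotation tasks, each with one known, well-defined answer.
GOLD_ANSWERS = {"q17": "cat", "q42": "bicycle"}

def reliable_workers(responses, min_accuracy=1.0):
    """Keep workers whose accuracy on the gold questions is high enough.

    responses: {worker_id: {question_id: answer}}
    """
    keep = []
    for worker, answers in responses.items():
        gold = [answers.get(q) == a for q, a in GOLD_ANSWERS.items()]
        if gold and sum(gold) / len(gold) >= min_accuracy:
            keep.append(worker)
    return keep
```

For example, a worker who answers "dog" on q17 would be discarded along with all of their other annotations.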
The main idea: Natural audio will contain intervals of silence, which we can leverage as incidental supervision for learning to denoise. By learning to first detect these pauses, we can estimate a profile for the noise, and suppress it throughout the audio.
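A hedged sketch of this idea using classical spectral subtraction. In the paper the silent interval is predicted by a learned model; here it is simply handed to the function, and the function name and signature are mine, not theirs:

```python
import numpy as np

def denoise(audio, sr, silent_interval, n_fft=512):
    """Suppress stationary noise using a known silent interval.

    silent_interval: (start_sec, end_sec) where only noise is present.
    """
    s, e = (int(t * sr) for t in silent_interval)
    # Estimate the noise magnitude profile from the silent segment.
    frames = np.lib.stride_tricks.sliding_window_view(audio[s:e], n_fft)
    noise_mag = np.abs(np.fft.rfft(frames[::n_fft // 2], axis=-1)).mean(axis=0)

    # Subtract the noise profile throughout the clip, frame by frame.
    out = np.zeros_like(audio)
    for i in range(0, len(audio) - n_fft, n_fft):
        spec = np.fft.rfft(audio[i:i + n_fft])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        out[i:i + n_fft] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n_fft)
    return out
```

Running this on a clip whose first 0.4 s contain only noise measurably reduces the residual energy in the noise-only region while keeping the speech band.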
Here’s an example. As the model observes more of the video, the future becomes more and more predictable.
Our model makes increasingly specific forecasts of the future.
Just by changing predictive models to work in hyperbolic space instead of Euclidean space, the model automatically learns to select the right level of abstraction under uncertainty!
Didac Suris (
@Surisdi
), one of our PhD students, won a Microsoft Research Fellowship (
@MSFTResearch
)! Learn more about him and his PhD experience here -
Recently released video generation models are amazing😍
How can we use them in robotics to learn generalizable visuomotor policies?
Come find out in my talks at these 4 CVPR workshops next week, where I will talk about recent works in 3D, generative models, and robotics!
While each language represents a bicycle with a different word, the underlying visual representation remains consistent. A bicycle has a similar appearance in the UK, France, Japan, and India. We leverage this natural property for translating unpaired languages.
@dimadamen
@fdellaert
@Oxford_VGG
Video should be on YouTube next week. Thanks, everyone, for attending and for the great questions, and a special thanks to Yale Song for leading things behind the scenes!
The approach finds very interesting transitive paths between languages via vision, which we show below. When there is a strong path, the final score is high (top row); when the path is not well aligned, the score is low (bottom row).
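A toy sketch of the path-scoring intuition, assuming we already have embeddings for words and images in a shared space (all vectors below are made up): the path score composes similarities along source word → bridging image → target word.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def path_score(src_word, tgt_word, images):
    """Score a transitive translation path through the best bridging image."""
    return max(cos(src_word, img) * cos(img, tgt_word) for img in images)

# Hypothetical embeddings: "velo" (FR) and "bicycle" (EN) both align with
# a bike image, so the transitive path through vision scores highly.
bike_img   = np.array([1.00, 0.10])
velo_fr    = np.array([0.90, 0.20])
bicycle_en = np.array([0.95, 0.15])
cheese_en  = np.array([0.10, 1.00])

print(path_score(velo_fr, bicycle_en, [bike_img]))  # high: path well aligned
print(path_score(velo_fr, cheese_en, [bike_img]))   # low: path breaks down
```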
Since hyperbolic space is continuous, the hierarchy is actually continuous as well! This lets us work with hierarchies of any depth. Here's an example three levels deep.
We show pairwise performance between source and target languages. As you might expect, languages within the same family are easier to translate between. But our approach is language agnostic and makes no assumptions about grammar or vocabulary.
The full dataset is available online!
You don't always need large datasets to do "real" research in DL (a common misconception). Take a look at CycleGAN for a counterexample. A beautiful paper, with a relatively small amount of data:
The predictions are initially near the origin of the space, which corresponds to predicting the “root” node of the hierarchy. But over time, the prediction moves closer to the boundary of the space, corresponding to more specific forecasts.
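One way to picture the radius-to-abstraction mapping is a tiny decoder that buckets a prediction's distance from the origin into a discrete hierarchy level (the bucketing rule and level names here are hypothetical, just to make the geometry concrete):

```python
import numpy as np

def decode_level(prediction, num_levels=3):
    """Map a point in the Poincare ball to a discrete hierarchy level.

    Radius near 0 decodes to the root; radius near 1 decodes to a leaf.
    """
    r = np.linalg.norm(prediction)  # 0 <= r < 1 inside the ball
    return min(int(r * num_levels), num_levels - 1)

print(decode_level(np.array([0.05, 0.02])))  # 0 -> root ("entity")
print(decode_level(np.array([0.40, 0.30])))  # 1 -> mid-level ("animal")
print(decode_level(np.array([0.70, 0.60])))  # 2 -> leaf ("dog")
```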
@CSProfKGD
@alfcnz
I ended up using Screenflow, and I found it fantastic. It jointly records your screen, audio, and webcam. There is a simple UI to create different scenes.