🚨 Big news 🚨
Together with a set of amazing folks we decided to start a company that tackles one of the hardest and most impactful problems - Physical Intelligence
In fact, we even named our company after that: or Pi (π) for short
🧵
Introducing RT-1, a robotic model that can execute over 700 instructions in the real world at 97% success rate!
Generalizes to new tasks✅
Robust to new environments and objects✅
Fast inference for real time control✅
Can absorb multi-robot data✅
Powers SayCan✅
🧵👇
Introducing the 540 billion parameter Pathways Language Model. Trained on two Cloud #TPU v4 pods, it achieves state-of-the-art performance on benchmarks and shows exciting capabilities like mathematical reasoning, code writing, and even explaining jokes.
Super excited to introduce SayCan (): 1st publication of a large effort we've been working on for 1+ years
Robots ground large language models in reality by acting as their eyes and hands while LLMs help robots execute long, abstract language instructions
Have you ever “heard” yourself talk in your head? Turns out it's a useful tool for robots too!
Introducing Inner Monologue: feeding continual textual feedback into LLMs allows robots to articulate a grounded “thought process” to execute long, abstract instructions 🧵👇
Very excited to announce our largest deep RL deployment to date: robots sorting trash end-to-end in real offices!
(aka RLS)
This project took a long time (started before SayCan/RT-1/other newer works) but the learnings from it have been really valuable.🧵
PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions?
Introducing RT-2: our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions!
Super excited to announce that I've started as an Adjunct Professor @Stanford!
I'll continue to work @GoogleAI but I'll also be spending some time at Stanford, where I'll be co-advising a few students and continue co-teaching CS 330 () 🧑🏫
Here are a few examples from our work in robotics that leverage bitter lesson 2.0
This is something that I believe we'll see a lot more of (including our own work in 2023)🧵
⚠️🇺🇦🇺🇦🇺🇦🇵🇱🇵🇱🇵🇱⚠️
If you have friends/family trying to escape Ukraine and are looking for shelter in Poland, please DM me.
My family and I can provide transport from the PL/RO border to my hometown (Koszalin) and shelter (+food/school/work etc) for as long as needed.
Our 2021 CS330 () lectures are online:
It was a pleasure to co-teach this class with @chelseabfinn. Topics incl. meta-learning, MTL, few-shot learning, deep RL (incl. multi-task, meta, goal-conditioned, hierarchical and offline RL)
Our most recent work showing bitter lesson 2.0 in action:
using diffusion models to augment robot data.
Introducing ROSIE:
Our robots can imagine new environments, objects and backgrounds! 🧵
The Bitter Lesson by @RichardSSutton is one of the most insightful essays on AI development of the last decades.
Recently, given our progress in robotics, I’ve been trying to predict what the next bitter lesson will be in robotics and how we can prevent it today.
Let me explain 🧵
Immigrants in the US are particularly vulnerable to #COVID19 layoffs. If you're on an H1-B work visa and you get laid off, you need to find another job within 60 days. Almost nobody will be hiring for the next 60 days. We need a change of this policy ASAP!
I've been in robotics and AI for 10+ years and this is the first time it feels like there is at least a light at the end of the tunnel (for robotics; for AI it's more like we're staring at the sun)
We have some exciting updates to SayCan! Together with the updated paper, we're adding new resources to learn more about this work:
Interactive site:
Blog posts: and
Video:
AI drone racer finally beats human world champions!
Recipe:
• state estimation with KF + gate-based measurements as additional KF updates
• small deep RL policy trained in sim
• residual controllers on top
Nature article:
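The recipe above, in a toy sketch: the key trick is that spotting a gate with a known position gives the Kalman filter an extra measurement update. All numbers below are made up for illustration.

```python
# Minimal 1-D sketch (made-up values): a gate detection acts as an extra
# position measurement in a standard Kalman filter update.

def kf_update(mean, var, z, meas_var):
    """Scalar Kalman measurement update."""
    k = var / (var + meas_var)              # Kalman gain
    return mean + k * (z - mean), (1 - k) * var

# The prediction drifted to 10.0 with high uncertainty...
mean, var = 10.0, 4.0
# ...but a gate with known position 12.0 is detected with low noise,
# pulling the estimate toward 12 and shrinking the variance.
mean, var = kf_update(mean, var, z=12.0, meas_var=1.0)
print(mean, var)
```

The same update rule extends dimension-by-dimension to full drone state; the RL policy and residual controllers then run on top of the filtered state.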
We are thrilled to share our groundbreaking paper published today in @Nature: "Champion-Level Drone Racing using Deep Reinforcement Learning." We introduce "Swift," the first autonomous vision-based drone that beat human world champions in several fair head-to-head races! PDF
If you want to understand why robotics is much harder than it seems, @ericjang11 pointed me once to this essay that does a pretty good job explaining it:
Reality has a surprising amount of detail
Large language models are universal computers that operate on tokens.
They happened to be trained on language because we had a lot of it, but don't get it twisted - once trained, they are universal computers.
🚨 New (Offline) RL Method 🦾 🚨
Introducing Q-Transformer - new RL approach that works at scale with large models and many tasks.
This is the best method we found so far that works with demos and autonomous (also negative) data at large scale. 🧵
Offline RL strikes back! In our new Q-Transformer paper, we introduce a scalable framework for offline reinforcement learning using Transformers and autoregressive Q-Learning to learn from mixed-quality datasets!
Website and paper: 🧵
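The autoregressive part of the idea can be sketched in a few lines: treat each action dimension as a token and pick it greedily conditioned on the dimensions chosen so far. The toy Q-function below is a made-up stand-in, not the trained Transformer.

```python
# Sketch of autoregressive action selection: each action dimension is
# discretized into bins, and the Q-value for dimension i is conditioned
# on the prefix of dimensions already selected. q_fn is a hypothetical stub.

def select_action(q_fn, num_dims, bins):
    action = []
    for dim in range(num_dims):
        # Greedy argmax over this dimension's bins, given the chosen prefix.
        best = max(range(bins), key=lambda b: q_fn(dim, tuple(action), b))
        action.append(best)
    return action

# Toy Q-function: prefers bin == dim, with a tiny bonus from the prefix.
toy_q = lambda dim, prefix, b: -abs(b - dim) + 0.01 * sum(prefix)
print(select_action(toy_q, num_dims=3, bins=4))  # [0, 1, 2]
```

Decomposing the action this way keeps the per-step maximization tractable even with many dimensions, which is what lets Q-learning scale to Transformer-sized models.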
ImageNet moment for robotics is finally here 🚀🦾
Introducing RT-X model & Open X-Embodiment dataset:
Truly wonderful robotics community effort by 34 labs (173 authors!) led by @QuanVng showing that a general cross-embodied robotic brain is possible 🧵
Lastly, another work from our lab @GoogleAI, this time without any foundation models: just super fast, agile, state-of-the-art control using modern and classical methods 🦾
3. Agile catching:
One (somewhat forgotten) valuable aspect of RL is its ability to continuously improve.
We present a simple fine-tuning method that shows this in challenging (often ridiculous) real-world scenarios.
Paper:
Website:
Thread 👇
I heard a wonderful analogy for AI progress from @demishassabis:
It's hard to imagine the industrial revolution if we didn't happen to have dinosaurs under ground that we could use as fossil fuel. Without them, we'd need to wait to discover solar or nuclear.
In PIVOT we show how you can use VLMs without any fine-tuning for a wide range of control tasks - it's pretty crazy.
We're just at the beginning of discovering what VLMs can do for control.
website:
demo:
How do you get zero-shot robot control from VLMs?
Introducing Prompting with Iterative Visual Optimization, or PIVOT! It casts spatial reasoning tasks as VQA by visually annotating images, which VLMs can understand and answer.
Project website:
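My reading of the PIVOT loop, as a toy sketch: sample candidate 2-D points, have the VLM pick the best annotated candidates, refit the sampling distribution around the picks, and repeat. The "VLM" below is a stub that prefers points near a hidden target; the real system annotates images and queries an actual VLM.

```python
import random

# Iterative visual optimization, CEM-style sketch (all parameters made up).
def pivot(vlm_pick, iters=6, samples=30, keep=5):
    mu, sigma = (0.5, 0.5), 0.3
    for _ in range(iters):
        cands = [(random.gauss(mu[0], sigma), random.gauss(mu[1], sigma))
                 for _ in range(samples)]
        picks = vlm_pick(cands, keep)                 # "VLM" selects candidates
        mu = (sum(p[0] for p in picks) / len(picks),  # refit mean to the picks
              sum(p[1] for p in picks) / len(picks))
        sigma *= 0.7                                  # shrink the search
    return mu

target = (0.8, 0.2)
dist = lambda p: (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2
stub_vlm = lambda cands, k: sorted(cands, key=dist)[:k]
random.seed(0)
mu = pivot(stub_vlm)
print(mu)  # converges near the stub's preferred point
```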
Growing up I had no idea that one can work on intelligent robots. Even in my undergrad (in robotics!), it seemed like all of robotics was about industrial automation.
It all changed one day when I was looking for a bathroom.
There are many reasons to work on embodied AI.
One of them is that active data collection seems to allow for much better learning than passive observation, even if both active and passive agents are exposed to the same stimuli.
Here are a few famous experiments on that
🧵
We put everything together to see what RT-1 can enable in combination with SayCan. To make this more challenging, we evaluate RT-1 in two kitchens. We get 67% in Kitchen1 and the same perf in unseen Kitchen2, both significantly better than the baselines. Example execution:
Robotics isn't easy even in the era of foundation models - you have to truly love it and be very committed, but it is possible.
Having said that, it feels like for the first time in my career I see a path to making robotics work. 🦾
.@ilyasut on why OpenAI gave up on robotics:
"Progress comes from the combination of compute and data.
There was no path to data from robotics.
I'd say that now it is possible to create a path forward, but one needs to really commit.
You could imagine a gradual path of
Training RL from scratch can be very hard; if there is any prior policy you can use to help, you should.
But using prior policies with value-based RL is difficult for various RL reasons.
Introducing Jump-Start RL: a simple, widely applicable method that addresses this problem.
A key challenge in #ReinforcementLearning is learning policies from scratch in environments with complex tasks. Read how a meta-algorithm, Jump-Start Reinforcement Learning, uses prior policies to create a learning curriculum that improves performance →
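The core jump-start idea, in a toy sketch: roll in with the prior "guide" policy for the first h steps of each episode, hand over to the learning policy, and shrink h as the learner improves. Environment and policies below are stand-ins.

```python
# Sketch of the jump-start curriculum: guide policy covers the first h steps,
# then the learning policy takes over. Both policies are toy stubs here.

def episode(guide, learner, horizon, h):
    actions = []
    for t in range(horizon):
        policy = guide if t < h else learner
        actions.append(policy(t))
    return actions

guide = lambda t: "guide"
learner = lambda t: "learn"
# Curriculum: start with the guide covering most of the episode, then less.
for h in (4, 2, 0):
    print(episode(guide, learner, horizon=5, h=h))
```

Because the guide drops the learner off near good states, the learner only has to master the tail of the task at first, which is what creates the curriculum.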
1) We updated the underlying LLM to PaLM (), resulting in PaLM-SayCan. This resulted in an interesting trend:
Improving the underlying LLM resulted in much higher robotics (!) performance (halving the errors)
There is a new format for @GoogleAI (Brain) internships:
1) If you are in your final year, apply for an internship:
2) If you are at the beginning of your PhD, apply for a student researcher position:
Not many people realize (or share the belief) that with enough data, robotics will be solved.
If that's the case, we simplified a very hard problem into a much easier problem:
robotics -> how to get a lot of robot data?
Super excited to announce this year’s Deep RL workshop @NeurIPSConf 🎉🎉🎓🎓
website:
submission deadline: September 22nd, 11:59 PM PST
Some exciting changes that we made this year in 🧵👇
I think there might be a lot of imposter syndrome specifically around 3D rotations and coordinate transforms. So let it be known that I have struggled and continue to struggle debugging these issues! It is actually hard for everyone! It's ok if it feels hard!
For folks asking: yes, we fully open-sourced the RT-1/RT-2 dataset as part of
Not only can you train on that data now, but you can also merge it with 60 other datasets that are all in the same format, with colabs etc. showing how to do it.
Based on the success of large models in other fields, our goal was to build a model that acts as an “absorbent data sponge”.
It should get better with diverse, multi-task data. The more data the model can effectively absorb, the better it will be able to generalize.
RL has many different gifts and it's easy to get confused about which ones are the most important, especially in the age of gigantic BC-based models.
Here is a short, non-exhaustive list:
• mining hard negatives
• stitching (getting more out of the same data)
• optimizing an
I sure talk a lot about foundation models for robotics and how we can ride their wave 🌊 but this time we've trained our own:
We present PaLM-E: An Embodied Multimodal Language Model
Danny has some incredible results in 🧵👇 so hold on tight!
Very cool work from my colleagues at Robotics @ Google
If you want to have multimodal models, you don't have to train them, just have individual foundational models talk to each other
Who would've thought that the best language for AI to communicate with itself would be our language
With multiple foundation models “talking to each other”, we can combine commonsense across domains, to do multimodal tasks like zero-shot video Q&A or image captioning, no finetuning needed.
Socratic Models:
website + code:
paper:
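The "models talking to each other" pattern can be sketched in a few lines: a vision model's caption gets spliced into an LLM prompt, with no finetuning anywhere. Both models below are stubs standing in for real APIs.

```python
# Sketch of Socratic-style composition: vision model output becomes text,
# text becomes LLM input. caption_model and llm are hypothetical stubs.

def socratic_qa(caption_model, llm, image, question):
    caption = caption_model(image)                    # vision -> language
    prompt = f"Image: {caption}\nQ: {question}\nA:"   # language -> LLM
    return llm(prompt)

cap = lambda img: "a dog catching a frisbee in a park"
llm = lambda prompt: "a frisbee" if "catching" in prompt else "unsure"
print(socratic_qa(cap, llm, image=None, question="What is the dog catching?"))
```

Language acts as the shared interface: any model that can read or write text can join the conversation without retraining.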
Robotics is becoming the next frontier of AI research.
While there are many companies being built on top of existing LLM APIs and there is another wave of virtual agents companies coming, robotics remains in the sweet spot of still having many open problems & huge impact to be had.
How to know if a new scientific trend is legit?
Professors are skeptical, while students can't stop talking about it.
Students, not professors, set trends.
Max Planck put it in much more morbid terms:
A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it.
It's been a huge year for robotics and foundation models 🦾 🎉
Here's a thread showing some important (though def biased towards my team) works that happened in 2023:
Summary of robot table tennis work from our team:
This is a fairly complex learning system with multiple components and many lessons learned. I encourage you to take a look at the website for details.
I believe that the biggest application of the *multimodal* models will be robotics and embodied agents.
We're actively working on making this future happen.
@demishassabis's quote in the Wired article on Gemini by @willknight below:
Specifying rewards in real-world RL is hard - unsupervised RL can change that!
We present Off-DADS, an off-policy version of our DADS algorithm that is sample-efficient enough to allow experiments on real robots.
Arxiv
Video: 1/3
The next work from our lab @GoogleAI and collaborators is on using LLMs for gait generation for quadrupeds:
2. Language to locomotion:
How would you specify low-level control commands to a quadruped? 👇
Many researchers have asked us about sharing our RT dataset and making it easier to participate in large-scale robot learning research.
We're working on it and we'll have some updates on this soon! 👀
As for the model - yup, it's a Transformer😊We tokenize images using early language fusion with FiLM layers added to ImageNet-pretrained EfficientNet. We use TokenLearner to limit # of tokens (15ms total inference time) and output action tokens for the arm and base actions.
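The FiLM conditioning mentioned above can be sketched tiny: language features produce a per-channel scale (gamma) and shift (beta) applied to the image features. The "layers" here are made-up fixed weights, not the trained network.

```python
# Sketch of FiLM conditioning: gamma/beta come from a linear map of the
# language embedding and modulate image features channel-wise.
# All weights and feature values below are hypothetical.

def film(img_feats, lang_emb, w_gamma, w_beta):
    gamma = [sum(w * l for w, l in zip(row, lang_emb)) for row in w_gamma]
    beta  = [sum(w * l for w, l in zip(row, lang_emb)) for row in w_beta]
    return [g * x + b for g, x, b in zip(gamma, img_feats, beta)]

img = [1.0, 2.0]              # 2-channel image feature
lang = [0.5, 1.0]             # language embedding
w_gamma = [[2, 0], [0, 1]]    # -> gamma = [1.0, 1.0]
w_beta  = [[0, 1], [2, 0]]    # -> beta  = [1.0, 1.0]
print(film(img, lang, w_gamma, w_beta))  # [2.0, 3.0]
```

This early fusion is why the same image backbone can produce instruction-dependent features before any tokens reach the Transformer.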
@OpenAI demo on interactive Codex that debugs its own code and makes corrections on the fly.
Coding is changing before our eyes🤯
We'll post some exciting updates on robotics applications of this tech in a few days, stay tuned!
Speaking of data, we collected over 130k demos (17+ months with 13 robots).
The current set of skills includes picking, placing, opening/closing drawers, getting items in and out of drawers, placing elongated items upright, knocking them over, pulling napkins and opening jars.
I've had a few students ask me whether it makes sense to work on LLMs in robotics and robot learning methods, or whether Google is gonna solve it all.
I can't emphasize it enough - you definitely should! This is a very new space that you can explore way more efficiently than Google can.
Quite an emotional roller coaster with #NeurIPS2019: our initial submission had scores 9,7,6 which after rebuttal increased to 9,8,7. The paper eventually got rejected due to the intervention of the meta-reviewer who spotted a mistake in the derivation...
I usually talk about how robotics should ride the wave of foundation models, but yesterday I was told that another wave is coming: cheaper and more reliable robot hardware from Chinese manufacturers.
I hope that they could do for bi-arm robots what Unitree did for quadrupeds.
Looking forward to presenting Code as Policies this week #ICRA2023 @ieee_ras_icra!
Talk: Award Finalists 2, Tuesday 4PM
Poster: Thursday 9-10:40AM
Come see how code-writing LLMs enable robots to do novel & diverse tasks from language instructions!
AutoRT is here:
foundation models 🤝 robots at scale
We show how to use VLMs and LLMs to orchestrate a fleet of robots and allow 1:5 human:robot ratio 💪🦾
Led by @keerthanpg and @AlexIrpan
Fun fact: you can now literally write Asimov's laws into the
In the last two years, large foundation models have proven capable of perceiving and reasoning about the world around us, unlocking a key possibility for scaling robotics.
We introduce AutoRT, a framework for orchestrating robotic agents in the wild using foundation models!
"This rebuttal resolved all of my concerns and answered all the remaining questions. In fact, this is an exemplary piece of work that should be posted on every forum and be taught in every grad school. This paper is truly perfect now, thank you!
I maintain my score. Weak accept."
🚨 🚨 Another new work showcasing bitter lesson 2.0 🚨 🚨
Introducing MOO:
We leverage vision-language models (VLMs) to allow robots to manipulate objects they've never interacted with, and in new environments, while learning end-to-end policies. 🧵
What happens when we train the largest vision-language model and add in robot experiences?
The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language.
Website:
#MarsHelicopter is on the surface of Mars! Here is a picture of a little drone prototype that we were using at @NASAJPL to test our state estimation algorithms 5 yrs ago. It's come a long way!
Our new work on Grounded Decoding: shows how multiple models can participate in the LLM decoding process.
You can now ground your LLM token by token as long as you have a grounding model.
Imagine being an international student, you finished all your courses & you're getting close to graduation:
1. Your OPT (gives you a temp. work permit) will likely be cancelled.
2. H1B can be cancelled anytime too.
3. You can't really go home because of the pandemic.
And now:
New @ICEgov policy regarding F-1 visa international students is horrible & will hurt the US, students, and universities. Pushes universities to offer in-person classes even if unsafe or no pedagogical benefit, or students to leave US amidst pandemic and risk inability to return.
We're releasing one of the biggest real-world multi-task RL robotic datasets!
Our MT-Opt dataset has ~ 1M (!) RL trajectories and ~0.5M images used to train the MT-Opt Q function and the success detector.
See specs and more details:
The reason why physical AI is harder than generative AI is because of their respective customers.
Humans can easily forgive any errors; physics is unforgiving.
It seems like a recipe for a robot specialist is fairly clear:
• state estimation if you can
• deep RL with sim2real
• fine-tuning in real
Could be also done fully in real if you can collect the data.
It seems very different from robot generalist recipes of today.
My account was hacked over the weekend but it's secure now.
If you ever see me posting about crypto, either my account got hacked and you should contact me immediately, or I need an intervention and you should contact me immediately
In SayCan, we showed how we can connect robot learning pipelines to large language models, bringing a lot of common sense knowledge to robotics.
The hope was that as the LLMs become better (which they seem to be consistently doing), it will have a positive effect on robotics.
Many people are talking about a ChatGPT moment in robotics. Before we achieve a ChatGPT moment, we need the Internet moment first.
I hope that RT-X can contribute to that:
Maybe the fact that LLMs appear intelligent says more about us than it does about AI.
Many of the words we say just come from statistics of what word comes after another and not from a conscious thought.
It's surprisingly difficult to say something that breaks the statistics.
Imitation learning (IL) is good at getting baseline performance from demos quickly but it has trouble improving. RL takes a long time initially but it can continuously improve.
In AW-Opt (), we develop an IL+RL algorithm to combine the benefits of both.🧵
Can we learn meaningful behaviors and their dynamics without any rewards? Yes! And we can solve new tasks zero-shot by using MPC to compose the learned skills.
Dynamics-Aware Discovery of Skills (DADS):
w/ @archit_sharma97, Shane Gu, @svlevine, @Vikashplus
Basic idea:
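The zero-shot composition idea, in a toy sketch: given a learned per-skill dynamics model, search over short skill sequences with MPC and pick the one whose predicted terminal state minimizes the task cost. The dynamics below are a made-up stand-in for the learned DADS model.

```python
import itertools

# Sketch of MPC over learned skills: enumerate skill sequences, roll out the
# (learned) skill dynamics, keep the sequence with the lowest terminal cost.

def mpc_plan(dynamics, skills, state, goal, horizon=3):
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(skills, repeat=horizon):
        s = state
        for skill in seq:
            s = dynamics(s, skill)        # predicted next state under skill
        cost = abs(s - goal)              # terminal distance to goal
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq

# Toy skills move a 1-D state by fixed amounts.
dyn = lambda s, skill: s + {"left": -1, "right": +1, "stay": 0}[skill]
plan = mpc_plan(dyn, ["left", "right", "stay"], state=0, goal=2)
print(plan)
```

No reward was needed to learn the skills or their dynamics; the task cost only shows up at planning time, which is why new tasks can be solved zero-shot.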
With prompt engineering and scoring we can use LLM to break down an instruction into small, actionable steps.
This is not enough though: the LLM doesn't know about the scene, the embodiment, or the situation it's in. It needs what we call an affordance function!
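The SayCan scoring rule, in a toy sketch: combine the LLM's score for a skill being a useful next step with an affordance estimate that the skill can actually succeed here. All numbers below are made up for illustration.

```python
# Sketch of SayCan skill selection: pick the skill maximizing
# llm_score(skill) * affordance(skill). Scores are hypothetical.

def saycan_step(skills, llm_score, affordance):
    return max(skills, key=lambda s: llm_score[s] * affordance[s])

skills = ["find a sponge", "go to the table", "pick up the apple"]
llm_score  = {"find a sponge": 0.6, "go to the table": 0.1, "pick up the apple": 0.3}
affordance = {"find a sponge": 0.1, "go to the table": 0.9, "pick up the apple": 0.8}
# The LLM alone would pick the sponge; the affordance function vetoes it
# because no sponge is reachable in this scene.
print(saycan_step(skills, llm_score, affordance))  # pick up the apple
```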
Every single time I'm stuck on a coding problem, I have to repeat to myself that I should just try parts of it in a colab to understand it.
Action drives understanding way more than passive thinking does
There has been a long debate on model-based vs model-free RL. The classic arguments include:
• MBRL has richer learning signal
• MBRL is task-independent
• MFRL optimizes what you care about
• MFRL is data inefficient
I don't think this dichotomy is quite right.
🧵
The version of RLHF that is used rn is rumored to be a weak RL algo (more of a filtered BC) that optimizes for a human-based reward model.
The next versions will likely use more powerful RL + optimize for more long-horizon rewards like a conversation outcome.
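The "filtered BC" reading of RLHF can be sketched tiny: rank sampled responses with a reward model, keep the top fraction, and imitate only those. The reward model below is a made-up stub.

```python
# Sketch of filtered behavioral cloning: score samples with a reward model,
# keep the best ones as the imitation dataset. reward_model is hypothetical.

def filtered_bc_dataset(samples, reward_model, keep_frac=0.5):
    ranked = sorted(samples, key=reward_model, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]

samples = ["rude reply", "helpful reply", "off-topic reply", "great reply"]
reward = lambda s: {"rude reply": -1.0, "helpful reply": 0.7,
                    "off-topic reply": 0.0, "great reply": 0.9}[s]
print(filtered_bc_dataset(samples, reward))  # ['great reply', 'helpful reply']
```

A stronger RL algorithm would instead propagate credit across a whole conversation rather than scoring each reply in isolation, which is the long-horizon direction hinted at above.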
Resets are one of the most limiting, often under-emphasized requirements of current robotic RL methods. They are hard to automate and scale to multiple tasks.
We introduce Value-accelerated Persistent Reinforcement Learning (VaPRL) that tries to address this problem.
Remember the kitchen env from Relay Policy Learning? This time it's in real!
In DBAP, we create a system that continuously, autonomously improves on many tasks.
It's not enough to give demos to bootstrap tasks, demos should also bootstrap practicing!
🧵👇