Excited to announce DreamerV3 🌍, a scalable and general RL algorithm that masters a wide range of applications with fixed hyperparameters!
Applied out of the box, it solves the Minecraft Diamond challenge without human data. 💎
👇 Thread
Introducing DreamerV3: the first general algorithm to collect diamonds in Minecraft from scratch - solving an important challenge in AI. 💎
It learns to master many domains without tuning, making reinforcement learning broadly applicable.
Find out more:
What I value most about Jupyter notebooks is having all results and figures together in a doc.
Today I'm releasing Python Handout, a package that lets you create docs with inline figures, images, videos directly from Python scripts. ✨ 📰 ✨
Thread 👇
The full training run of the A1 quadruped robot learning to walk from scratch in the real world in 1 hour! Made possible by training a world model online and planning inside of it. Excited to see what we can do next with this!
@philippswu
@AleEscontrela
@Ken_Goldberg
@pabbeel
A dream come true! We introduce DayDreamer, where we apply world models for fast end-to-end learning on 4 physical robots, without simulators.
We learn quadruped walking from scratch in 1 hour. We also learn to pick & place balls directly from pixels and sparse rewards 🤖🌏👇
Excited to share Director, a practical, general, and interpretable reinforcement learning algorithm for learning hierarchical behaviors from pixels!
Director explores and solves long-horizon tasks with very sparse rewards by breaking them down into internal subgoals.
Thread 👇
Excited to present Clockwork VAEs for video prediction!
Clockwork VAEs (CW-VAEs) leverage hierarchies of latent sequences, where higher levels tick slower. They learn long-term dependencies across 1000 frames, semantically separate content, and outperform strong video models.
👇 Thread
World models are the future and the future is now! 🌎🚀
Proud to share DreamerV2, the first agent that achieves human-level Atari performance by learning behaviors purely within a separately trained world model.
Paper:
Thread 👇
Excited to introduce Dynalang, an interactive agent that understands diverse types of language in visual environments! 🤖💬
By learning a multimodal world model 🌍, Dynalang understands task prompts, corrective feedback, simple manuals, hints about out of view objects, and more
People want AGI so when they see meaningful progress in AI they think it might be THE ONE MISSING KEY. Many now try to massage LLMs into "AGI" but it won't work. LLMs are far from AGI (🥲) and only 1 piece of the solution. Focusing on the unsolved pieces would mean faster progress!
Excited to introduce Crafter! 🌴🤖💎
Crafter is a game that evaluates a wide range of agent abilities within a single env with visual inputs. It tests generalization, exploration, and long-term reasoning. Made for both reward agents and unsupervised agents
Thread 👇
We introduce Dreamer, an RL agent that solves long-horizon tasks from images purely by latent imagination inside a world model. Dreamer improves over existing methods across 20 tasks.
paper
code
Thread 👇
Excited to share our Deep Planning Network (PlaNet), an RL agent planning in latent space to solve control tasks from pixels. Now with Google AI post, animated paper, and open source code.
Post:
Paper:
Code:
What objectives can an intelligent agent optimize?
In this 3-year collab, we categorized the possible objectives. APD is a unifying principle that explains representation learning, reward, infogain exploration, empowerment, skill discovery, and niche seeking.
👇 Thread
Excited to share our Google AI Blog post on DreamerV2, the first RL agent based on a general world model to achieve human-level performance on the Atari benchmark! 🌏🤖🚀
Presenting DreamerV2, the first world model-based #ReinforcementLearning agent to achieve top-level performance on the Atari benchmark, learning general representations from images to discover successful behaviors in latent space. Read more at
Tried mixed precision yet? Took 10 min to set up and my model runs almost 2x faster with same results.
Vars and grads are still 32 bits so it usually doesn't affect predictive performance.
E.g. in TF2, set option and make all input to your layers float16 (data, RNN states, ..):
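The memory/speed win comes from a simple pattern: keep a float32 master copy of the variables and gradients, but run the forward math in float16. Here's a minimal numpy sketch of why the results barely change (the TF2 one-liner is roughly `tf.keras.mixed_precision.set_global_policy('mixed_float16')` in recent versions; sizes below are made up):

```python
import numpy as np

# Sketch of the mixed-precision pattern: parameters stay float32
# ("master weights"), while the forward pass runs in float16.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # fp32 master weights
x = rng.normal(size=(32, 256)).astype(np.float32)   # fp32 inputs

# Forward pass in half precision (what mixed precision computes).
y16 = x.astype(np.float16) @ w.astype(np.float16)

# Reference forward pass in full precision.
y32 = x @ w

# Relative error is tiny, which is why predictions are unaffected.
err = np.max(np.abs(y16.astype(np.float32) - y32)) / np.max(np.abs(y32))
print(err)
```

Since variables and gradient accumulation stay in float32, only the activations lose precision, and the rounding noise is far below the noise floor of SGD.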
🌎 Excited to share a major update of the DreamerV3 agent!
A couple of smaller changes, more benchmarks, and substantially improved performance.
👇 Main differences from our earlier preprint:
Current video gen models are breathtaking! But they aren't that useful for acting yet: prompt Sora with a photo & "find me a screwdriver" and it'll swing the camera to conveniently reveal one lying there for you, but in reality there won't be one
Let me clear a *huge* misunderstanding here.
The generation of mostly realistic-looking videos from prompts *does not* indicate that a system understands the physical world.
Generation is very different from causal prediction from a world model.
The space of plausible videos is
I'm excited about large general agents but I don't quite understand this paper. Surely you can fit many experts into a transformer with BC. The difficulty is to then learn new tasks faster. But 1000 expert demos to swing up a cart pole is worse than training PPO from scratch?
Gato🐈a scalable generalist agent that uses a single transformer with exactly the same weights to play Atari, follow text instructions, caption images, chat with people, control a real robot arm, and more:
Paper: 1/
Excited to share Evaluating Agents without Rewards!
We compare intrinsic objectives with task reward and similarity to human players. Turns out they all correlate more w/ human than w/ reward. Two of them even correlate more w/ human than reward does.
👇
For practitioners and researchers who want to solve hard reinforcement learning tasks without having to tune any knobs, DreamerV3 is now available on GitHub! 🧑💻🤖
Runs on 1 GPU, supports image/proprio/both inputs, discrete/continuous actions, etc
Exploring worlds by planning for expected novelty is what originally motivated PlaNet and Dreamer. Excited to share Plan2Explore, a new RL agent that explores to learn an accurate world model 🌍, independent of any task. SOTA zero-shot on DMControl 🚀
Thread 👇
RL agents get specific to tasks they are trained on. What if we remove the task itself during training?
Turns out, a self-supervised planning agent can both explore efficiently & achieve SOTA on test tasks w/ zero or few samples in DMControl from images!
Current RL algorithms still struggle under partial observability, which is common e.g. in real 3D environments. Excited to introduce the Memory Maze benchmark, carefully designed for evaluating long-term memory of RL algorithms! 🏠🤖🚀
@jurgisp
@countzerozzz
Video prediction has seen great progress recently but long videos are still inconsistent, e.g. when moving around 3D scenes. I'm excited to share Temporally Consistent Video Transformer (TECO), a scalable transformer that substantially improves learning of long dependencies! 🚀
Excited to announce TECO, an efficient video prediction model that can generate long, temporally consistent video for complex datasets in 3D scenes such as DMLab, Minecraft, Habitat, and real-world video from Kinetics!
📜
🌐
(🧵)
A big day for Python! The steering council has decided to accept PEP 703, making the GIL optional:
- Will unlock fast multithreading
- User code can stay exactly the same
- Experimental support planned for 3.13 (Oct 2024)
Very excited to present LEXA, a reinforcement learning agent that learns to achieve challenging goal images without any supervision, through forward-looking exploration with a world model 🌎🚀
How could we enable an agent to perform many tasks? Supervising for every new task is impractical.
We present Latent Explorer Achiever (LEXA) that explores by discovering goals far beyond the frontier and then achieves test tasks, specified via images, in a zero-shot manner.
Autograph turns Python if, while, assert, etc into the corresponding TensorFlow ops with a function decorator. This will make TensorFlow a lot faster to write and maintain, without sacrificing in-graph performance. Can't wait for the first stable release!
Proud to share our blog post on Dreamer, our latest scalable RL agent. Dreamer learns a world model from images & efficiently finds long-term behaviors by backprop through imagined states🚀
Post
Paper
Videos
RL shifts the question of what intelligent behavior is to finding a reward function. I think we should focus more on the choice of environment and reward function than on which RL algorithm to use. Is there theory for how properties of the env and reward affect the resulting behavior?
I've always felt that rewards/RL oversimplify behaviour. Maybe now there's a shift back to "Planning by Probabilistic Inference" (AISTATS, 2003):
Presents the simple idea that we condition actions on goals, with a desired return as a possible goal.
Planning from pixels using latent dynamics models (think sequential VAE). We solve cup catch, walker, and several other control tasks. Outperforms A3C and comparable to D4PG with 50x less experience.
#NeurIPS
Paper:
Website:
AI safety twitter: I asked for "mars rover cooking a meal in my apartment" but instead it's remodeling my whole apartment into a mars crater now
@images_ai
It's fantastic to see so many people interested in task-agnostic RL and making it to our workshop yesterday. Feels well worth the effort to organize, and like we actually did something good for the community :)
Recordings (starts at 23:30):
Updated my TensorFlow char-rnn to use a clean input pipeline. Also includes an interactive command line for generating text. See some text samples generated after training on ArXiv abstracts below.
The difference to Jupyter is that information flows only one way: from your script to the handout. No hidden state and no confusion about cell execution order. If this might be for you, please give it a try and let me know any feedback! 👉
@TalkRLPodcast
What I think is important are general inductive biases for learning from much less data, unbounded online improvement (beyond in context), latent goals for RL, compute efficient training, dealing with long sequences, sparse planning over relevant features, intrinsic exploration
Barista just handed me a cup with the words "I hope your day goes by fast." Funny how everyone assumes you don't like working. Makes me appreciate being a researcher and having found work that I love
After the A1 learned to walk in 1 hour, we started pushing the robot and applying external perturbations. Continuously learning in the real world, Dreamer adapts within 10 minutes to withstand pushes or quickly roll over and stand back up! No robots were harmed here
On a set of DMLab tasks, DreamerV3 exceeds IMPALA while using over 130x fewer environment steps. This demonstrates that the peak performance of DreamerV3 exceeds model-free algorithms, while reducing data requirements by two orders of magnitude. 🤖⚡
DreamerV3 is also the first algorithm to collect diamonds in Minecraft without human demonstrations or curricula, solving a big exploration challenge in AI. Here is the episode where it finds its first diamond, which happens at 30M env steps or 17 days of playtime. 🌴🏔️🛠️💎
Check out the paper with a lot of benchmarks!
Paper:
Website:
Code coming soon.
Big thanks to @jurgisp, @jimmybajimmyba, and @countzerozzz ✨
Happy to answer questions and go into details 🙋
DreamerV3 learns a world model 🌐 that predicts abstract outcomes of actions and uses it to train long-horizon behaviors in imagination. Predictions in symlog space and percentile return normalization enable successful learning across domains with fixed hyperparameters.
@vokaysh
@DeepMind
It's a hard exploration problem: There are sooo many possible sequences of button presses, but only very few are meaningful and accomplish all the necessary intermediate tasks. Hence, the diamond challenge has been an AI competition for several years
The key contribution of DreamerV3 is an algorithm that works out of the box on new application domains, without having to adjust hyperparameters. This reduces the need for expert knowledge and computational resources, making reinforcement learning broadly applicable. 📊📈
@dan_s_becker
Most RL algos are domain-specific and require a lot of data, limiting them to tasks where data is cheap. The text domain is quite broad and LLMs are already useful but might run into the same problem when we want them to read PDFs, browse the web, help us with research, etc
This was a super fun collaboration with @AleEscontrela and @philippswu, who both did a fantastic job! Thanks to @Ken_Goldberg and @pabbeel for supporting the project! ✨ To learn more, just ask or check out the links:
Website
Paper
@goodfellow_ian
Besides what's mentioned, it can help to initialize the output layer biases to the empirical class frequencies. Otherwise, it spends the first couple of epochs just learning those. I also like the idea of SkewFit by Pong et al. 2019
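One way to read the bias trick: for a softmax head, setting the output biases to the log of the empirical class frequencies makes the untrained model already output the class priors (the counts below are made up for illustration):

```python
import numpy as np

# Hypothetical label counts for an imbalanced 3-class problem.
counts = np.array([900, 90, 10], dtype=np.float64)
freqs = counts / counts.sum()

# Initialize output-layer biases to the log class frequencies.
biases = np.log(freqs)

# With zero-initialized weights, logits == biases, so the softmax
# output equals the empirical priors before any training step.
probs = np.exp(biases) / np.exp(biases).sum()
print(probs)  # matches freqs
```

This skips the first epochs the network would otherwise spend just learning the marginal label distribution.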
If you're using PlaNet or Dreamer and you have a slow GPU, you can often decrease the image resolution. The training curves for 64x64 and 32x32 look almost identical and the latter runs almost twice as fast. These two plots show eval performance and frames per second
Due to its robustness, we observe highly favorable scaling properties of DreamerV3. Increasing the model size directly translates not just to higher final performance but also improves data-efficiency! This gives us a path to scale up and solve harder problems. 📈🚀
To see if modern world models allow for fast robot learning, we train online on 4 robots. Starting on its back, the A1 quadruped learns to roll over, stand up, and walk in 1 hour without resets! Prior work required lots of simulation, footstep controllers, or reset policies
Releasing STEVE, our latest agent that learns both an uncertainty-aware dynamics model and a Q-function. Improves sample efficiency over DDPG by an order of magnitude while solving more difficult tasks. Feedback welcome!
Paper:
Code:
Making models like these useful for acting requires separating actions from outcomes, like in the world models we use in RL. Then the agent can be optimistic about the actions but neutral about their outcomes, rather than being optimistic about the outcomes
Deep reinforcement learning often needs too much trial and error to be practical on physical robots, which means one needs to train in simulation first. But simulators don't capture the complexity of the real world and the resulting policies don't adapt to changes in the world
Maybe I'm missing something? Despite being disappointed by the results, this is still great engineering and I'm excited for their future follow-ups. If you want to see an agent that achieves new goals zero-shot (although in the same env), check out LEXA
Thanks for having me on for the second time, Robin. Fun chat about DreamerV3 and the future of RL, including unsupervised approaches, hierarchical planning, and how these ideas will help the next generation of embodied agents and LLMs 🤖
Episode 42: @danijarh on the DreamerV3 agent and world models, the Director agent and hierarchical RL, realtime RL on robots with DayDreamer, and his framework for unsupervised agent design!
Only two days since releasing Python Handout to generate reports, as an alternative workflow to Jupyter notebooks. @TDTneuro has already updated their neuro analysis example gallery.
Rendered handouts:
Python scripts to generate them:
@hardmaru
@slashML
It could also be that coming up with these ideas is actually quite easy and implementing them at scale is what takes most of the effort
I made a video to summarize action and perception as divergence minimization!
The framework offers a unified perspective on many of the intrinsic objective functions used in deep RL and also connects them to the free energy principle.
Project DayDreamer applies Dreamer with default hyperparameters to learn on 4 physical robots, without simulators. No new algorithm --- we just added support for multiple input modalities and parallelized data collection and network updates to meet latency requirements
On two robot arms (UR5 and XArm) we learn to pick and place balls from sparse rewards. Dreamer needs to learn to localize the balls from images here. Within 8-10 hours, Dreamer approaches human performance. We found no previous RL method that succeeds here
The world model encodes sequences of sensory inputs, fusing them together into latent representations. It also predicts future representations and rewards given actions, which enables planning. We reconstruct the inputs as a rich learning signal and to allow human inspection
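A toy sketch of that encode → predict → plan loop. Everything here is illustrative (random linear "model", made-up sizes), not the actual Dreamer architecture, but it shows how predicted latents and rewards enable planning without decoding back to pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear world model with random weights (purely illustrative).
enc = rng.normal(size=(16, 8))            # sensory input -> latent
dyn = rng.normal(size=(8 + 2, 8)) * 0.1   # (latent, action) -> next latent
rew = rng.normal(size=(8,))               # latent -> predicted reward

def imagine(latent, actions):
    # Roll the model forward in latent space and sum predicted rewards.
    total = 0.0
    for action in actions:
        latent = np.tanh(np.concatenate([latent, action]) @ dyn)
        total += float(latent @ rew)
    return total

obs = rng.normal(size=16)
latent = np.tanh(obs @ enc)  # fuse sensory input into a representation

# Plan by scoring candidate action sequences inside the model.
candidates = [rng.normal(size=(5, 2)) for _ in range(64)]
best = max(candidates, key=lambda seq: imagine(latent, seq))
print(imagine(latent, best))
```

The real agent replaces the random matrices with learned networks and the brute-force search with an actor critic, but the planning interface is the same: predict forward, score, act.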
For more videos and details, check out the paper and website. We'll also make training curves and code available soon.
Paper PDF:
Project website:
Happy to answer any questions ✨
Thanks a lot @pabbeel, @itfische, @kuanghueilee!
World models have many compelling properties for robot learning, e.g. sample-efficiency and multi-task learning. Recent world model agents like Dreamer learn video games from small amounts of experience. But it's unclear if they allow for fast learning on physical robots
@mattecapu
Infomax. Maximizing mutual information between the agent (parameters, representations, actions, options, etc) and env (past & future sensory inputs) leads to general agents that perform unsupervised representation learning, exploration, and control
Just had my toughest border interview so far on the way to
#ICML2019
. The officer was interested in AI research and wanted a full summary of our PlaNet paper! 😂
Storing knowledge about the user is super important for AI to become more useful. But that'll make it hard to switch models. An independent platform for storing (+editing) preferences etc would be really useful
Excited to finally announce @Letta_AI!
The next frontier in AI is in the stateful layer above the base models - the "memory layer", or "LLM OS".
Letta's mission is to build this layer in the open (say "no" 🙅 to privatized chain of thought).
Cool work on skill discovery!
VIC (left): Skills that are predictable given the end state correspond to moving to different locations.
RVIC (right): Skills that are predictable given start and end state but not given end state alone correspond to moving in different directions.
Happy to share that the preprint for Relative Variational Intrinsic Control, an unsupervised method for learning relative, composable, affordance-like skills, is on arXiv today and will be presented at AAAI in February 2021.
@VladMnih
@Zergylord
@dwf
Check out the paper for details & GitHub for more resources!
- Baseline agents code (Docker)
- Baseline scores (JSON)
- Plotting code
- Human expert trajectories
Paper:
GitHub:
Happy to answer any questions
Dreamer learns behaviors inside the model using an actor critic algorithm. It is trained on latent rollouts without decoding inputs, which allows for large batch sizes of e.g. 16K+ time steps on 1 GPU. As the predictions are purely on-policy, we need no importance correction etc
@Varunufi
@DeepMind
Thanks! Yes, the main point of the algorithm is that it works out of the box on new problems, without needing experts to fiddle with it. So it's a big step towards optimizing real-world processes
Excellent summary of biological concepts that can help AI by @SuryaGanguli:
- Local learning rules
- Temporal processing
- Modularity
- Unsupervised learning
- Curriculum design
- Causal world models
- Energy efficiency
@karpathy
@AnthonyLewayne
It's a bug not a feature. We don't put spaces between compound words. Arbeitsunterbrechungsangst is exactly work_interruption_fear. It's not a common phrase or dictionary word but everybody (who can guess the word boundaries) understands it
Today we're releasing Video Prediction Rewards (VIPER 🐍), a simple yet powerful method for extracting rewards from video prediction models!
VIPER learns reward functions from raw videos, and generalizes to entirely new domains for which no training data is available
🧵 thread
@tyrell_turing
I agree and also don't think a discussion at the political level would be productive given how uncertain even experts are about what the measures to mitigate xrisk should be
@tyrell_turing
I think it's pretty much all representation learning.
More precisely it's all about learning world models.
And the main issue with that is how to represent multimodality in the prediction (because the world is not entirely predictable).
Check how fast an @OpenAI Gym environment runs with 1 line of Python:
python -c "import gym,time;d=10000;e=gym.make('Ant-v1');s=time.time();e.reset();[e.reset() if e.step(e.action_space.sample())[2] else 0 for _ in range(d)];print(d/(time.time()-s),'FPS')"
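Unrolled, the one-liner is just this loop: step until the episode ends, reset, and divide step count by wall-clock time. Sketched here with a stub environment in place of gym so the timing logic runs anywhere (`DummyEnv` is hypothetical; with gym you'd call `env.step(env.action_space.sample())` and check the done flag):

```python
import time

class DummyEnv:
    # Stand-in for a gym environment so the benchmark runs anywhere.
    def reset(self):
        self.t = 0
    def step(self):
        self.t += 1
        return self.t % 100 == 0  # episode "done" every 100 steps

steps = 10000
env = DummyEnv()
env.reset()
start = time.time()
for _ in range(steps):
    if env.step():   # reset whenever an episode terminates
        env.reset()
fps = steps / (time.time() - start)
print(f'{fps:.0f} FPS')
```

Measuring raw environment FPS like this is a quick sanity check before blaming the learning algorithm for slow training.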
@ykilcher
@GoogleAI
@DeepMind
@UofT
@mo_norouzi
Well explained, thanks!
Two clarifications:
- KL balancing (prior vs posterior within the KL) is different from beta VAEs (reconstruction vs KL)
- The vectors of categoricals can in theory represent 32^32 different images so their capacity is quite large
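The capacity claim is easy to verify: with 32 categorical latents of 32 classes each (the DreamerV2 setup), the number of distinct discrete codes is 32^32. A quick numpy sketch, including what one sampled latent looks like:

```python
import numpy as np

# 32 categorical latents with 32 classes each, as in DreamerV2.
num_latents, num_classes = 32, 32

# Number of distinct discrete codes the latent vector can represent.
capacity = num_classes ** num_latents
print(capacity)  # 32**32, roughly 1.46e48 codes

# One sample: a one-hot vector per categorical, stacked into a matrix.
rng = np.random.default_rng(0)
indices = rng.integers(num_classes, size=num_latents)
sample = np.eye(num_classes)[indices]
print(sample.shape)  # (32, 32), each row summing to 1
```

So despite being discrete, the latent space is astronomically larger than the number of images any agent will ever see.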
Distill published a great introduction to Gaussian Processes! Explains the basics well and lets you play with interactive examples to build up intuition 🎚️💡