At ICLR 24 we proposed 𝐃𝐮𝐚𝐥-𝐑𝐋 as a powerful approach to RL, and it is exciting to see the DeepMind project showing that it scales for 𝐋𝐋𝐌 𝐟𝐢𝐧𝐞𝐭𝐮𝐧𝐢𝐧𝐠.
For a thorough understanding of these methods (called ReCOIL and IV-Learn in our work), check out our paper. Link👇
Imitation is the foundation of #LLM training. And it is a #ReinforcementLearning problem!
Compared to supervised learning, RL (here, inverse RL) better exploits sequential structure and online data, and further extracts rewards.
Beyond thrilled for our @GoogleDeepMind paper!
I'll be starting my Ph.D. in Computer Science at @UTCompSci this fall, working with @scottniekum on 🤖 Robot Learning! Looking forward to the move to Austin 🎸!
Grateful and lucky to have found amazing mentors and friends at @SCSatCMU to make this possible.
Nice blog post on getting started in Dual RL, along with a walk-through code example (). Also shows connections to control-as-inference and to DeepMind's paper on LLM fine-tuning with inverse RL.
How can you use offline datasets to imitate when an expert only provides you with observation trajectories (without actions)? E.g., a robot's prior interaction data plus some tutorial videos.
Our #CoRL2024 paper gives a simple and principled off-policy algorithm: DILO!
In our new preprint 📜 (), we show how a number of recent SOTA methods in reinforcement and imitation learning can be unified as RL methods in dual space, using a framework originally proposed by Nachum et al.! J/w @yayitsamyzhang, @scottniekum (1/5)
Our new work, to appear at TMLR, proposes a more general framework for IRL that naturally incorporates both preferences and expert demonstrations! A summary in the 🧵 below: ()
J/w with excellent collaborators @AkankshaSaran, Wonjoon Goo, and @scottniekum.
Looking forward to starting my summer as a research intern @nvidia / @NVIDIAAI! I will be working with @DalalGal and the team on exciting reinforcement learning problems.
How can you efficiently use an H-step lookahead policy to improve performance in deep RL?
A 🧵 introducing our #CoRL2021 paper "Learning Off-Policy with Online Planning" (LOOP), accepted as an oral talk.
j/w @davheld @Wenxuan_Zhou
Paper and Website:
1/n
From Devi Parikh's (@deviparikh) inspiring keynote at #ICLR2024: "My absence wouldn't change the course of the field moving forward, so I walked away. Cruising the wave... wouldn't materialize my drive."
Highly recommended:
I'll be at RLC 2024 @RL_Conference co-hosting the workshop RL Beyond Rewards @RLBRew_2024. We have an amazing lineup of speakers and papers. Consider attending!
Excited to talk to people about unsupervised RL, preference-based RL, and efficient RL algorithm design.
🏆BEST PAPER🏆 of the Behavior-driven Autonomous Driving in Unstructured Environments workshop at @iros2022 #IROS2022 goes to 🥁 "Imitative Planning using Conditional Normalizing Flow".
Congratulations to the authors!
A single unified framework can be used to understand IL and RL algorithms and derive new ones. Curious yet? Come check out our ✨spotlight✨ work on Dual RL that will be presented in the @iclr_conf session:
📅 Thu 9 May, 10:45 a.m. - 12:45 p.m. CEST
Website:
Direct alignment algorithms (DAAs) are fast and have easy-to-tune hyperparameters, but they still suffer from a form of reward overoptimization*. We study this in detail 👇
If you are around at @corl_conf, I'll be presenting this work in the oral session at 4:30pm GMT (11:30am EST, 8:30am PST) today, and the poster at 5:15pm GMT.
#CoRL2021
Goal-conditioned RL has deep connections with distribution matching. This connection, if leveraged properly, can be used to derive performant algorithms for GCRL. Come say hello at ICLR! @iclr_conf
📅 Fri 10 May, 10:45 a.m. - 12:45 p.m. CEST
Link:
Our cross-university collaborative work on "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms" is accepted at @NeurIPSConf!
After the Llama 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms".
TL;DR: we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as in online RLHF. 👇
Origin: Deep RL does not work, we need tricks.
After 2 years: Let's find the theory behind these tricks and improve RL further.
After 5 years: Oh, RL can work if we remove those tricks?
Super excited that CrossQ got accepted at @iclr_conf! 🎉 We show how to effectively use #BatchNorm in #RL, yielding SOTA sample efficiency while staying as computationally efficient as SAC!
This is joint work with @aditya_bhatt 🧵
#ICLR2024
Can we extract reward functions when only an expert's state density/samples are given? Our #CORL_2020 paper derives an analytical gradient to match a general f-divergence!
Equal-contribution work with coauthors T. Ni, Y. Wang, @the_tejus, @rl_agent, B. Eysenbach
Looking for a strong offline GCRL method that also works with image observations? Our work SMoRe, accepted at ICLR 24 (@iclr_conf), combines duality with occupancy matching to give a discriminator-free algorithm.
Our work on a unified approach to imitation learning and learning from preferences was featured on the Microsoft Research Blog! Try this method out () to obtain SOTA results in imitation learning. A collab with @AkankshaSaran, W. Goo, @scottniekum
Learning from expert demonstrations and learning from behavior preferences have often been treated separately in imitation learning. A new IL framework leverages the strengths of each to improve sample efficiency and solve previously unsolvable tasks:
@McaleerStephen My understanding is that preferences can give you a partial ordering even over suboptimal trajectories, helping to better infer an expert's intent and generalize better. Just knowing the expert and imitating would generalize poorly and be full of reward ambiguity.
Check out this method from our lab for a principled approach to GCRL, developed with fundamentals from f-divergence matching.
It also shows the possibilities in designing new RL algos using the core techniques developed in our previous work, f-IRL ().
In domains with sparse rewards, reward shaping is well known to speed up learning by providing a dense learning signal.
We introduce an alternative method, f-Policy Gradients (), to obtain optimal policies through distribution matching. (1/n)
#NeurIPS2023
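To make the distribution-matching idea concrete: in the simplest case the surrogate looks like a policy gradient whose per-state "reward" is a log density ratio between the target distribution and the policy's state visitation. A heavily simplified, hypothetical sketch (f-PG derives the exact gradient for general f-divergences; the plug-in ratio below is only for intuition):

```python
def distribution_matching_pg_loss(logprob_actions, log_ratio):
    """REINFORCE-style surrogate where each visited state is scored by an
    estimate of log p_target(s) / d_pi(s). Illustrative only: f-Policy
    Gradients derives the exact gradient for general f-divergences instead
    of using this plug-in reward.

    logprob_actions: [batch] log pi(a_t | s_t) along sampled trajectories.
    log_ratio:       [batch] estimated log density ratio at those states.
    """
    return -(logprob_actions * log_ratio.detach()).mean()
```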
Check out Haoran's (@ryanxhr) neat extension of the Dual RL work, making it a SOTA method for offline RL! Turns out semi-gradient optimization may not lead to proper convergence 😮
I will attend #ICLR2024 next week, hoping to meet old and new friends in Vienna! 🇦🇹
I will present "ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update" ✨spotlight✨
A simple modification (<20 lines of code) to DICE that makes it work!
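Mechanically, my reading of the orthogonal-gradient idea, as a hypothetical PyTorch sketch (not the authors' code): split the TD-style loss into a forward gradient (through V(s)) and a backward gradient (through V(s')), then keep only the component of the backward gradient orthogonal to the forward one.

```python
import torch

def orthogonal_gradient_step(V, optimizer, s, s_next, reward, gamma=0.99, eta=1.0):
    """One value update where the gradient flowing through V(s') is projected
    to be orthogonal to the gradient flowing through V(s). Sketch only; the
    ODICE paper defines the exact (DICE) objective this is applied to."""
    params = [p for p in V.parameters() if p.requires_grad]
    v = V(s).squeeze(-1)
    td_target = reward + gamma * V(s_next).squeeze(-1)

    # Forward part: gradient through V(s) only (target detached).
    loss_f = 0.5 * ((v - td_target.detach()) ** 2).mean()
    g_f = torch.autograd.grad(loss_f, params, retain_graph=True)

    # Backward part: gradient through V(s') only (V(s) detached).
    loss_b = 0.5 * ((v.detach() - td_target) ** 2).mean()
    g_b = torch.autograd.grad(loss_b, params)

    g_f_flat = torch.cat([g.reshape(-1) for g in g_f])
    g_b_flat = torch.cat([g.reshape(-1) for g in g_b])

    # Keep only the component of the backward gradient orthogonal to g_f.
    proj = (g_b_flat @ g_f_flat) / (g_f_flat @ g_f_flat + 1e-12)
    g_b_orth = g_b_flat - proj * g_f_flat

    total, offset = g_f_flat + eta * g_b_orth, 0
    for p in params:
        n = p.numel()
        p.grad = total[offset:offset + n].reshape(p.shape).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()
```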
Research shows humans think about the effect on future optimality when giving preferences between trajectory segments. It turns out that, under such a model, you can extract soft optimal policies without even doing RL, getting the same behavior for free! Learn more about our work in the 🧵 below.
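Roughly what "extracting the soft optimal policy without RL" can look like in code: a contrastive loss that pushes the policy's scaled log-likelihood of the preferred segment above the dispreferred one. A hedged sketch with assumed shapes and an assumed temperature alpha; see the paper for the exact objective.

```python
import torch.nn.functional as F

def segment_preference_loss(logprobs_pos, logprobs_neg, alpha=0.1):
    """Contrastive segment-preference loss: the preferred segment should get a
    higher sum of alpha * log pi(a_t | s_t) (discounting omitted here).

    logprobs_pos, logprobs_neg: [batch, segment_len] tensors of log pi(a_t|s_t)
    for the preferred and dispreferred segments, respectively.
    """
    score_pos = alpha * logprobs_pos.sum(dim=-1)
    score_neg = alpha * logprobs_neg.sum(dim=-1)
    return -F.logsigmoid(score_pos - score_neg).mean()
```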
📣 Last Call for Papers
The deadline for the ICML 2024 (@icmlconf) Workshop on Models of Human Feedback for AI Alignment is approaching fast: May 31st 🗓️ 11:59 AoE.
We are looking forward to receiving your submissions!
Announcing the Workshop on Models of Human Feedback for AI Alignment at ICML 2024 (@icmlconf)! Check out the website for the call for papers and more details.
Website:
Submission Deadline: May 31, 2024
Unhappy about direct alignment/RLHF (e.g., DPO) being limited to the bandit setting and looking for an alternative for the sequential setting? Come check out our work on Contrastive Preference Learning (CPL), a principled extension, at ICLR @iclr_conf
📅 Wed 8 May, 4:30 - 6:30 p.m.
Google DeepMind presents Grandmaster-Level Chess Without Search.
Paper page:
The largest model reaches a Lichess blitz Elo of 2895 against humans and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search.
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance?
We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x!
[1/6]
A video is now online of our ICML tutorial on Recent Advances in Population-Based Search for Deep Neural Networks: Quality Diversity, Indirect Encodings, and Open-Ended Algorithms. By myself, @joelbot3000, and @kenneth0stanley. We hope you find it valuable!
🥳RLBReW Social🥳 @ RLC (@RL_Conference)
Socials are the 🧡 of any conference. We invite you to join us on the evening of August 9 to discuss wacky RL ideas and find friends and collaborators!
RSVP here:
"New State of the Art AI Optimizer: Rectified Adam (RAdam). Improve your AI accuracy instantly versus Adam, & why it works"
It's been a long time since we've seen a new optimizer reliably beat the old favorites; this looks like a very encouraging approach!
@breadli428 The paper Dave suggested shows this is not really a problem for horizon ~1000 when you are in the trust region. But in general, I think the horizon should inversely affect the size of your trust region to show monotonic improvement.
Creating a Zoo of Atari-Playing Agents to Catalyze the Understanding of Deep Reinforcement Learning. Great work led by Joel Lehman, with many excellent collaborators from Uber AI Labs, OpenAI, & Google Brain.
Blog:...
@SOURADIPCHAKR18 @g_k_swamy @EugeneVinitsky Might just be unaware, but is there a proof or explanation for that? If you are thinking about images, there are other things in play, such as the parametrization of the outputs y and the datasets (n images for 1 class vs. 1 answer for 1 instruction), which might lead to overfitting or mode collapse here.
Oppenheimer makes you realize how close the scientific community was even without modern social media tools. Will the future look back at scientists discussing LLMs on Twitter when talking about AI?
Want to spend 4 -- 12 months building new RL algorithms?
Princeton RL Lab is hiring 1 -- 2 post-bacc/pre-doc RAs.
👇 Apply by filling out the form below:
(We're also recruiting 1 postdoc and grad students. For these positions, see )
Maybe there is a sweet spot in between which combines System-2 + System-1 reasoning, like what we proposed in LOOP (MPC with a terminal value) () and what @nicklashansen2 proposes in TD-MPC (which improves upon similar principles by adding a latent dynamics model).
Indeed, I do favor MPC over RL.
I've been making that point since at least 2016.
RL requires ridiculously large numbers of trials to learn any new task.
In contrast, MPC is zero-shot: if you have a good world model and a good task objective, MPC can solve new tasks without any
@m_wulfmeier @GoogleDeepMind It seems the derivation for the online and reformulated IQ-Learn follows what we derived previously in our Dual RL () work! We also derive the dynamics regularization + energy (MLE) interpretation. Curious about the connections and differences here.
(4/5) You don't need an LLM to show this. The community can save compute and time by testing their algorithms on our simple Tree-MDP (bringing back the age-old practice of toy MDPs in RL). The toy MDP runs in Colab in seconds.
This week at NeurIPS: We show the existence of a Spectrum of Off-Policy Estimators (SOPE) between importance sampling methods and density-based methods such as DICE, enabling a principled bias-variance tradeoff for off-policy evaluation. 1/
@EugeneVinitsky Recently went through this process. I liked reading the original paper initially, but nothing beats trying it yourself in a few lines of code through LlamaIndex, etc., to actually get a feel for what's possible and what is not.
We can't wait to see you all in less than 4 weeks for our workshop. There is more to RL than just optimizing reward functions -- Our stellar list of speakers is ready to present interesting and diverse views on the topic! Come to the workshop and interact with them.
Come check out our work "f-IRL: Inverse RL via State Marginal Matching" at the interactive session @corl_conf.
Interactive Session: 2020-11-16, 11:10 - 11:40 PST
Our paper also provides a number of interesting evaluations for DAAs (some of which were also hypothesized before): the classification accuracy of the implied reward model is poorly correlated with win rate, scaling laws, and much more.
Paper link:
(2/5) DAAs implement a loss that looks like supervised learning. So where is the reward overoptimization coming from? Irrespective of KL, it turns out the loss function DAAs use is heavily underspecified and has potentially ♾ solutions. Check out the paper for details.
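For reference, the kind of loss in question, written out as a hedged DPO-style example on summed sequence log-probs (beta and the reference model are the usual ingredients; this is an illustration, not our exact experimental setup):

```python
import torch.nn.functional as F

def daa_pairwise_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style direct-alignment loss on sequence log-probabilities.
    It looks like supervised classification of the chosen response, yet many
    different implicit rewards/policies minimize it equally well, which is
    the underspecification discussed above."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```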
RLHF has historically operated under the assumption that humans give preferences according to the total reward of segments, but this is not necessarily true --- they also reason about the cumulative suboptimality that the segment will lead to in the future. See Knox et al. ()
2/n In this work, we study the H-step lookahead theoretically and empirically, presenting an instantiation, LOOP, which improves performance in online RL, offline RL, and safe RL.
3/n Naively using H-step lookahead makes value learning computationally difficult, so we propose an efficient framework, LOOP, which learns a value function off-policy and corrects for the actor divergence via Actor Regularized Control (ARC).
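For intuition, a bare-bones random-shooting version of H-step lookahead with a terminal value function (interfaces like dynamics_model, reward_fn, and value_fn are assumed placeholders; LOOP's ARC constraint toward the learned actor is omitted):

```python
import numpy as np

def h_step_lookahead_action(state, dynamics_model, reward_fn, value_fn,
                            action_dim, horizon=5, n_samples=256, gamma=0.99):
    """Return the first action of the best sampled H-step action sequence,
    scoring each sequence by its model rollout return plus the discounted
    terminal value, which bootstraps beyond the planning horizon."""
    best_return, best_action = -np.inf, None
    for _ in range(n_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = state, 0.0
        for h in range(horizon):
            ret += (gamma ** h) * reward_fn(s, actions[h])
            s = dynamics_model(s, actions[h])
        ret += (gamma ** horizon) * value_fn(s)
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action
```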
@g_k_swamy @EugeneVinitsky I don't think I ever understood why you need more sophisticated techniques. Since all we are trying to do here is match conditionals, why is KL/BC insufficient? What advantages does a min/max or IPM-like formulation in SPIN provide? Might just be missing something.
(3/6) DILO's insight? We can leverage tools from duality to directly predict the multi-step deviation from the expert's occupancy in a purely off-policy way. This builds upon our prior work on Dual-RL ().
3/7 We propose that IRL can be viewed as a 2-player ranking game between a reward player and a policy player. The reward player's job is to satisfy the rankings in the dataset, and the policy player maximizes this reward function. Expert demonstrations are simply preferred more than any other behavior.
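Concretely, the reward player's update can be as simple as a pairwise ranking loss over predicted trajectory returns (a hedged Bradley-Terry-style sketch; the paper's ranking-loss families are more general):

```python
import torch.nn.functional as F

def reward_ranking_loss(returns_preferred, returns_dispreferred):
    """Push the learned reward's predicted return of preferred trajectories
    above dispreferred ones. Expert demonstrations enter simply as
    trajectories preferred over all other behavior; the policy player then
    runs RL against the resulting reward.

    returns_*: [batch] predicted returns R_theta(tau) = sum_t r_theta(s_t, a_t).
    """
    return -F.logsigmoid(returns_preferred - returns_dispreferred).mean()
```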
If you’re in Amherst, MA for RLC 2024, come to the RLHF poster session where I’ll be presenting “Learning Action-based Representations using Invariance”
We show that you can bootstrap myopic state representations to capture features relevant to long-horizon control!
Link below!