Harshit Sikchi

@harshit_sikchi

882
Followers
1,043
Following
24
Media
292
Statuses

I study Reinforcement Learning. Currently at FAIR @AIatMeta; PhD @UTCompSci. Previously @CMU_Robotics. Former Research Intern @NVIDIAAI @UberATG

Austin, TX
Joined July 2018
Pinned Tweet
@harshit_sikchi
Harshit Sikchi
1 month
At ICLR 24 we proposed 𝐃𝐮𝐚𝐥-𝐑𝐋 as a powerful approach to RL, and it is exciting to see the DeepMind project showing that it scales for 𝐋𝐋𝐌 𝐟𝐢𝐧𝐞𝐭𝐮𝐧𝐢𝐧𝐠. For a thorough understanding of these methods (called ReCOIL and IV-Learn in our work), check out our paper. Link👇
@m_wulfmeier
Markus Wulfmeier
2 months
Imitation is the foundation of #LLM training. And it is a #ReinforcementLearning problem! Compared to supervised learning, RL -here inverse RL- better exploits sequential structure, online data and further extracts rewards. Beyond thrilled for our @GoogleDeepMind paper! A
Tweet media one
Tweet media two
11
64
357
1
0
9
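For context, here is a hedged sketch (my paraphrase, not the paper's exact notation or constants) of the occupancy-measure linear program behind Dual-RL and its Lagrangian dual, which is the general template this family of methods builds on. Here d is a state-action occupancy, d^O an off-policy data distribution, and D_f an f-divergence with convex conjugate f*:

\[
\max_{d \ge 0}\ \mathbb{E}_{(s,a)\sim d}[r(s,a)] - \alpha\, D_f(d \,\|\, d^O)
\quad \text{s.t.}\quad \sum_a d(s,a) = (1-\gamma)\, d_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \ \ \forall s,
\]
\[
\min_V\ (1-\gamma)\, \mathbb{E}_{s \sim d_0}[V(s)] + \alpha\, \mathbb{E}_{(s,a)\sim d^O}\!\left[ f^*\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s'}[V(s')] - V(s)}{\alpha} \right) \right],
\]

up to how the nonnegativity of d is handled. Imitation variants replace r with a quantity recovered from expert data; see the paper for the exact forms used by ReCOIL and IV-Learn.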
@harshit_sikchi
Harshit Sikchi
3 months
Maybe this alone made RLC worth it 🥹
Tweet media one
14
9
531
@harshit_sikchi
Harshit Sikchi
4 years
I'll be starting my Ph.D. in Computer Science at @UTCompSci this fall, working with @scottniekum on 🤖 Robot Learning! Looking forward to the move to Austin🎸! Grateful and lucky to have found amazing mentors and friends at @SCSatCMU to make this possible.
21
0
114
@harshit_sikchi
Harshit Sikchi
23 days
Nice blog post on getting started in Dual RL, along with walk-through example code (). It also shows connections to control-as-inference and to DeepMind's paper on LLM fine-tuning with inverse RL.
@_RanW_
RanW
24 days
Been studying a little dual reinforcement learning with @harshit_sikchi 's help. Hope people find it useful.
Tweet media one
2
5
25
0
12
95
@harshit_sikchi
Harshit Sikchi
2 months
How can you use offline datasets to imitate when an expert only provides you with observation trajectories (without actions)? Ex: robot's data of prior interaction + some tutorial videos. Our #CoRL2024 paper gives a simple and principled off-policy algorithm - DILO!
2
8
72
@harshit_sikchi
Harshit Sikchi
3 months
Tweet media one
0
2
70
@harshit_sikchi
Harshit Sikchi
10 months
Dual RL has been accepted as a spotlight (top 5%) in ICLR 24! Detailed post soon
@harshit_sikchi
Harshit Sikchi
2 years
In our new preprint 📜(), we show how a number of recent SOTA methods in Reinforcement and Imitation Learning can be unified as RL methods in dual space, utilizing a framework originally proposed by Nachum et al! J/w @yayitsamyzhang , @scottniekum (1/5)
1
6
65
3
6
67
@harshit_sikchi
Harshit Sikchi
2 years
In our new preprint 📜(), we show how a number of recent SOTA methods in Reinforcement and Imitation Learning can be unified as RL methods in dual space, utilizing a framework originally proposed by Nachum et al! J/w @yayitsamyzhang , @scottniekum (1/5)
1
6
65
@harshit_sikchi
Harshit Sikchi
2 years
Our new work, to appear at TMLR, proposes a more general framework for IRL that naturally incorporates both preferences and expert demonstrations! A summary in 🧵 below - () J/w excellent collaborators @AkankshaSaran , Wonjoon Goo, and @scottniekum .
3
5
48
@harshit_sikchi
Harshit Sikchi
2 years
Looking forward to starting my summer as a research intern @nvidia / @NVIDIAAI ! I will be working with @DalalGal and the team on exciting reinforcement learning problems.
4
1
41
@harshit_sikchi
Harshit Sikchi
3 years
How can you efficiently use an H-step lookahead policy to improve performance in Deep RL? A 🧵 introducing our #CoRL2021 paper "Learning Off-Policy with Online Planning" (LOOP), accepted as an Oral Talk. j/w @davheld @Wenxuan_Zhou Paper and Website: 1/n
Tweet media one
1
6
34
@harshit_sikchi
Harshit Sikchi
6 months
From Devi Parikh's ( @deviparikh ) inspiring keynote at #ICLR2024 - "My absence wouldn't change the course of the field moving forward, so I walked away. Cruising the wave...wouldn't materialize my drive". Highly recommended:
1
1
32
@harshit_sikchi
Harshit Sikchi
3 years
Well, in-person conferences are definitely more fun! #CoRL2021
Tweet media one
2
0
30
@harshit_sikchi
Harshit Sikchi
3 months
I'll be at RLC 2024 @RL_Conference co-hosting the workshop RL Beyond Rewards @RLBRew_2024 . We have an amazing lineup of speakers and papers. Consider attending! Excited to talk to people about unsupervised RL, preference-based RL, and efficient RL algorithm design.
0
1
30
@harshit_sikchi
Harshit Sikchi
2 years
Happy that work from my previous internship at UberATG was recognized here! With Shubhankar, Cole, Eric, and Shivam.
@BADUE22
BADUE'22
2 years
🏆BEST PAPER🏆 of the Behavior-driven Autonomous Driving in Unstructured Environments at @iros2022 #IROS2022 goes to🥁 "Imitative Planning using Conditional Normalizing Flow" Congratulations to the authors !
1
3
13
0
1
28
@harshit_sikchi
Harshit Sikchi
6 months
A single unified framework can be used to understand IL and RL algorithms and derive new ones. Curious yet? Come check out our ✨spotlight✨ work on Dual RL that will be presented in @iclr_conf session: 📅 Thu 9 May 10:45 - 12:45 p.m. CEST Website:
1
2
27
@harshit_sikchi
Harshit Sikchi
4 years
Successfully completed my master's thesis at CMU ✅ Had a wonderful experience supervised by @davheld working with @Wenxuan_Zhou at @CMU_Robotics !
@CSDatCMU
CMU Computer Science Department
4 years
Congrats @HariSikchi on successfully defending your Master’s Thesis! 🎉👊"Striving for Safety in Deep Reinforcement Learning"
2
1
11
6
1
28
@harshit_sikchi
Harshit Sikchi
3 months
A. Barto's: Don't be a cult. Too late?
@harshit_sikchi
Harshit Sikchi
3 months
Tweet media one
0
2
70
0
0
26
@harshit_sikchi
Harshit Sikchi
3 months
Direct alignment algorithms (DAAs) are fast and have easy-to-tune hyperparameters, but they still suffer from a form of reward overoptimization*. We study this in detail 👇
Tweet media one
1
4
24
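For reference, the canonical DAA is DPO; its standard objective (from the DPO paper, not something introduced in this thread) is

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].
\]

Although this looks like a supervised classification loss, the implicit reward \(\beta \log \pi_\theta/\pi_{\mathrm{ref}}\) can still be over-optimized, which is the phenomenon the tweet above refers to.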
@harshit_sikchi
Harshit Sikchi
3 years
If you are around at @corl_conf , I'll be presenting this work in the Oral session at 4:30pm GMT (11:30am EST, 8:30am PST) today and the poster at 5:15pm GMT. #CoRL2021
@corl_conf
Conference on Robot Learning
3 years
Congratulations to #CoRL2021 best paper finalist, "Learning Off-Policy with Online Planning", Harshit Sikchi, Wenxuan Zhou, David Held. #robotics #learning #award #research
Tweet media one
7
25
138
0
1
23
@harshit_sikchi
Harshit Sikchi
6 months
Goal-conditioned RL has deep connections with distribution matching. This connection, if leveraged properly, can be used to derive performant algorithms for GCRL. Come say hello at ICLR! @iclr_conf 📅 Fri 10 May 10:45 - 12:45 p.m. CEST Link:
1
5
21
@harshit_sikchi
Harshit Sikchi
1 month
Our cross-university collaborative work on "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms" has been accepted at @NeurIPSConf !
@rm_rafailov
Rafael Rafailov
3 months
After the LLaMa 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". TL;DR we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as in online RLHF.👇
Tweet media one
2
47
255
0
4
20
@harshit_sikchi
Harshit Sikchi
6 months
Presenting Dual RL at ICLR tomorrow from 10:45 a.m. to 12:45 p.m. CEST at Poster #169 !
@harshit_sikchi
Harshit Sikchi
6 months
A single unified framework can be used to understand IL and RL algorithms and derive new ones. Curious yet? Come check out our ✨spotlight✨ work on Dual RL that will be presented in @iclr_conf session: 📅 Thu 9 May 10:45 - 12:45 p.m. CEST Website:
1
2
27
0
0
16
@harshit_sikchi
Harshit Sikchi
10 months
Origin: Deep RL does not work, we need tricks. After 2 years: Let's find the theory behind these tricks and improve RL further. After 5 years: Oh, RL can work if we remove those tricks?
@DPalenicek
Daniel Palenicek
10 months
Super excited that CrossQ got accepted at @iclr_conf ! 🎉 We show how to effectively use #BatchNorm in #RL , yielding SOTA sample efficiency while staying as computationally efficient as SAC! This is joint work with @aditya_bhatt 🧵 #ICLR2024
Tweet media one
5
12
67
0
0
13
@harshit_sikchi
Harshit Sikchi
4 years
Can we extract reward functions when only expert state density/samples are given? Our #CORL_2020 paper derives an analytical gradient to match general f-divergence! Equal contribution work with coauthors T. Ni, Y. Wang, @the_tejus , @rl_agent , B. Eysenbach
1
2
14
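Roughly, the objective the tweet describes (my paraphrase; see the paper for the exact analytic gradient) is state-marginal matching under an f-divergence, with the reward parameters \(\theta\) as the learned quantity:

\[
\min_\theta\ D_f\!\left( \rho_E(s) \,\|\, \rho_{\pi_{r_\theta}}(s) \right),
\]

where \(\rho_E\) is the expert state marginal (given as a density or samples) and \(\pi_{r_\theta}\) is the (soft-)optimal policy for reward \(r_\theta\); f-IRL's contribution is an analytic gradient of this objective with respect to \(\theta\).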
@harshit_sikchi
Harshit Sikchi
10 months
Looking for a strong offline GCRL method that also works with image observations? Our work SMoRe, accepted at ICLR 24 ( @iclr_conf ), combines duality with occupancy matching to give a discriminator-free algorithm.
1
0
13
@harshit_sikchi
Harshit Sikchi
2 years
Our work on a unified approach to imitation learning and learning from preferences was featured on the Microsoft Research Blog! Try this method out () to obtain SOTA results in imitation learning. A collab with @AkankshaSaran , W. Goo, @scottniekum
@MSFTResearch
Microsoft Research
2 years
Learning from expert demonstrations and learning from behavior preferences have often been treated separately in imitation learning. A new IL framework leverages the strengths of each to improve sample efficiency and solve previously unsolvable tasks:
1
2
27
0
3
13
@harshit_sikchi
Harshit Sikchi
10 months
@McaleerStephen My understanding is that preferences can give you a partial ordering even over suboptimal trajectories, helping to better infer an expert's intent and generalize better. Just knowing the expert and imitating would generalize poorly and be full of reward ambiguity.
3
0
10
@harshit_sikchi
Harshit Sikchi
1 year
Check out this method from our lab for a principled approach to GCRL, developed with fundamentals from f-divergence matching. It also shows the possibilities of designing new RL algos using the core techniques developed in our previous work, f-IRL ()
@agsidd10
Siddhant Agarwal
1 year
In domains with sparse rewards, reward shaping is well known to speed up learning by providing a dense learning signal. We introduce an alternate method, f-Policy Gradients (), to obtain optimal policies through distribution matching. (1/n) #NeurIPS2023
3
19
111
0
0
9
@harshit_sikchi
Harshit Sikchi
6 months
Check out Haoran's ( @ryanxhr ) neat extension of the Dual RL work, making it a SOTA method for offline RL! Turns out semi-gradient optimization may not lead to proper convergence 😮
@ryanxhr
Haoran Xu
6 months
I will attend #ICLR2024 next week, hoping to meet old and new friends in Vienna!🇦🇹 I will present "ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update" ✨spotlight✨ A simple modification (<20 line code) to DICE that makes it work!
Tweet media one
4
2
40
0
0
9
@harshit_sikchi
Harshit Sikchi
1 year
Research shows humans think about the effect on future optimality when giving preferences between trajectory segments. Turns out, under such a model, you can extract soft-optimal policies without even doing RL, getting the same behavior for free! Learn more about our work in this 🧵 below.
@JoeyHejna
Joey Hejna
1 year
Excited to announce Contrastive Preference Learning (CPL), a simple RL-free method for RLHF that works with arbitrary MDPs and off-policy data. arXiv: With @rm_rafailov @harshit_sikchi @chelseabfinn @scottniekum W. Brad Knox @DorsaSadigh A thread🧵👇
1
27
201
1
0
8
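A rough sketch of the idea (my notation; the details are in the CPL paper quoted above): model preferences over segments \(\sigma\) with the discounted sum of optimal advantages,

\[
P(\sigma^+ \succ \sigma^-) = \frac{\exp \sum_t \gamma^t A^*(s_t^+, a_t^+)}{\exp \sum_t \gamma^t A^*(s_t^+, a_t^+) + \exp \sum_t \gamma^t A^*(s_t^-, a_t^-)},
\]

and use the maximum-entropy identity \(A^*(s,a) = \alpha \log \pi^*(a \mid s)\) to substitute policy log-probabilities for advantages. Preference learning then becomes a supervised contrastive objective directly over \(\log \pi\), with no reward model and no RL step.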
@harshit_sikchi
Harshit Sikchi
11 months
Traveling to #NeurIPS2023 next week. Hit me up to chat about anything!
0
0
8
@harshit_sikchi
Harshit Sikchi
6 months
Want to discuss RL/Preference Learning/Imitation Learning? Say hi at ICLR @iclr_conf ! Please give me your food recommendations :D
2
0
8
@harshit_sikchi
Harshit Sikchi
5 years
Congratulations to @AGV_IITKGP from @IITKgp for achieving second position at the 27th IGVC competition.
Tweet media one
Tweet media two
0
1
7
@harshit_sikchi
Harshit Sikchi
5 months
Submit your ideas and work on AI alignment soon! The workshop venue ( Vienna 😍 @icmlconf ) should be a big plus
@mhf_icml2024
MHFAIA@ICML2024
5 months
📣 Last Call for Papers Deadline to the ICML 2024 @icmlconf Workshop on Models of Human Feedback for AI Alignment is approaching fast at May 31st 🗓️ 11:59 AoE. We are looking forward to receiving your submissions!
0
1
3
0
0
6
@harshit_sikchi
Harshit Sikchi
6 months
Presenting SMoRe at ICLR tomorrow (10 May) from 10:45 a.m. to 12:45 p.m. CEST at Poster #142 ! #ICLR2024
@harshit_sikchi
Harshit Sikchi
6 months
Goal-conditioned RL has deep connections with distribution matching. This connection if leveraged properly can be used to derive performant algorithms for GCRL. Come say hello at ICLR! @iclr_conf 📅 Fri 10 May 10:45 - 12:45 p.m. CEST Link:
1
5
21
0
1
6
@harshit_sikchi
Harshit Sikchi
3 months
Lucky to attend the Alignment workshop in Vienna! This might be near the gold standard of what workshops at ML conferences should aspire to be.
Tweet media one
Tweet media two
0
0
6
@harshit_sikchi
Harshit Sikchi
28 days
100% recommend seeing this one. Take some time to reflect on the past that led to this moment in RL!
@RL_Conference
RL_Conference
28 days
"In the Beginning, ML was RL". Andrew Barto gave RLC 2024 an amazing overview of the intertwined history of ML and RL (Link below)
Tweet media one
5
70
412
0
0
5
@harshit_sikchi
Harshit Sikchi
6 months
Working on different aspects of AI Alignment? Consider submitting to this workshop at ICML 2024 to be held at Vienna, Austria!
@mhf_icml2024
MHFAIA@ICML2024
6 months
Announcing the Workshop on Models of Human Feedback for AI Alignment at ICML 2024 ( @icmlconf )! Check out the website for the call for papers and more details. Website: Submission Deadline: May 31, 2024
1
5
9
0
1
5
@harshit_sikchi
Harshit Sikchi
6 months
Unhappy about direct alignment/RLHF (e.g., DPO) being limited to the bandit setting and looking for an alternative for the sequential setting? Come check out our work on Contrastive Preference Learning (CPL), a principled extension, at ICLR @iclr_conf 📅 Wed 8 May 4:30 to 6:30 p.m.
1
0
5
@harshit_sikchi
Harshit Sikchi
6 years
Wow.
@hardmaru
hardmaru
6 years
If you want some life inspiration, here is Yuri Orlov's resumé:
Tweet media one
4
78
399
0
0
5
@harshit_sikchi
Harshit Sikchi
3 months
Come check out this work at the @RLBRew_2024 workshop at @RL_Conference today.
@rm_rafailov
Rafael Rafailov
3 months
After the LLaMa 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". TL;DR we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as in online RLHF.👇
Tweet media one
2
47
255
0
1
5
@harshit_sikchi
Harshit Sikchi
9 months
Distilling an oracle* chess solver that uses search, via a large dataset, counts as being free of complex heuristics or search?
@_akhaliq
AK
9 months
Google Deepmind presents Grandmaster-Level Chess Without Search paper page: largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit
Tweet media one
38
275
1K
0
0
4
@harshit_sikchi
Harshit Sikchi
3 months
See @rm_rafailov 's thread for a summary too!
@rm_rafailov
Rafael Rafailov
3 months
After the LLaMa 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms". TL;DR we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as in online RLHF.👇
Tweet media one
2
47
255
0
0
4
@harshit_sikchi
Harshit Sikchi
1 year
Has it become common practice to mean behavior cloning when saying imitation learning? 😅 Interesting work btw!
@JensTuyls
Jens Tuyls
1 year
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance? We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x! [1/6]
Tweet media one
2
19
107
1
0
4
@harshit_sikchi
Harshit Sikchi
5 years
Finally, a proper discussion of the latest population-based methods.
@jeffclune
Jeff Clune
5 years
A video is now online of our ICML Tutorial on Recent Advances in Population-Based Search for Deep Neural Networks: Quality Diversity, Indirect Encodings, and Open-Ended Algorithms. By myself, @joelbot3000 and @kenneth0stanley We hope you find it valuable!
2
25
71
0
0
4
@harshit_sikchi
Harshit Sikchi
1 year
Stanford's attempt to make peer reviewing obsolete 🥲
Tweet media one
0
0
4
@harshit_sikchi
Harshit Sikchi
1 year
@EugeneVinitsky no such thing as too much RL
0
0
4
@harshit_sikchi
Harshit Sikchi
3 months
Such a good way to start the conference - you already know the people attending because you met them here :P
@RLBRew_2024
RL Beyond Rewards Workshop
3 months
🥳RLBReW Social🥳 @ RLC ( @RL_Conference ) Socials are 🧡 of any conference. We invite you to join us on the evening of August 9 to discuss wacky RL ideas, and find friends and collaborators! RSVP here:
0
7
17
0
0
3
@harshit_sikchi
Harshit Sikchi
5 years
Something new to look forward to
@jeremyphoward
Jeremy Howard
5 years
"New State of the Art AI Optimizer: Rectified Adam (RAdam). Improve your AI accuracy instantly versus Adam, & why it works" It's been a long time since we've seen a new optimizer reliably beat the old favorites; this looks like a very encouraging approach!
19
830
2K
0
0
3
@harshit_sikchi
Harshit Sikchi
5 years
While learning "Go" you see this "Go has only one looping construct, the for loop." and think why the hell C had so many loop constructs.
0
0
3
@harshit_sikchi
Harshit Sikchi
1 year
@Aaroth @patrickmesana Very much looking forward to lecture recordings!
0
0
3
@harshit_sikchi
Harshit Sikchi
5 years
First team to qualify | IGVC Autonav challenge 😍
@RoboNationInc
RoboNation
5 years
@rishabhmadan96 @IITKanpur @IITKgp Yes, you are correct! Here's an updated list of qualified #IGVC teams:
Tweet media one
0
3
4
1
0
3
@harshit_sikchi
Harshit Sikchi
9 months
@breadli428 The paper Dave suggested shows this is not really a problem for horizon ~1000, when you are in the trust region. But in general, I think the horizon should inversely affect the size of your trust region to show monotonic improvement
0
0
3
@harshit_sikchi
Harshit Sikchi
6 years
A really useful work with cool tools
@EvolvingAI
Evolving AI Lab
6 years
Creating a Zoo of Atari-Playing Agents to Catalyze the Understanding of Deep Reinforcement Learning. Great work led by Joel Lehman, with many excellent collaborators from Uber AI Labs, OpenAI, & Google Brain. Blog:...
0
0
3
0
0
3
@harshit_sikchi
Harshit Sikchi
1 year
@viddivj You are here too! Let's catch up sometime.
1
0
3
@harshit_sikchi
Harshit Sikchi
7 months
@SOURADIPCHAKR18 @g_k_swamy @EugeneVinitsky I might just be unaware; is there a proof or explanation for that? If you are thinking about images, there are other things in play such as the parametrization of the outputs y and the datasets (n images for 1 class vs. 1 answer for 1 instruction), which might lead to overfitting or mode collapse here.
0
0
1
@harshit_sikchi
Harshit Sikchi
11 months
@sohamdesh_ @ada_rob @jesseengel Lol, didn't see you here the whole time yet.
1
0
1
@harshit_sikchi
Harshit Sikchi
1 year
Oppenheimer makes you realize how close the scientific community was even without modern social media tools. Will the future look back at scientists discussing LLMs on Twitter when talking about AI?
1
0
2
@harshit_sikchi
Harshit Sikchi
5 months
Don't miss this opportunity to work with Ben. His passion and research style were key ingredients in driving me to work in RL!
@ben_eysenbach
Ben Eysenbach
5 months
Want to spend 4 -- 12 months building new RL algorithms? Princeton RL Lab is hiring 1 -- 2 post-bach/pre-doc RAs 👇Apply by filling out the form below: (We're also recruiting 1 postdoc and grad students. For these positions, see )
Tweet media one
3
21
138
0
0
3
@harshit_sikchi
Harshit Sikchi
2 months
Maybe there is a sweet spot in between that combines System-2 + System-1 reasoning, like what we proposed in LOOP (MPC with a terminal value) () and what @nicklashansen2 proposes in TD-MPC (which improves upon similar principles by adding a latent dynamics model).
@ylecun
Yann LeCun
2 months
Indeed, I do favor MPC over RL. I've been making that point since at least 2016. RL requires ridiculously large numbers of trials to learn any new task. In contrast MPC is zero shot: If you have a good world model and a good task objective, MPC can solve new tasks without any
91
118
997
0
1
2
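Schematically (my summary; LOOP and TD-MPC differ in how the model, value function, and proposal policy are learned), the System-2 + System-1 combination is H-step model-predictive planning with a learned terminal value:

\[
a_{0:H-1}^* = \arg\max_{a_{0:H-1}}\ \mathbb{E}\!\left[ \sum_{t=0}^{H-1} \gamma^t r(s_t, a_t) + \gamma^H \hat{V}(s_H) \right],
\]

where the short-horizon search plays the deliberate System-2 role and the learned value \(\hat{V}\) amortizes the long tail (System-1).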
@harshit_sikchi
Harshit Sikchi
1 year
@satnam6502 Rooh was nice in Bay, Rooh Chicago even better!
0
0
2
@harshit_sikchi
Harshit Sikchi
9 months
@EugeneVinitsky Also RAG I think is a great place for RL
1
0
2
@harshit_sikchi
Harshit Sikchi
2 months
@m_wulfmeier @GoogleDeepMind It seems the derivation for the online and reformulated IQ-Learn follows what we derived previously in our Dual RL () work! We also derive the dynamics regularization + energy (MLE) interpretation. Curious about the connections and differences here.
0
0
2
@harshit_sikchi
Harshit Sikchi
3 years
@rl_agent Congratulations Lisa!!
0
0
2
@harshit_sikchi
Harshit Sikchi
1 year
@kevin_zakka If only RL/robotics people started caring about citations...
0
0
0
@harshit_sikchi
Harshit Sikchi
3 months
(4/5) You don't need an LLM to show this. The community can save compute and time by testing their algorithms on our simple Tree-MDP (bringing back the age-old practice of toy MDPs in RL). The toy MDP runs in Colab in seconds.
Tweet media one
1
0
2
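A minimal, hypothetical sketch of what such a toy tree MDP could look like (the class name and structure are mine, not the paper's Colab):

```python
import itertools


class TreeMDP:
    """Depth-D binary tree: two actions per state, each trajectory is a root-to-leaf path."""

    def __init__(self, depth: int):
        self.depth = depth

    def trajectories(self):
        # Every action sequence of length `depth` identifies a distinct leaf/trajectory.
        return list(itertools.product([0, 1], repeat=self.depth))


# With so few states, policies, implied rewards, and preference datasets can be
# enumerated exactly, so over-optimization effects are visible in seconds.
mdp = TreeMDP(depth=4)
print(len(mdp.trajectories()))  # 16 trajectories
```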
@harshit_sikchi
Harshit Sikchi
3 years
Amazing work from labmate @christinajyuan . Especially enjoyed the simplicity of the nice idea and the way it is presented!
@scottniekum
Scott Niekum
3 years
This week at NeurIPS: We show the existence of a Spectrum of Off-Policy Estimators (SOPE) between importance sampling methods and density-based methods such as DICE, enabling a principled bias-variance tradeoff for off-policy evaluation. 1/
1
0
20
0
0
2
@harshit_sikchi
Harshit Sikchi
9 months
@EugeneVinitsky Recently went through this process. I liked reading the original paper initially, but nothing beats trying it yourself in a few lines of code through llamaindex, etc., to actually get a feel for what's possible and what is not.
1
0
1
@harshit_sikchi
Harshit Sikchi
2 years
(5/5) Our imitation experiments on offline datasets show that ReCOIL performs well with mixed-quality datasets, even with low expert coverage!
0
0
2
@harshit_sikchi
Harshit Sikchi
4 months
RLBRew is all set to open RLC @RL_Conference with a bang. Hoping to see you all there 🤩
@RLBRew_2024
RL Beyond Rewards Workshop
4 months
We can't wait to see you all in less than 4 weeks for our workshop. There is more to RL than just optimizing reward functions -- Our stellar list of speakers is ready to present interesting and diverse views on the topic! Come to the workshop and interact with them.
Tweet media one
0
1
10
0
0
2
@harshit_sikchi
Harshit Sikchi
5 months
@McaleerStephen Congrats!!🎉
0
0
2
@harshit_sikchi
Harshit Sikchi
4 years
Come check out our work f-IRL: Inverse RL via State Marginal Matching at the interactive session @corl_conf . Interactive Session 2020-11-16, 11:10 - 11:40 PST
0
0
2
@harshit_sikchi
Harshit Sikchi
3 months
Our paper also provides a number of interesting evaluations for DAAs (some of which were also hypothesized before): the classification accuracy of the implied reward model is poorly correlated with win rate, scaling laws, and much more. Paper link:
1
0
2
@harshit_sikchi
Harshit Sikchi
3 years
With all these amazing people and venues😄
Tweet media one
Tweet media two
0
0
2
@harshit_sikchi
Harshit Sikchi
3 months
(2/5) DAAs implement a loss that looks like supervised learning. So where is the reward overoptimization coming from? Irrespective of KL, it turns out the loss function DAAs use is heavily underspecified and has potentially ♾ solutions. Check out the paper for details.
1
0
1
@harshit_sikchi
Harshit Sikchi
1 year
0
0
0
@harshit_sikchi
Harshit Sikchi
1 year
RLHF has historically operated under the assumption that humans give preferences according to the total reward of segments, but this is not necessarily true: they also reason about the cumulative suboptimality that the segment will lead to in the future. See Knox et al. ()
1
0
1
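Concretely (my paraphrase of the two models contrasted in Knox et al.), the common partial-return assumption scores a segment by the reward it contains,

\[
P(\tau^1 \succ \tau^2) = \sigma\!\left( \sum_t r(s_t^1, a_t^1) - \sum_t r(s_t^2, a_t^2) \right),
\]

while the regret-based model replaces each sum of rewards with a sum of optimal advantages \(\sum_t A^*(s_t, a_t)\), so a preference also reflects how much future return the segment's choices give up, not just the reward collected inside it. This is the same regret-based model sketched earlier in the CPL context.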
@harshit_sikchi
Harshit Sikchi
1 year
@sid09_singh @NVIDIAAI In the bay area too, lets catch up
0
0
0
@harshit_sikchi
Harshit Sikchi
3 years
2/n In this work, we study the H-step lookahead theoretically and empirically, presenting an instantiation ‘LOOP’ which improves performance in Online RL, Offline RL, and Safe RL.
Tweet media one
1
0
1
@harshit_sikchi
Harshit Sikchi
3 years
3/n Naively using H-step lookahead makes value learning computationally difficult, so we propose an efficient framework "LOOP" which learns a value function off-policy and corrects for the actor-divergence via Actor Regularized Control (ARC).
1
0
1
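A rough, hypothetical sketch of the H-step lookahead loop described in this thread (not the paper's code; `model`, `policy`, and `value_fn` are assumed callables, and random shooting around the actor stands in for the full ARC procedure):

```python
import numpy as np

def lookahead_plan(state, model, policy, value_fn,
                   horizon=3, n_samples=64, gamma=0.99, noise=0.1):
    """Score sampled action sequences with an H-step model rollout plus a terminal
    value estimate, and return the first action of the best sequence."""
    best_action, best_score = None, -np.inf
    for _ in range(n_samples):
        s, ret, discount, first_action = state, 0.0, 1.0, None
        for t in range(horizon):
            # Sample near the learned actor's proposal (the actor-regularized idea);
            # scalar actions assumed for simplicity.
            a = policy(s) + noise * np.random.randn()
            if t == 0:
                first_action = a
            s, r = model(s, a)            # learned dynamics and reward
            ret += discount * r
            discount *= gamma
        ret += discount * value_fn(s)      # terminal value closes the horizon
        if ret > best_score:
            best_score, best_action = ret, first_action
    return best_action
```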
@harshit_sikchi
Harshit Sikchi
4 years
@JohnCLangford I wish there were positions for masters students as well :(
0
0
1
@harshit_sikchi
Harshit Sikchi
7 months
@g_k_swamy @EugeneVinitsky I don't think I ever understood why you need more sophisticated techniques. Since all we are trying to do here is match conditionals, why is KL/BC insufficient? What advantage does a min/max or IPM-like formulation in SPIN provide? Might just be missing something.
2
0
0
@harshit_sikchi
Harshit Sikchi
2 months
(3/6) DILO's insight? We can leverage tools from duality to directly predict multi-step deviation from the expert's occupancy in a purely off-policy way. This builds upon our prior work on Dual-RL ()
1
0
1
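In rough terms (my paraphrase of the setting; DILO's exact objective and derivation are in the paper), imitation from observation can be posed as matching state-transition occupancies, which never requires expert actions:

\[
\min_\pi\ D_f\!\left( \rho^\pi(s, s') \,\|\, \rho^E(s, s') \right),
\]

where \(\rho(s, s')\) is the discounted visitation distribution over consecutive state pairs; duality is what allows this to be optimized purely off-policy from arbitrary transition datasets.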
@harshit_sikchi
Harshit Sikchi
3 months
@ShivaamVats Congratulations Shivam!!🎉🎉🤩
1
0
1
@harshit_sikchi
Harshit Sikchi
3 years
@rudrasohan @GoogleAI @GoogleIndia @fooobar Congratulations!! Amazing news🎉
1
0
1
@harshit_sikchi
Harshit Sikchi
1 year
Threads is on the Play Store now. Whether it provides enough incentive for academic Twitter to move there is the question.
1
0
1
@harshit_sikchi
Harshit Sikchi
2 years
@mjethalia6 @therealweissman Good thing he's making it 'Hyderabadi'😂
0
0
1
@harshit_sikchi
Harshit Sikchi
4 months
@farairesearch The link seems to be down
0
0
1
@harshit_sikchi
Harshit Sikchi
2 years
3/7 We propose that IRL can be viewed as a 2-player ranking game between a reward player and a policy player. The reward player's job is to satisfy the rankings in the dataset, and the policy player maximizes this reward function. Expert demonstrations are simply preferred more than any other behavior.
Tweet media one
1
0
1
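Schematically (my summary; the paper's exact ranking loss and solution concept differ in details), the two players are

\[
\text{policy player:}\quad \max_\pi\ \mathbb{E}_{(s,a)\sim \rho^\pi}[r(s,a)], \qquad
\text{reward player:}\quad \min_r\ \mathcal{L}_{\mathrm{rank}}(r; \mathcal{D}),
\]

where \(\mathcal{D}\) contains pairwise rankings between behaviors (with the expert ranked above everything the agent produces) and \(\mathcal{L}_{\mathrm{rank}}\) penalizes a reward that violates them, e.g. by requiring \(\mathbb{E}_{\rho^j}[r] \ge \mathbb{E}_{\rho^i}[r]\) for each ranked pair \(\rho^i \preceq \rho^j\).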
@harshit_sikchi
Harshit Sikchi
3 months
Really cool and simple method for learning controllable representations in RL!
@maxbrudolph
Max Rudolph
3 months
If you’re in Amherst, MA for RLC 2024, come to the RLHF poster session where I’ll be presenting “Learning Action-based Representations using Invariance” We show that you can bootstrap myopic state representations to capture features relevant to long-horizon control! Link below!
1
4
20
0
0
1