At ICLR 24 we proposed 𝐃𝐮𝐚𝐥-𝐑𝐋 as a powerful approach to RL, and it is exciting to see the DeepMind project showing that it scales for 𝐋𝐋𝐌 𝐟𝐢𝐧𝐞𝐭𝐮𝐧𝐢𝐧𝐠.
For a thorough understanding of these methods (called ReCOIL and IV-Learn in our work), check out our paper. Link👇
Imitation is the foundation of #LLM training. And it is a #ReinforcementLearning problem!
Compared to supervised learning, RL (here, inverse RL) better exploits sequential structure and online data, and further extracts rewards.
Beyond thrilled for our @GoogleDeepMind paper!
I'll be starting my Ph.D. in Computer Science at @UTCompSci this fall, working with @scottniekum on 🤖 Robot Learning! Looking forward to the move to Austin 🎸!
Grateful and lucky to have found amazing mentors and friends at @SCSatCMU to make this possible.
Nice blog post on getting started in Dual RL, along with a walk-through code example (). Also shows connections to control-as-inference and to DeepMind's paper on LLM fine-tuning with inverse RL.
How can you use offline datasets to imitate when an expert only provides you with observation trajectories (without actions)? E.g., a robot's prior interaction data plus some tutorial videos.
Our #CoRL2024 paper gives a simple and principled off-policy algorithm: DILO!
In our new preprint 📜 (), we show how a number of recent SOTA methods in reinforcement and imitation learning can be unified as RL methods in dual space, using a framework originally proposed by Nachum et al.! J/w @yayitsamyzhang, @scottniekum (1/5)
Our new work, to appear at TMLR, proposes a more general framework for IRL that naturally incorporates both preferences and expert demonstrations! A summary in the 🧵 below: ()
J/w with excellent collaborators @AkankshaSaran, Wonjoon Goo, and @scottniekum.
Looking forward to starting my summer as a research intern @nvidia / @NVIDIAAI! I will be working with @DalalGal and the team on exciting reinforcement learning problems.
How can you efficiently use an H-step lookahead policy to improve performance in deep RL?
A 🧵 introducing our #CoRL2021 paper "Learning Off-Policy with Online Planning" (LOOP), accepted as an oral talk.
j/w @davheld @Wenxuan_Zhou
Paper and Website:
1/n
From Devi Parikh's (@deviparikh) inspiring keynote at #ICLR2024: "My absence wouldn't change the course of the field moving forward, so I walked away. Cruising the wave... wouldn't materialize my drive."
Highly recommended:
I'll be at RLC 2024 @RL_Conference co-hosting the workshop RL Beyond Rewards @RLBRew_2024. We have an amazing lineup of speakers and papers. Consider attending!
Excited to talk to people about unsupervised RL, preference-based RL, and efficient RL algorithm design.
🏆BEST PAPER🏆 of the Behavior-driven Autonomous Driving in Unstructured Environments workshop at @iros2022 #IROS2022 goes to 🥁 "Imitative Planning using Conditional Normalizing Flow".
Congratulations to the authors!
A single unified framework can be used to understand IL and RL algorithms and derive new ones. Curious yet? Come check out our ✨spotlight✨ work on Dual RL that will be presented in the @iclr_conf session:
📅 Thu 9 May, 10:45 a.m. - 12:45 p.m. CEST
Website:
Direct alignment algorithms (DAAs) are fast and have easy-to-tune hyperparameters, but they still suffer from a form of reward overoptimization*. We study this in detail 👇
If you are around at @corl_conf, I'll be presenting this work in the oral session at 4:30pm GMT (11:30am EST, 8:30am PST) today, and the poster at 5:15pm GMT.
#CoRL2021
Goal-conditioned RL has deep connections with distribution matching. This connection, if leveraged properly, can be used to derive performant algorithms for GCRL. Come say hello at ICLR! @iclr_conf
📅 Fri 10 May, 10:45 a.m. - 12:45 p.m. CEST
Link:
Our cross-university collaborative work on "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms" is accepted at @NeurIPSConf!
After the Llama 3.1 release and ICML, I want to highlight our paper "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms".
TL;DR: we explore the dynamics of over-optimization in DPO/IPO/SLiC and find similar "reward hacking" issues as in online RLHF. 👇
Origin: Deep RL does not work, we need tricks.
After 2 years: Let's find the theory behind these tricks and improve RL further.
After 5 years: Oh, RL can work if we remove those tricks?
Super excited that CrossQ got accepted at @iclr_conf! 🎉 We show how to effectively use #BatchNorm in #RL, yielding SOTA sample efficiency while staying as computationally efficient as SAC!
This is joint work with @aditya_bhatt 🧵
#ICLR2024
Can we extract reward functions when only an expert's state density/samples are given? Our #CORL_2020 paper derives an analytical gradient to match a general f-divergence!
Equal-contribution work with coauthors T. Ni, Y. Wang, @the_tejus, @rl_agent, B. Eysenbach
Looking for a strong offline GCRL method that also works with image observations? Our work SMoRe, accepted at ICLR 24 (@iclr_conf), combines duality with occupancy matching to give a discriminator-free algorithm.
Our work on a unified approach to imitation learning and learning from preferences was featured on the Microsoft Research Blog! Try this method out () to obtain SOTA results in imitation learning. A collab with @AkankshaSaran, W. Goo, @scottniekum
Learning from expert demonstrations and learning from behavior preferences have often been treated separately in imitation learning. A new IL framework leverages the strengths of each to improve sample efficiency and solve previously unsolvable tasks:
@McaleerStephen My understanding is that preferences can give you a partial ordering even over suboptimal trajectories, helping to better infer an expert's intent and generalize better. Just knowing the expert and imitating would generalize poorly and be full of reward ambiguity.
Check out this method from our lab for a principled approach to GCRL, developed with fundamentals from f-divergence matching.
It also shows the possibilities in designing new RL algos using the core techniques developed in our previous work, f-IRL ().
In domains with sparse rewards, reward shaping is well known to speed up learning by providing a dense learning signal.
We introduce an alternative method, f-Policy Gradients (), to obtain optimal policies through distribution matching. (1/n)
#NeurIPS2023
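To make the distribution-matching idea concrete: in the simplest case the surrogate looks like a policy gradient whose per-state "reward" is a log density ratio between the target distribution and the policy's state visitation. A heavily simplified, hypothetical sketch (f-PG derives the exact gradient for general f-divergences; the plug-in ratio below is only for intuition):

```python
def distribution_matching_pg_loss(logprob_actions, log_ratio):
    """REINFORCE-style surrogate where each visited state is scored by an
    estimate of log p_target(s) / d_pi(s). Illustrative only: f-Policy
    Gradients derives the exact gradient for general f-divergences instead
    of using this plug-in reward.

    logprob_actions: [batch] log pi(a_t | s_t) along sampled trajectories.
    log_ratio:       [batch] estimated log density ratio at those states.
    """
    return -(logprob_actions * log_ratio.detach()).mean()
```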
Check out Haoran's (@ryanxhr) neat extension of the Dual RL work, making it a SOTA method for offline RL! Turns out semi-gradient optimization may not lead to proper convergence 😮
I will attend #ICLR2024 next week, hoping to meet old and new friends in Vienna! 🇦🇹
I will present "ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update" ✨spotlight✨
A simple modification (<20 lines of code) to DICE that makes it work!
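Mechanically, my reading of the orthogonal-gradient idea, as a hypothetical PyTorch sketch (not the authors' code): split the TD-style loss into a forward gradient (through V(s)) and a backward gradient (through V(s')), then keep only the component of the backward gradient orthogonal to the forward one.

```python
import torch

def orthogonal_gradient_step(V, optimizer, s, s_next, reward, gamma=0.99, eta=1.0):
    """One value update where the gradient flowing through V(s') is projected
    to be orthogonal to the gradient flowing through V(s). Sketch only; the
    ODICE paper defines the exact (DICE) objective this is applied to."""
    params = [p for p in V.parameters() if p.requires_grad]
    v = V(s).squeeze(-1)
    td_target = reward + gamma * V(s_next).squeeze(-1)

    # Forward part: gradient through V(s) only (target detached).
    loss_f = 0.5 * ((v - td_target.detach()) ** 2).mean()
    g_f = torch.autograd.grad(loss_f, params, retain_graph=True)

    # Backward part: gradient through V(s') only (V(s) detached).
    loss_b = 0.5 * ((v.detach() - td_target) ** 2).mean()
    g_b = torch.autograd.grad(loss_b, params)

    g_f_flat = torch.cat([g.reshape(-1) for g in g_f])
    g_b_flat = torch.cat([g.reshape(-1) for g in g_b])

    # Keep only the component of the backward gradient orthogonal to g_f.
    proj = (g_b_flat @ g_f_flat) / (g_f_flat @ g_f_flat + 1e-12)
    g_b_orth = g_b_flat - proj * g_f_flat

    total, offset = g_f_flat + eta * g_b_orth, 0
    for p in params:
        n = p.numel()
        p.grad = total[offset:offset + n].reshape(p.shape).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()
```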
Research shows humans think about the effect on future optimality when giving preferences between trajectory segments. It turns out that, under such a model, you can extract soft optimal policies without even doing RL, getting the same behavior for free! Learn more about our work in the 🧵 below.
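Roughly what "extracting the soft optimal policy without RL" can look like in code: a contrastive loss that pushes the policy's scaled log-likelihood of the preferred segment above the dispreferred one. A hedged sketch with assumed shapes and an assumed temperature alpha; see the paper for the exact objective.

```python
import torch.nn.functional as F

def segment_preference_loss(logprobs_pos, logprobs_neg, alpha=0.1):
    """Contrastive segment-preference loss: the preferred segment should get a
    higher sum of alpha * log pi(a_t | s_t) (discounting omitted here).

    logprobs_pos, logprobs_neg: [batch, segment_len] tensors of log pi(a_t|s_t)
    for the preferred and dispreferred segments, respectively.
    """
    score_pos = alpha * logprobs_pos.sum(dim=-1)
    score_neg = alpha * logprobs_neg.sum(dim=-1)
    return -F.logsigmoid(score_pos - score_neg).mean()
```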
📣 Last Call for Papers
The deadline for the ICML 2024 (@icmlconf) Workshop on Models of Human Feedback for AI Alignment is approaching fast: May 31st 🗓️ 11:59 AoE.
We are looking forward to receiving your submissions!
Announcing the Workshop on Models of Human Feedback for AI Alignment at ICML 2024 (@icmlconf)! Check out the website for the call for papers and more details.
Website:
Submission Deadline: May 31, 2024
Unhappy about direct alignment/RLHF (e.g., DPO) being limited to the bandit setting and looking for an alternative for the sequential setting? Come check out our work on Contrastive Preference Learning (CPL), a principled extension, at ICLR @iclr_conf
📅 Wed 8 May, 4:30 - 6:30 p.m.
Google DeepMind presents Grandmaster-Level Chess Without Search.
Paper page:
The largest model reaches a Lichess blitz Elo of 2895 against humans and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search.
Imitation learning is one of the most widely used methods in ML, but how does compute affect its performance?
We explore this question in the challenging game of NetHack and find our scaled-up agent to outperform prior SOTA by 2x!
[1/6]
A video is now online of our ICML tutorial on Recent Advances in Population-Based Search for Deep Neural Networks: Quality Diversity, Indirect Encodings, and Open-Ended Algorithms. By myself, @joelbot3000, and @kenneth0stanley. We hope you find it valuable!
🥳RLBReW Social🥳 @ RLC (@RL_Conference)
Socials are the 🧡 of any conference. We invite you to join us on the evening of August 9 to discuss wacky RL ideas and find friends and collaborators!
RSVP here:
"New State of the Art AI Optimizer: Rectified Adam (RAdam). Improve your AI accuracy instantly versus Adam, & why it works"
It's been a long time since we've seen a new optimizer reliably beat the old favorites; this looks like a very encouraging approach!
@breadli428 The paper Dave suggested shows this is not really a problem for horizon ~1000 when you are in the trust region. But in general, I think the horizon should inversely affect the size of your trust region to show monotonic improvement.
Creating a Zoo of Atari-Playing Agents to Catalyze the Understanding of Deep Reinforcement Learning. Great work led by Joel Lehman, with many excellent collaborators from Uber AI Labs, OpenAI, & Google Brain.
Blog:...
@SOURADIPCHAKR18 @g_k_swamy @EugeneVinitsky Might just be unaware, but is there a proof or explanation for that? If you are thinking about images, there are other things in play, such as the parametrization of the outputs y and the datasets (n images for 1 class vs. 1 answer for 1 instruction), which might lead to overfitting or mode collapse here.
Oppenheimer makes you realize how close the scientific community was even without modern social media tools. Will the future look back at scientists discussing LLMs on Twitter when talking about AI?
Want to spend 4 -- 12 months building new RL algorithms?
Princeton RL Lab is hiring 1 -- 2 post-bacc/pre-doc RAs.
👇 Apply by filling out the form below:
(We're also recruiting 1 postdoc and grad students. For these positions, see )
Maybe there is a sweet spot in between which combines System-2 + System-1 reasoning, like what we proposed in LOOP (MPC with a terminal value) () and what @nicklashansen2 proposes in TD-MPC (which improves upon similar principles by adding a latent dynamics model).
Indeed, I do favor MPC over RL.
I've been making that point since at least 2016.
RL requires ridiculously large numbers of trials to learn any new task.
In contrast, MPC is zero-shot: if you have a good world model and a good task objective, MPC can solve new tasks without any
@m_wulfmeier @GoogleDeepMind It seems the derivation for the online and reformulated IQ-Learn follows what we derived previously in our Dual RL () work! We also derive the dynamics regularization + energy (MLE) interpretation. Curious about the connections and differences here.
(4/5) You don't need an LLM to show this. The community can save compute and time by testing their algorithms on our simple Tree-MDP (bringing back the age-old practice of toy MDPs in RL). The toy MDP runs in Colab in seconds.
This week at NeurIPS: We show the existence of a Spectrum of Off-Policy Estimators (SOPE) between importance sampling methods and density-based methods such as DICE, enabling a principled bias-variance tradeoff for off-policy evaluation. 1/
@EugeneVinitsky Recently went through this process. I liked reading the original paper initially, but nothing beats trying it yourself in a few lines of code through LlamaIndex, etc., to actually get a feel for what's possible and what is not.
We can't wait to see you all in less than 4 weeks for our workshop. There is more to RL than just optimizing reward functions -- Our stellar list of speakers is ready to present interesting and diverse views on the topic! Come to the workshop and interact with them.
Come check out our work "f-IRL: Inverse RL via State Marginal Matching" at the interactive session @corl_conf.
Interactive Session: 2020-11-16, 11:10 - 11:40 PST
Our paper also provides a number of interesting evaluations for DAAs (some of which were also hypothesized before): the classification accuracy of the implied reward model is poorly correlated with win rate, scaling laws, and much more.
Paper link:
(2/5) DAAs implement a loss that looks like supervised learning. So where is the reward overoptimization coming from? Irrespective of KL, it turns out the loss function DAAs use is heavily underspecified and has potentially ♾ solutions. Check out the paper for details.
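For reference, the kind of loss in question, written out as a hedged DPO-style example on summed sequence log-probs (beta and the reference model are the usual ingredients; this is an illustration, not our exact experimental setup):

```python
import torch.nn.functional as F

def daa_pairwise_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style direct-alignment loss on sequence log-probabilities.
    It looks like supervised classification of the chosen response, yet many
    different implicit rewards/policies minimize it equally well, which is
    the underspecification discussed above."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```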
RLHF has historically operated under the assumption that humans give preferences according to the total reward of segments, but this is not necessarily true --- they also reason about the cumulative suboptimality that the segment will lead to in the future. See Knox et al. ()
2/n In this work, we study the H-step lookahead theoretically and empirically, presenting an instantiation, LOOP, which improves performance in online RL, offline RL, and safe RL.
3/n Naively using H-step lookahead makes value learning computationally difficult, so we propose an efficient framework, LOOP, which learns a value function off-policy and corrects for the actor divergence via Actor Regularized Control (ARC).
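For intuition, a bare-bones random-shooting version of H-step lookahead with a terminal value function (interfaces like dynamics_model, reward_fn, and value_fn are assumed placeholders; LOOP's ARC constraint toward the learned actor is omitted):

```python
import numpy as np

def h_step_lookahead_action(state, dynamics_model, reward_fn, value_fn,
                            action_dim, horizon=5, n_samples=256, gamma=0.99):
    """Return the first action of the best sampled H-step action sequence,
    scoring each sequence by its model rollout return plus the discounted
    terminal value, which bootstraps beyond the planning horizon."""
    best_return, best_action = -np.inf, None
    for _ in range(n_samples):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = state, 0.0
        for h in range(horizon):
            ret += (gamma ** h) * reward_fn(s, actions[h])
            s = dynamics_model(s, actions[h])
        ret += (gamma ** horizon) * value_fn(s)
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action
```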
@g_k_swamy @EugeneVinitsky I don't think I ever understood why you need more sophisticated techniques. Since all we are trying to do here is match conditionals, why is KL/BC insufficient? What advantages does a min/max or IPM-like formulation in SPIN provide? Might just be missing something.
(3/6) DILO's insight? We can leverage tools from duality to directly predict the multi-step deviation from the expert's occupancy in a purely off-policy way. This builds upon our prior work on Dual-RL ().
3/7 We propose that IRL can be viewed as a 2-player ranking game between a reward player and a policy player. The reward player's job is to satisfy the rankings in the dataset, and the policy player maximizes this reward function. Expert demonstrations are simply preferred more than any other behavior.
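Concretely, the reward player's update can be as simple as a pairwise ranking loss over predicted trajectory returns (a hedged Bradley-Terry-style sketch; the paper's ranking-loss families are more general):

```python
import torch.nn.functional as F

def reward_ranking_loss(returns_preferred, returns_dispreferred):
    """Push the learned reward's predicted return of preferred trajectories
    above dispreferred ones. Expert demonstrations enter simply as
    trajectories preferred over all other behavior; the policy player then
    runs RL against the resulting reward.

    returns_*: [batch] predicted returns R_theta(tau) = sum_t r_theta(s_t, a_t).
    """
    return -F.logsigmoid(returns_preferred - returns_dispreferred).mean()
```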
If you’re in Amherst, MA for RLC 2024, come to the RLHF poster session where I’ll be presenting “Learning Action-based Representations using Invariance”
We show that you can bootstrap myopic state representations to capture features relevant to long-horizon control!
Link below!