Costa Huang

@vwxyzjn

4,998
Followers
1,423
Following
302
Media
1,564
Statuses

RLHF @allen_ai ; main dev of @cleanrl_lib ; CS PhD @DrexelUniv ; Ex @huggingface @CuraiHQ @weights_biases @NVIDIAAI @riotgames .

Philadelphia, PA
Joined March 2013
Pinned Tweet
@vwxyzjn
Costa Huang
6 months
Happy to share our work on reproducing RLHF scaling behaviors in @OpenAI 's work in summarizing from feedback. We built an RLHF pipeline from scratch and enumerated over 20+ implementation details πŸš€ Fun collab with @mnoukhov , @arianTBD , @krasul , @weixunwang , and @_lewtun πŸ“œ
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
68
347
@vwxyzjn
Costa Huang
7 months
I created prob the most accessible Direct Preference Optimization (DPO) examples with TRL πŸ€—. 114 lines of code, fits on a Colab T4 GPU, and takes ~6 minutes to run (or 44 seconds on an H100). I took @OpenAI 's earlier RLHF dataset on a stylistic task of making completions more
Tweet media one
1
68
470
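For context, a minimal sketch of what a TRL-style DPO fine-tune of the shape described above looks like. This is not the actual 114-line script: the model, dataset name, and hyperparameters are placeholders, and the exact `DPOTrainer` signature depends on the TRL version.

```
# Minimal sketch of a TRL-style DPO run; names and values below are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; the original example used a small model that fits a Colab T4
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# A preference dataset with "prompt", "chosen", and "rejected" columns (placeholder name).
dataset = load_dataset("your-org/your-preference-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the implicit KL penalty
    args=TrainingArguments(
        output_dir="dpo-demo",
        per_device_train_batch_size=4,
        learning_rate=1e-5,
        num_train_epochs=1,
        remove_unused_columns=False,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=512,
    max_prompt_length=128,
)
trainer.train()
```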
@vwxyzjn
Costa Huang
2 months
After an amazing year of building @huggingface πŸ€—, I've wrapped up my journey here. I am grateful for the opportunity to work with so many talented individuals and on exciting projects, such as TRL, Zephyr, Constitutional AI, NuminaMath 7B, PPO implementation details, online DPO
35
11
389
@vwxyzjn
Costa Huang
11 months
@tianlinliu0121 , @lvwerra , and I are happy to share a great repro of @OpenAI 's early RLHF codebase, with nearly identical learning curves. We also summarized impl details (did you know Adam impl details could impact RLHF?) πŸ“œ Blog post: A πŸ§΅πŸ‘‡
Tweet media one
9
65
339
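One concrete example of the kind of Adam detail this refers to, sketched with placeholder values: PyTorch's `torch.optim.Adam` defaults to `eps=1e-8`, while OpenAI's TF1-era RL code commonly used `eps=1e-5`, and TF1's Adam folds bias correction into the step size differently, so the same nominal settings do not produce identical updates. Pinning these knobs explicitly is one way to chase down mismatched learning curves; the values below are illustrative, not the paper's.

```
import torch
import torch.nn as nn

policy = nn.Linear(16, 1)  # stand-in for the policy network

# Explicitly pin the Adam hyperparameters instead of relying on framework defaults
# when trying to match a reference implementation's learning curves.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-6, eps=1e-5)
```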
@vwxyzjn
Costa Huang
3 years
Super excited to share this video tutorial I made implementing PPO in @PyTorch from scratch using @tensorboard and @weights_biases πŸ€– In this video, I go over 11 core implementation details. 1/18
Tweet media one
5
55
260
@vwxyzjn
Costa Huang
3 months
Wow, I am featured on @github Trending with @cleanrl_lib ! During my undergrad, I used to open up github trending every day, trying to find fun projects / developers. Now, being part of that list feels like a dream come true! πŸ€—
Tweet media one
20
10
213
@vwxyzjn
Costa Huang
1 year
Excited to share that I have started at @huggingface ! Going to continue working on research and open-sourceπŸ€—
26
11
225
@vwxyzjn
Costa Huang
3 months
It's time to put "RL" back in "RLHF". I am thrilled to introduce the RLOOTrainer (REINFORCE Leave One-Out) in TRL, which is a new online RL method for alignment that requires less GPU memory and takes less time to converge! 🀝 @CohereForAI @aahmadian_ A🧡
3
46
215
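The heart of REINFORCE Leave-One-Out is its baseline: sample k completions per prompt and baseline each sample's reward against the mean reward of the other k-1 samples, so no value network is needed. A small sketch of that advantage computation (tensor names and values are illustrative, not TRL's internals):

```
import torch

k = 4                                               # completions sampled per prompt
rewards = torch.tensor([[0.2, 1.0, 0.5, -0.3]])     # (num_prompts, k), illustrative values

# Leave-one-out baseline: for sample i, average the rewards of the other k-1 samples.
baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
advantages = rewards - baseline                     # weights the REINFORCE log-prob term

# The policy gradient loss would then look like: -(advantages.detach() * logprobs).mean()
print(advantages)
```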
@vwxyzjn
Costa Huang
2 years
Excited to share our @iclr_conf blog post "The 37 Implementation Details of Proximal Policy Optimization" πŸ“œ Blog + πŸŽ₯ Video tutorials + ⌨️ Code: 🀝 w/ @RousslanDossa , @araffin2 , Anssi Kanervisto, and @weixunwang . a thread 🧡 1/32
Tweet media one
7
53
211
@vwxyzjn
Costa Huang
2 years
πŸ”₯ CleanRL's paper has been accepted to @JmlrOrg ! Introducing @cleanrl_lib at v1.0.0! We have added reworked documentation, JAX support, hyperparameter tuning, and more. πŸ“œ Paper: πŸ’ΎRelease:
11
35
195
@vwxyzjn
Costa Huang
12 days
Happy to share that the online DPO work is now in TRL! I was shocked that it worked so well, matching RLOO / PPO's performance in TL;DR πŸ”₯ Inspired by @ShawnGuo13 , @mnoukhov prototyped an online DPO codebase, and @QGallouedec , @edwardbeeching , @_lewtun , and I brought it to TRL!
Tweet media one
3
23
180
@vwxyzjn
Costa Huang
2 years
Prototyping @cleanrl_lib 's DDPG with #jax , #flax , and #optax - I am seeing a 4x performance improvement 🀯🀯🀯🀯🀯 So yeah, JAX is coming to @cleanrl_lib πŸ˜ƒπŸ˜ƒ
Tweet media one
6
23
160
@vwxyzjn
Costa Huang
1 year
😁😁😁 I passed my Ph.D. defense! Grateful to my advisor @santiontanon , my committee members @edk208 , @v_gkatzelis , Shahin Jabbari, Simon Lucas, and my collaborators along the way. Personally, so many thanks to my parents, my partner, and my friends πŸ™
Tweet media one
31
3
155
@vwxyzjn
Costa Huang
7 months
Happy to share our Constitutional AI (CAI) recipe for Open LLMs! We leverage the @MistralAI model to self-improve on red-teaming prompts from @AnthropicAI ! Interestingly, we found CAI models to be more resilient to DAN! πŸ“– Blog:
5
29
149
@vwxyzjn
Costa Huang
2 years
🀯 85% median human normalized score, 57 Atari games, 3 seeds, finished in 6 GPU days! @cleanrl_lib now has one of the fastest PPO implementations in ALE w/ EnvPool and JAX. It could even rival SEED RL's R2D2 in the first 45 mins (*) πŸ“œ docs: A thread 🧡
Tweet media one
5
20
149
@vwxyzjn
Costa Huang
11 months
RL is such a full-stack approach tbh. If there is a bug, it could literally be anywhere in the codebase; there is not much you can do other than just 1) staring really really hard 2) thinking "step by step" again and again 3) running tons of end-to-end experiments 🫠
9
5
145
@vwxyzjn
Costa Huang
1 year
I am happy to present our work on @cleanrl_lib at #ICML2023 on Tuesday. I will talk about distilling complex RL implementations into ~400 lines of code. My DMs are open. Co-authors @RousslanDossa , @yooceii , @__dipam__ , @kinal_mehta , Jeff Braga, @_joaogui1 .
Tweet media one
2
25
130
@vwxyzjn
Costa Huang
2 years
Hi everyone - I am planning to graduate by May 2023 & now looking for a full-time position in ML research/engineering! I love reinforcement learning and open source ❀️. Please reach out and share! CV:
5
19
122
@vwxyzjn
Costa Huang
2 years
Thanks to @masud99r , @cleanrl_lib has a new algorithm called Robust Policy Optimization: a 5-line change to PPO that gets better performance in 57 out of 61 continuous-action envs πŸš€ (e.g., dm_control) πŸ“œdocs: πŸ’Ύcode: πŸ‘‡
Tweet media one
Tweet media two
5
25
123
@vwxyzjn
Costa Huang
2 years
Happy to share my @nvidia internship's work: @cleanrl_lib 's PPO now supports Isaac Gym! πŸ“œ docs: A short 🧡
2
15
117
@vwxyzjn
Costa Huang
7 months
I am excited to co-lead with @argilla_io on generating large-scale preference datasets with Mixtral and Nous-Hermes-2-Yi-34B, judged by PairRM. In particular, we used llm-swarm to spin up tens of 8xH100 inference endpoints on our Slurm cluster, finishing 1M synthetic generations
6
19
113
@vwxyzjn
Costa Huang
2 years
New year, new release πŸ”₯ Introducing the first beta of openrlbenchmark, a tool to grab tracked @weights_biases metrics from popular RL libraries, such as SB3, CleanRL, baselines, Tianshou, etc. πŸ’ΎColab: πŸ“œRelease note: ThreadπŸ§΅πŸ‘‡
Tweet media one
Tweet media two
3
27
115
@vwxyzjn
Costa Huang
2 months
Google's new Gemma 2 is out! It comes in 9B and 27B sizes, both of which are trained with RLHF and model merging. You can fine-tune the 27B model with πŸ€— TRL! With LoRA and 4-bit quantization, training takes ~30GB of GPU memory. πŸ”₯ Here is an example training run
Tweet media one
Tweet media two
Tweet media three
3
25
115
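A rough sketch of the kind of LoRA + 4-bit fine-tune described above, using TRL's `SFTTrainer` with a bitsandbytes 4-bit config and a PEFT LoRA config. The dataset name and all hyperparameters are placeholders, the memory footprint depends on sequence length and batch size, and the `SFTTrainer` signature varies across TRL versions.

```
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "google/gemma-2-27b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
dataset = load_dataset("your-org/your-sft-dataset", split="train")  # placeholder; expects a "text" column

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    args=TrainingArguments(
        output_dir="gemma2-27b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
    ),
)
trainer.train()
```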
@vwxyzjn
Costa Huang
3 years
CleanRL now has 600+ GitHub stars! Thank you πŸ™ Let's make single-file implementations of RL algorithms great! #SimplerCode Repo: Docs: Preprint: #starhistory #GitHub #OpenSource @StarHistoryHQ
Tweet media one
2
21
110
@vwxyzjn
Costa Huang
2 years
I am happy to share that @cleanrl_lib now has a benchmarked DDPG + JAX implementation that is roughly 2.5-4x faster than DDPG + @PyTorch . πŸ“œ docs:
Tweet media one
3
21
105
@vwxyzjn
Costa Huang
2 years
Thanks to @_joaogui1 's awesome contribution πŸ™, @cleanrl_lib now has a TD3 + JAX implementation that is 2-4x faster than the TD3 + @PyTorch equivalent πŸ”₯. Running on TPU is now possible, too πŸš€! πŸ“œ docs: πŸ’Ύ code: A short 🧡1/x
Tweet media one
3
20
106
@vwxyzjn
Costa Huang
1 year
I am excited to have contributed to enabling LLM + Tool + RL! Kudos to @lvwerra @younesbelkada for this fun journey πŸ€— I feel that I have learned more about LLMs in these two months than in the past year πŸ˜† Check out the @Gradio demo I helped prototype!
@lvwerra
Leandro von Werra
1 year
Introducing TextEnvironments in TRL 0.7.0! With TextEnvironments you can teach your language models to use tools to solve tasks more reliably. We trained models to use Wiki search and Python to answer trivia and math questions! Let's have a look how🧡
Tweet media one
5
35
162
3
21
103
@vwxyzjn
Costa Huang
3 years
We will be presenting our work on CleanRL at the @PyTorch Developer Day Event on Dec 2nd! Come to say hi to us :) #computervision #opensource #deeplearning #AI #PTD2
Tweet media one
5
24
104
@vwxyzjn
Costa Huang
3 years
Out of all the machine learning stuff, deep reinforcement learning requires the least tuning and works out of the box 😊
Tweet media one
561
203
4K
5
5
105
@vwxyzjn
Costa Huang
1 month
Releasing `costa_utils` to help better visualize HF datasets. No more squinting eyes at narrowly formatted texts!

```
pip install costa_utils
python -m costa_utils.hf_viz \
    --sft AI-MO/NuminaMath-TIR \
    --split train \
    --sft_messages_column_name messages \
```
Tweet media one
Tweet media two
3
19
95
@vwxyzjn
Costa Huang
3 years
Here is the second PPO video tutorial on training agents to play the Atari games with @PyTorch and @weights_biases . In this video, I go over 9 Atari-specific implementation details, matching SB3's performance with ~340 lines of code. 1/16
Tweet media one
1
15
91
@vwxyzjn
Costa Huang
11 months
So I tried running my first scaling-laws experiments with the same learning rate. The results are 🀑 Do people perform a hyperparam sweep **per model size**?
Tweet media one
7
2
91
@vwxyzjn
Costa Huang
8 months
Fun fact: DPO's reward modeling accuracy starts at 0, but there is a good reason: the policy and the ref policy are the same at initialization, so their initial logprobs are the same, and chosen_rewards = rejected_rewards = 0.
Tweet media one
Tweet media two
1
11
79
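To make the reasoning concrete: DPO's implicit reward is beta * (logprob under the policy minus logprob under the reference policy), so when the policy is initialized from the reference model both rewards are exactly zero, and the accuracy metric "chosen reward > rejected reward" starts at 0. A tiny numeric sketch with illustrative values:

```
import torch

beta = 0.1
# Summed log-probs of the chosen and rejected completions under the policy and the
# reference policy. At initialization the policy IS the reference policy, so the
# two sets of log-probs are identical.
policy_chosen_logps = torch.tensor([-42.0, -37.5])
policy_rejected_logps = torch.tensor([-45.0, -36.0])
ref_chosen_logps = policy_chosen_logps.clone()
ref_rejected_logps = policy_rejected_logps.clone()

chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # all zeros
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # all zeros
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()    # 0.0 at step 0
print(chosen_rewards, rejected_rewards, reward_accuracy)
```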
@vwxyzjn
Costa Huang
11 months
I am excited to share Cleanba, our new reproducible and efficient deep reinforcement learning platform! πŸ“œ Paper: πŸ’Ύ Repo: 🀝 w/ @Trinkle23897 @tan51616 @mavenlin @zhongwen2009 @santiontanon A thread πŸ§΅πŸ‘‡
Tweet media one
1
17
91
@vwxyzjn
Costa Huang
22 days
Yeah, @johnschulman2 's blog post on KL divergence is amazing. I tried it out with TRL at some point, but the k3 estimator exploded for some reason πŸ‘€. It's cool to see GRPO use it successfully, though; maybe it's time to revisit.
Tweet media one
@YouJiacheng
YouJiacheng
22 days
Just noticed the author of this great blog is John Schulman.🀯🀯🀯 I initially found this blog in CleanRL about 2 years ago (figure 1). And I just found it is written by John Schulman when I read DeepSeekMath paper to learn GRPO (figure 2).
Tweet media one
Tweet media two
0
21
124
2
6
88
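For context, the estimators from that blog post, in the form they usually show up in RLHF code: with r = p(x)/q(x) and samples drawn from q (the current policy, with p the reference policy), k1 = -log r is unbiased but can go negative, k2 = (log r)^2 / 2 is biased but low-variance, and k3 = (r - 1) - log r is unbiased and always non-negative (the one GRPO uses). A quick sketch with illustrative per-token log-probs:

```
import torch

# Per-token log-probs under the current policy (q) and the reference policy (p).
# Tokens are sampled from the current policy, so we estimate KL(q || p). Illustrative values.
policy_logprobs = torch.tensor([-1.2, -0.7, -2.3])
ref_logprobs = torch.tensor([-1.5, -0.9, -1.8])

logr = ref_logprobs - policy_logprobs   # log(p(x)/q(x))
k1 = -logr                              # unbiased, can be negative, higher variance
k2 = logr ** 2 / 2                      # biased, low variance
k3 = (logr.exp() - 1) - logr            # unbiased, always >= 0 (used by GRPO)
print(k1.mean(), k2.mean(), k3.mean())
```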
@vwxyzjn
Costa Huang
2 years
Working on a more explicit PPO pseudo-code that more accurately reflects the core implementation πŸ‘€ WDYT?
Tweet media one
1
7
86
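In the same spirit, here is my own condensed sketch of the core PPO update as it is typically implemented (several epochs of minibatched clipped-surrogate updates over a collected rollout). It is not the pseudo-code from the screenshot: the rollout tensors below are synthetic stand-ins, and rollout collection, GAE, and vectorized envs are omitted.

```
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
value_fn = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam([*policy.parameters(), *value_fn.parameters()], lr=3e-4, eps=1e-5)

# Synthetic stand-ins for a rollout; in a real implementation these come from
# stepping the envs and computing advantages/returns with GAE.
obs = torch.randn(256, 8)
actions = torch.randint(0, 4, (256,))
advantages = torch.randn(256)
returns = torch.randn(256)
with torch.no_grad():
    old_logprobs = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)

clip_coef, update_epochs, minibatch_size = 0.2, 4, 64
for epoch in range(update_epochs):
    for mb_inds in torch.randperm(len(obs)).split(minibatch_size):  # shuffled minibatches
        dist = torch.distributions.Categorical(logits=policy(obs[mb_inds]))
        new_logprobs = dist.log_prob(actions[mb_inds])
        ratio = (new_logprobs - old_logprobs[mb_inds]).exp()
        mb_adv = (advantages[mb_inds] - advantages[mb_inds].mean()) / (advantages[mb_inds].std() + 1e-8)
        pg_loss = torch.max(-mb_adv * ratio,
                            -mb_adv * ratio.clamp(1 - clip_coef, 1 + clip_coef)).mean()
        v_loss = 0.5 * ((value_fn(obs[mb_inds]).squeeze(-1) - returns[mb_inds]) ** 2).mean()
        loss = pg_loss + 0.5 * v_loss - 0.01 * dist.entropy().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```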
@vwxyzjn
Costa Huang
5 months
so… prioritized experience replay?
@_akhaliq
AK
5 months
Microsoft presents Rho-1 Not All Tokens Are What You Need Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for
Tweet media one
10
79
455
6
4
86
@vwxyzjn
Costa Huang
2 months
So I woke up and my X exploded with notifications πŸ˜‚ Thanks @jsuarez5341 , and everyone should follow him, too! His work on @puffer_ai is extremely helpful for applying RL at scale and to more tasks!
0
3
76
@vwxyzjn
Costa Huang
3 years
I spent 68,002 hours training models in 2021 πŸ˜±πŸ˜±πŸ€―πŸ€―πŸ‘€πŸ‘€πŸ’ΈπŸ’Έ @weights_biases
Tweet media one
Tweet media two
5
3
73
@vwxyzjn
Costa Huang
2 years
Shout out to @kinal_mehta for contributing a JAX implementation of @marcgbellemare , @wwdabney , and Rémi Munos' **C51** to @cleanrl_lib , which is 25% faster than our @PyTorch C51 variant πŸ”₯ πŸ’Ύ code: πŸ“œ docs:
Tweet media one
1
11
73
@vwxyzjn
Costa Huang
2 years
Guess who will be running some multi-GPU experiments πŸ‘€ for @cleanrl_lib
Tweet media one
3
1
71
@vwxyzjn
Costa Huang
2 years
Spent quite some time writing documentation for multi-GPU PPO πŸ‘€
2
10
70
@vwxyzjn
Costa Huang
11 months
My PR to optax has just been merged by @matteohessel 😁! Did you know RMSprop implementation details matter? When set to align with @PyTorch 's RMSprop implementation details, Cleanba IMPALA gets a 13% increase in median HNS in Atari! What's going on? A thread 🧡
Tweet media one
5
6
68
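The detail in question, as I understand it, is largely about where epsilon enters the RMSprop denominator: adding it inside the square root versus outside gives noticeably different step sizes when the second-moment estimate is small (as it is early in training). A framework-agnostic illustration with made-up numbers, not the actual optax or PyTorch source:

```
import numpy as np

g, nu, lr, eps = 0.01, 1e-6, 2.5e-4, 1e-5  # gradient, running second moment, learning rate, epsilon

# Two common RMSprop conventions; which one a framework uses is an implementation detail
# that can visibly change learning curves when nu is small, as it is early in training.
step_eps_inside = lr * g / np.sqrt(nu + eps)     # epsilon added inside the sqrt
step_eps_outside = lr * g / (np.sqrt(nu) + eps)  # epsilon added outside the sqrt
print(step_eps_inside, step_eps_outside)         # the two steps differ by roughly 3x here
```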
@vwxyzjn
Costa Huang
1 year
Sharing some interesting ongoing benchmark efforts happening at , which help provide clearer expectations on model training! w/ different base models w/ PEFT
1
10
67
@vwxyzjn
Costa Huang
3 years
πŸŽ‰Glad to share the Gym-μRTS paper, where we train a SOTA agent in full-game μRTS with RL; detailed ablation studies are included. πŸ“œPaper: ✏️Blog & Videos: ⌨️Code: πŸ‘‰This is a thread 1/X
Tweet media one
1
22
64
@vwxyzjn
Costa Huang
9 months
Excited to present @cleanrl_lib ! Basically the nanoGPT equivalent in RL 😁
@jsuarez5341
Joseph Suarez (e/🐑)
9 months
Excited to have Costa @vwxyzjn as our third speaker at NeurIPS! It is rare to find brilliance as succinct as CleanRL, and this year's Neural MMO competition would not be possible without it.
Tweet media one
0
5
47
1
11
67
@vwxyzjn
Costa Huang
4 months
Welcome to RL πŸ˜‚. You can try to seed everything and compare `print(actor.fc1.weight.sum())` between your impl and the reference impl. Try to figure out when those sums differ. There is no magic in the world; different learning curves mean different impl details.
@KagiJournal
Kagi
4 months
I'm stunlocked, I need ML tech support. Been sitting on this for weeks now and Im probably in a cognitive local optimum and need fresh eyes on this. Is there anyone with RL experience, especially with SAC, that would be willing to help me out? Twitter pls πŸ₯Ί Details on pic
Tweet media one
4
4
43
3
5
64
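A minimal sketch of that debugging recipe: seed everything identically in both codebases, then print a cheap fingerprint of the parameters (e.g. a weight sum) at the same points in training, and bisect to the first place the numbers diverge. The `Actor` class here and names like `fc1` are placeholders for whatever layers the two implementations share.

```
import random
import numpy as np
import torch
import torch.nn as nn

def seed_everything(seed: int = 1):
    # Identical seeding in both implementations is what makes the comparison meaningful.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        return self.fc2(torch.tanh(self.fc1(x)))

seed_everything(1)
actor = Actor()
# Print the same fingerprint at the same points (after init, after each update, ...)
# in your implementation and in the reference implementation, and find the first
# place the numbers differ: that is where the implementation details diverge.
print("after init:", actor.fc1.weight.sum().item())
```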
@vwxyzjn
Costa Huang
2 years
πŸš€ @cleanrl_lib now has an experimental hyperparameter tuner that is tailored for publishing RL papers. docs: code: Here is whyπŸ‘‡
Tweet media one
2
13
65
@vwxyzjn
Costa Huang
3 years
Here is the third (and final) PPO video tutorial on training agents to perform robotics tasks with @PyTorch and @weights_biases . This video goes over eight implementation details specific to games with continuous action spaces such as Pybullet. 1/17
Tweet media one
2
17
63
@vwxyzjn
Costa Huang
9 months
Come check out my code. I will show you some beautiful code 😁πŸ”₯. And also talk about @cleanrl_lib 's hyperparameter tuning tool, RLops, Open RL Benchmark, running at scale using hundreds of GPUs with Cleanba and slurm!
@jsuarez5341
Joseph Suarez (e/🐑)
9 months
Excited to have Costa @vwxyzjn as our third speaker at NeurIPS! It is rare to find brilliance as succinct as CleanRL, and this year's Neural MMO competition would not be possible without it.
Tweet media one
0
5
47
0
3
61
@vwxyzjn
Costa Huang
2 years
Experimenting with *RLops* at @cleanrl_lib β€” we will soon be able to compare the library's performance at different versions 🀩! Rough idea: tag experiments in @weights_biases with a PR number and build tools to generate analysis! PR:
Tweet media one
Tweet media two
3
11
61
@vwxyzjn
Costa Huang
1 year
Sharing exciting updates πŸ˜€
@CohereForAI
Cohere For AI
1 year
Join us on Monday, Oct. 2nd as our Community-led Reinforcement Learning Group welcomes @vwxyzjn , ML Engineer at Hugging Face to present "Cleanba: A Reproducible Distributed Deep Reinforcement Learning Platform." Learn more:
Tweet media one
1
9
42
3
2
59
@vwxyzjn
Costa Huang
7 months
Today, we read @billyuchenlin et al.'s paper . I like their visualization techniques on what aligned models do differently ❀️. The idea is to sample a response from the aligned LLM and check if the base LLM would greedily sample the same tokens; if so, then color
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
9
57
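A rough sketch of that visualization idea, as I read the description (not the paper's code): run the base model over the aligned model's response and mark, at every position, whether the base model's greedy next-token prediction matches the token the aligned model actually produced. The model name and example response are placeholders.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper compares an aligned model against its base model.
base = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "Question: what is reinforcement learning?\nAnswer:"
aligned_response = " Reinforcement learning trains an agent from reward signals."  # pretend this came from the aligned model

ids = tok(prompt + aligned_response, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logits = base(ids).logits
greedy_next = logits.argmax(-1)  # base model's greedy prediction for every next token

for pos in range(prompt_len, ids.shape[1]):
    actual = ids[0, pos]
    predicted = greedy_next[0, pos - 1]          # prediction made from the previous position
    agree = "=" if actual == predicted else "≠"  # "≠" tokens are where alignment changed behavior
    print(f"{agree} {tok.decode(actual)!r}")
```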
@vwxyzjn
Costa Huang
7 months
Hi πŸ™‚
@arankomatsuzaki
Aran Komatsuzaki
7 months
Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning
Tweet media one
2
37
159
1
6
58
@vwxyzjn
Costa Huang
4 months
Cleanba has been accepted at ICLR πŸ˜†! Presenting today (Wed) at 10:45 AM in Halle B, booth #285. Unfortunately, due to visa issues, I can't attend in person, but my collaborator @mavenlin will set up the poster and an iPad so I can join virtually 😊. Stop by if you'd like to chat
@vwxyzjn
Costa Huang
11 months
I am excited to share Cleanba, our new reproducible and efficient deep reinforcement learning platform! πŸ“œ Paper: πŸ’Ύ Repo: 🀝 w/ @Trinkle23897 @tan51616 @mavenlin @zhongwen2009 @santiontanon A thread πŸ§΅πŸ‘‡
Tweet media one
1
17
91
2
12
57
@vwxyzjn
Costa Huang
2 years
Played a bit with Reincarnating RL by @agarwl_ @max_a_schwarzer @pcastr β€” DQN student (IMPALA CNN) quickly surpasses its teacher (DQN Nature CNN) in 500k offline steps and 2M online steps 🀩 Looking forward to supporting reusing prior computation in RL in @cleanrl_lib !
Tweet media one
2
7
55
@vwxyzjn
Costa Huang
2 years
Shout out to @kinal_mehta and @yooceii for contributing a DQN + JAX implementation to @cleanrl_lib ! It's about 25% faster than the @PyTorch variant :D πŸ“œ docs: πŸ’Ύ code: πŸŽ™PR: Short 🧡on JAX gotchas
Tweet media one
2
13
52
@vwxyzjn
Costa Huang
2 years
@Bam4d On the other hand, RL is more alive than ever. ChatGPT is *trained with RL* and uses PPO 😁. It's the first RL application at an incredible scale, going beyond AlphaStar and OpenAI Five. A shameless plug: check out our PPO blog post πŸ˜‰
@vwxyzjn
Costa Huang
2 years
Excited to share our @iclr_conf blog post "The 37 Implementation Details of Proximal Policy Optimization" πŸ“œ Blog + πŸŽ₯ Video tutorials + ⌨️ Code: 🀝 w/ @RousslanDossa , @araffin2 , Anssi Kanervisto, and @weixunwang . a thread 🧡 1/32
Tweet media one
7
53
211
0
4
51
@vwxyzjn
Costa Huang
3 months
This is really amazing work from @simon_zhai et al. The direction is exciting because VLMs could empower a new generation of RL codebases that learn embodied tasks with much greater sample efficiency. Imagine having an LLM + RL with the human normalized score learning curve in
@simon_zhai
Simon Zhai
4 months
(1/n) Can we improve the decision-making capabilities of VLM / MLLM with end-to-end RL training? ❌reward models learned from human feedback βœ…task rewards from the environment Yes, we can! Website: A thread πŸ‘‡
7
44
318
1
9
50
@vwxyzjn
Costa Huang
10 months
I personally found the PhD quite rewarding. It’s also beyond building a product and more about things like scientific methods. My way of thinking about things has changed quite a bit 😊
@ThomasScialom
Thomas Scialom
10 months
I strongly disagree. There are many paths to success, and doing a PhD is never a suboptimal choice. Both professionally and personally.
5
1
69
0
2
45
@vwxyzjn
Costa Huang
2 years
To date, almost all deep RL libraries have implemented A2C and PPO as distinct algorithms. Our work suggests you can just implement PPO and run A2C via configurations πŸ˜€
Tweet media one
@araffin2
Antonin Raffin
2 years
Did you know that Advantage Actor Critic (A2C) was a special case of PPO? This might seem trivial to some, but it has some interesting implications. One of them is that PPO does not clip its surrogate objective in the first iteration, which is useful for debugging.
Tweet media one
1
9
38
2
4
42
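Roughly, the mapping as I recall it from the paper: run PPO for a single update epoch on a single full-batch minibatch (so the probability ratio stays at 1 and clipping never triggers), turn off advantage normalization, and match the remaining A2C defaults such as the optimizer. A sketch of the configuration with loosely CleanRL-style argument names, not an exact reproduction of the paper's settings:

```
# Approximate "A2C via PPO" configuration; argument names are CleanRL-style, used loosely.
a2c_as_ppo = dict(
    update_epochs=1,     # a single pass over the rollout...
    num_minibatches=1,   # ...on the full batch, so the probability ratio is always 1
    clip_coef=0.2,       # with ratio == 1 the clipped and unclipped objectives coincide
    norm_adv=False,      # classic A2C does not normalize advantages
    ent_coef=0.01,       # entropy and value coefficients as in common A2C defaults
    vf_coef=0.5,
)
# Exact equivalence also requires matching the optimizer (e.g. RMSprop-style settings)
# and the advantage estimator, which are omitted here.
```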
@vwxyzjn
Costa Huang
2 years
Q: If you have an initial observation S_0 and execute an action A_0 to obtain state S_1, should the reward be R_0 or R_1? πŸ€” A: Apparently, you can do both. This is an interesting "off by one" notation difference between @RichardSSutton and @johnschulman2 , as shown in the figure
Tweet media one
5
3
40
@vwxyzjn
Costa Huang
11 months
@tianlinliu0121 then did this highly technical analysis showing the reason for it πŸ”₯🀯. Sometimes I wonder if OAI's almost magical models come down to many many seemingly small stuff like this.
Tweet media one
Tweet media two
2
7
38
@vwxyzjn
Costa Huang
4 months
Experimenting with some PPO / chat recipes. I noticed there is always a drop-off in RLHF reward initially (`(score.mean() - per_token_kl.sum(1).mean())`). Do people observe similar phenomena?
Tweet media one
4
5
39
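For reference, the quantity being plotted above: the reward-model score minus a KL penalty summed over the completion tokens, with the KL estimated per token from policy and reference log-probs (the k1 estimator). A small sketch with illustrative tensors; the tweet's expression folds the KL coefficient in, here it is explicit.

```
import torch

torch.manual_seed(0)
kl_coef = 0.05
# Per-token log-probs of sampled completions under the policy and the frozen reference
# model, plus a scalar reward-model score per completion. Values are illustrative.
logprobs = torch.randn(4, 12) - 2.0       # (batch, completion_length)
ref_logprobs = torch.randn(4, 12) - 2.0
score = torch.tensor([0.3, 1.1, -0.2, 0.7])

per_token_kl = logprobs - ref_logprobs    # k1 estimator of KL(policy || ref) per token
rlhf_reward = score.mean() - kl_coef * per_token_kl.sum(1).mean()
print(rlhf_reward)
```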
@vwxyzjn
Costa Huang
2 years
@Thom_Wolf Main paper: here are steps 1-5. Appendices: exactly how to go from step 4 to step 5.
Tweet media one
1
6
38
@vwxyzjn
Costa Huang
11 months
Presenting our distributed DRL platform at 2:30 PM ET 😁 Looking forward to seeing you there!
Tweet media one
@CohereForAI
Cohere For AI
1 year
Join us on Monday, Oct. 2nd as our Community-led Reinforcement Learning Group welcomes @vwxyzjn , ML Engineer at Hugging Face to present "Cleanba: A Reproducible Distributed Deep Reinforcement Learning Platform." Learn more:
Tweet media one
1
9
42
1
6
36
@vwxyzjn
Costa Huang
4 years
Did my first self-play agent project using `slimevolleygym` from @hardmaru ! The agent seems to have already converged at 30M timesteps. Check out the ongoing training at
1
3
36
@vwxyzjn
Costa Huang
20 days
❀️uv is really one of the best initiatives that has happened in the last five years. Amazing execution, focus, and dedication. Looking forward to using it to make builds more reproducible and take 1/100th of the time pip does!
@charliermarsh
Charlie Marsh
21 days
Today, we're shipping a series of features that move uv beyond a pip alternative, and into an end-to-end solution for managing Python projects, command-line tools, single-file scripts, and even Python itself. A single, unified tool. Like Cargo, for Python. It's very fast.
Tweet media one
115
498
3K
0
0
36
@vwxyzjn
Costa Huang
4 years
πŸ‘‹Happy to announce the release of our Open RL Benchmark @ 0.3.0 , which benchmarks 34+ games with an unprecedented level of transparency, openness, and reproducibility. See for a demo. A thread ⬇️.
3
9
33
@vwxyzjn
Costa Huang
2 years
πŸ‘€ CleanRL's PPO now supports 48 `dm_control` envs and the mujoco v2 and mujoco v4 envs thanks to @DanielCHTan97 and @FaramaFound ! Windows, macOS, and Linux support! πŸ“œ docs: Details below πŸ‘‡
1
5
31
@vwxyzjn
Costa Huang
2 years
Coming to New Orleans for @NeurIPSConf today! This is gonna be my second in-person conference πŸ˜†. I will be here until 12/2. DM me if you want to meet up and chat.
2
1
30
@vwxyzjn
Costa Huang
2 months
Wow this is pretty cool!
@noahgsolomon
noah πŸ”
3 months
might work to finish my proximal policy optimization interactive 3D blog. any thoughts?
15
6
109
1
0
28
@vwxyzjn
Costa Huang
7 months
Thanks @QGallouedec for co-leading this work! Community-empowered project at its best :) Could never have done this without you all πŸ”₯
@QGallouedec
Quentin Gallouédec
7 months
πŸš€ The Open RL Benchmark paper is out! πŸ“ Open RL Benchmark is everything I love about open research: a global, decentralized and open collaboration pooling vast training data for universal benefit.
Tweet media one
3
12
66
1
8
29
@vwxyzjn
Costa Huang
2 years
@cHHillee Maybe not `torch.functional` because there is already `torch.nn.functional`, which may be confusing?
1
0
29
@vwxyzjn
Costa Huang
3 years
Wooo I wonder what’s in the box 🎁 @weights_biases Happy holiday to all πŸŽ†
Tweet media one
2
0
27
@vwxyzjn
Costa Huang
8 months
Happy New Year! Starting a Costa's helpful snippets series 😁 #1: how do you repeatedly shuffle small datasets with HF accelerate? You can use `DataLoader(shuffle=True)`, but you need to call `torch.manual_seed` before `accelerator.prepare(dataloader)`
Tweet media one
Tweet media two
0
3
24
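A minimal, self-contained sketch of that snippet as I read it: seed torch before handing the shuffling `DataLoader` to `accelerator.prepare`, so each pass over a small dataset gets a fresh but reproducible order. The dataset and batch size are stand-ins.

```
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
dataset = list(range(10))  # stand-in for a small HF dataset

torch.manual_seed(0)  # seed BEFORE prepare() so the shuffle order is reproducible
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
dataloader = accelerator.prepare(dataloader)

for epoch in range(3):
    # shuffle=True re-shuffles every epoch; looping over epochs "repeats" the small dataset
    for batch in dataloader:
        accelerator.print(epoch, batch)
```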
@vwxyzjn
Costa Huang
2 years
Congrats on this incredible release! Love the extensive benchmark and model curation effort!
@edwardbeeching
Edward Beeching
2 years
Announcing the release of Sample Factory 2.0. A lightning fast production grade Deep RL library. Sample Factory 2.0 is a collaboration between @petrenko_ai from @uscresl and πŸ€— @huggingface . πŸ‘‰ Find out more on this 🧡
5
42
149
1
2
25
@vwxyzjn
Costa Huang
2 years
I am so emotional 🫠... After like 20 GPU quota request rejections from @awscloud , my request finally got approved. The trick is to say "I am flexible with all instance types and regions" and really elaborate on your use case.
Tweet media one
1
3
24
@vwxyzjn
Costa Huang
3 months
GPU πŸ”₯πŸ”₯πŸ”₯πŸ”₯
@JarekLiesen
Jarek Liesen
3 months
πŸ₯³ I'm releasing Rejax, a lightweight library of fully vectorizable RL algorithms! ⚑ Enjoy lightning-fast speed using jax.jit on the training function 🧬Use vmap and pmap on hyperparameters πŸ”™ Log using flexible callbacks 🌐 Available @ πŸ“Έ Take a tour!
4
29
170
1
3
24
@vwxyzjn
Costa Huang
2 years
openrlbenchmark gives you access to the wall-clock times out-of-the-box πŸ˜› Totally agree with your point. Runtime and hardware requirements are very important information for reproducibility purposes.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@EugeneVinitsky
Eugene Vinitsky πŸ’
2 years
RL papers really need wall-clock times vs. reward curves in addition to sample complexity curves; I don't really know if a method is accessible on an academic budget without it
6
12
160
1
2
23
@vwxyzjn
Costa Huang
4 years
Great thanks to @Ivangrov for creating this amazing trailer illustrating my work co-authored with @santiontanon ! I spotted @weights_biases on @GitHub Trending about two years ago, and have been using it to manage our experiments ever since πŸš€ #ReinforcementLearning
@weights_biases
Weights & Biases
4 years
Our featured report today by @vwxyzjn illustrates their latest work "Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games" in an interactive manner! Report: #MachineLearning #100DaysOfMLCode
0
3
18
2
8
21
@vwxyzjn
Costa Huang
5 months
@EugeneVinitsky Haha, there is a follow-up work on this haha. "N+ implementation details of RLHF" 😁
@vwxyzjn
Costa Huang
6 months
Happy to share our work on reproducing RLHF scaling behaviors in @OpenAI 's work in summarizing from feedback. We built an RLHF pipeline from scratch and enumerated over 20+ implementation details πŸš€ Fun collab with @mnoukhov , @arianTBD , @krasul , @weixunwang , and @_lewtun πŸ“œ
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
68
347
0
1
22
@vwxyzjn
Costa Huang
2 months
Whoa!! Super cool work using online RLHF to train SOTA multilingual LLMs! Your RLOO algorithm is amazing, and I am glad our RLOOTrainer was helpful in this research πŸ”₯πŸš€πŸš€
@aahmadian_
Arash Ahmadian
2 months
@johnamqdang @cheeesio @KreutzerJulia @ahmetustun89 @sarahookr Gotta give a massive shout-out to @vwxyzjn @_lewtun for the massive lift on getting the RLOO trainer onto TRL which we used for the training πŸ”₯ would have been a major pain w/o it :)
0
1
19
0
3
21
@vwxyzjn
Costa Huang
3 years
πŸŽ‰πŸš€Happy to announce Open RL Benchmark 0.5.0, which is an interactive, reproducible, and comprehensive benchmark of Deep Reinforcement Learning algorithms. Benchmark: CleanRL: Demo:
1
6
22
@vwxyzjn
Costa Huang
2 years
Incredible contribution by @yooceii ! The RND implementation was ~500 lines of code and easy to read! Together, we can make DRL learning infrastructure more transparent, reproducible, and efficient πŸ”₯ πŸ“œ docs: πŸ’Ύ code:
@yooceii
Chang Ye
2 years
Happy to share that @cleanrl_lib now supports Random Network Distillation + envpool. It's 3× faster than our first version without envpool and still has comparable performance to the original implementation. Say πŸ‘‹ to the long training time on hard-exploration games! DetailsπŸ‘‡
Tweet media one
1
2
15
0
2
18
@vwxyzjn
Costa Huang
4 years
Reproduction of the Categorical DQN algorithm, including the visualization of the return distribution from Bellemare's lecture! @weights_biases
1
7
18
@vwxyzjn
Costa Huang
1 year
Making some cool CIs at and @huggingface TRL. With a `/deploy` command, the CI automated benchmarking and plotting stuff 😁. This reduces a lot of the manual work, which is super nice! Great for checking regressions and figuring out when PR goes wrong πŸ”₯
Tweet media one
Tweet media two
0
0
18
@vwxyzjn
Costa Huang
10 months
And with some persistence (e.g., bug fixing) πŸ™‚. Still not great accuracy though. I also found it helpful to understand the data more deeply. For example, the policies used in the validation set are more diverse than the ones used in the training set.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@vwxyzjn
Costa Huang
11 months
So tried running my first scaling laws experiments with the same learning rate. The results are 🀑 Do people perform a hyperparam sweep **per model size**?
Tweet media one
7
2
91
1
0
18
@vwxyzjn
Costa Huang
4 years
Got the first version of self-play with microrts working πŸ˜†. Apparently the agents just learn to do a worker rush. Hoping to tune it to learn more complex behaviors!
4
4
18
@vwxyzjn
Costa Huang
9 months
Will be at NeurIPS! Feel free to DM me if you'd like to chat πŸ™‚
0
0
18
@vwxyzjn
Costa Huang
1 year
Can someone build an @overleaf copilot / AI assistant please? 🫠 I am sure it will save a lot of hand strain for Ph.D. students. I am doing a lot of AI things to improve my writing
Tweet media one
2
1
18
@vwxyzjn
Costa Huang
4 years
Oh wow probably using too much AWS! In 2020, I trained 15,035 models over 54,387 hours, with 1,387 hours of GPU training time. Get insights on your model training with @weights_biases ! #myyearwrapped
2
2
17
@vwxyzjn
Costa Huang
8 months
😈 got to love slurm
Tweet media one
1
1
17
@vwxyzjn
Costa Huang
3 months
Love the interactive charts that allow you to see the individual metrics! The difference on MMLU is huge πŸ”₯
Tweet media one
@gui_penedo
Guilherme Penedo
3 months
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: πŸ“š FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
Tweet media one
38
316
1K
1
1
17
@vwxyzjn
Costa Huang
9 months
😁 fun NeurIPS!
@lvwerra
Leandro von Werra
9 months
Finally made the @huggingface x @kaggle meet happen at #NeurIPS2023 !β™₯️ (the colour scheme was totally orchestrated!)
Tweet media one
1
0
17
0
1
17
@vwxyzjn
Costa Huang
1 year
Sharing some exciting previews 😬. Thanks @jsuarez5341 for organizing!
@jsuarez5341
Joseph Suarez (e/🐑)
1 year
Costa @vwxyzjn is going to talk about Cleanba, the new distributed PPO implementation he is working on that can scale to hundreds of A100 GPUs Chris Lu: Discovery with Accelerated Evolutionary Meta-RL - How to Train RL Agents over a thousand times faster on a Single GPU
0
0
11
0
1
17
@vwxyzjn
Costa Huang
3 years
Adding some PPO documentation today and @vercel 's per-PR deployment previews are sooo good!🀯 It's much easier to share an edited draft like this with collaborators πŸ˜ƒ
Tweet media one
0
2
17
@vwxyzjn
Costa Huang
4 years
My latest work, Action Guidance, co-authored with @santiontanon , just got featured in @weights_biases 's Gallery! Big thanks to @lavanyaai for editing and featuring my report! Our use of wandb has really helped us discover insights and has since become part of my daily workflow.
@lavanyaai
Lavanya 🐝
4 years
Our first featured report today is an original work by @vwxyzjn ! He presents a novel technique, action guidance, that helps an RL agent learn the optimal policy in a game with sparse rewards, while sampling efficiently. #deeplearning #machinelearning
1
3
11
0
1
15
@vwxyzjn
Costa Huang
1 year
@EugeneVinitsky @m_wulfmeier Scaling is definitely something we want to improve for @cleanrl_lib . We are going to release *Cleanba* soon πŸ™‚. Cleanba is CleanRL-style distributed DRL with IMPALA and PPO. Implemented in JAX, runs on TPU. ~800 lines of code.
Tweet media one
2
1
16