Costa Huang

@vwxyzjn

4,998
Followers
1,423
Following
302
Media
1,564
Statuses

RLHF @allen_ai ; main dev of @cleanrl_lib ; CS PhD @DrexelUniv ; Ex @huggingface @CuraiHQ @weights_biases @NVIDIAAI @riotgames .

Philadelphia, PA
Joined March 2013
Pinned Tweet
@vwxyzjn
Costa Huang
6 months
Happy to share our work on reproducing RLHF scaling behaviors in @OpenAI 's work in summarizing from feedback. We built an RLHF pipeline from scratch and enumerated over 20+ implementation details πŸš€ Fun collab with @mnoukhov , @arianTBD , @krasul , @weixunwang , and @_lewtun πŸ“œ
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
68
347
@vwxyzjn
Costa Huang
7 months
I created prob the most accessible Direct Preference Optimization (DPO) examples with TRL πŸ€—. 114 lines of code, fits on a Colab T4 GPU, and takes ~6 minutes to run (or 44 seconds on an H100). I took @OpenAI 's earlier RLHF dataset on a stylistic task of making completions more
Tweet media one
1
68
470
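For context, a minimal sketch of what a TRL-style DPO fine-tune of the shape described above looks like. This is not the actual 114-line script: the model, dataset name, and hyperparameters are placeholders, and the exact `DPOTrainer` signature depends on the TRL version.

```
# Minimal sketch of a TRL-style DPO run; names and values below are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; the original example used a small model that fits a Colab T4
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# A preference dataset with "prompt", "chosen", and "rejected" columns (placeholder name).
dataset = load_dataset("your-org/your-preference-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the implicit KL penalty
    args=TrainingArguments(
        output_dir="dpo-demo",
        per_device_train_batch_size=4,
        learning_rate=1e-5,
        num_train_epochs=1,
        remove_unused_columns=False,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=512,
    max_prompt_length=128,
)
trainer.train()
```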
@vwxyzjn
Costa Huang
2 months
After an amazing year of building @huggingface πŸ€—, I've wrapped up my journey here. I am grateful for the opportunity to work with so many talented individuals and on exciting projects, such as TRL, Zephyr, Constitutional AI, NuminaMath 7B, PPO implementation details, online DPO
35
11
389
@vwxyzjn
Costa Huang
11 months
@tianlinliu0121 , @lvwerra , and I are happy to share a great repro of @OpenAI 's early RLHF codebase, with nearly identical learning curves. We also summarized impl details (did you know Adam impl details could impact RLHF?) πŸ“œ Blog post: A πŸ§΅πŸ‘‡
Tweet media one
9
65
339
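One concrete example of the kind of Adam detail this refers to, sketched with placeholder values: PyTorch's `torch.optim.Adam` defaults to `eps=1e-8`, while OpenAI's TF1-era RL code commonly used `eps=1e-5`, and TF1's Adam folds bias correction into the step size differently, so the same nominal settings do not produce identical updates. Pinning these knobs explicitly is one way to chase down mismatched learning curves; the values below are illustrative, not the paper's.

```
import torch
import torch.nn as nn

policy = nn.Linear(16, 1)  # stand-in for the policy network

# Explicitly pin the Adam hyperparameters instead of relying on framework defaults
# when trying to match a reference implementation's learning curves.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-6, eps=1e-5)
```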
@vwxyzjn
Costa Huang
3 years
Super excited to share this video tutorial I made implementing PPO in @PyTorch from scratch using @tensorboard and @weights_biases πŸ€– In this video, I go over 11 core implementation details. 1/18
Tweet media one
5
55
260
@vwxyzjn
Costa Huang
3 months
Wow, I am featured on @github Trending with @cleanrl_lib ! During my undergrad, I used to open up github trending every day, trying to find fun projects / developers. Now, being part of that list feels like a dream come true! πŸ€—
Tweet media one
20
10
213
@vwxyzjn
Costa Huang
1 year
Excited to share that I have started at @huggingface ! Going to continue working on research and open-sourceπŸ€—
26
11
225
@vwxyzjn
Costa Huang
3 months
It's time to put "RL" back in "RLHF". I am thrilled to introduce the RLOOTrainer (REINFORCE Leave One-Out) in TRL, which is a new online RL method for alignment that requires less GPU memory and takes less time to converge! 🀝 @CohereForAI @aahmadian_ A🧡
3
46
215
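The heart of REINFORCE Leave-One-Out is its baseline: sample k completions per prompt and baseline each sample's reward against the mean reward of the other k-1 samples, so no value network is needed. A small sketch of that advantage computation (tensor names and values are illustrative, not TRL's internals):

```
import torch

k = 4                                               # completions sampled per prompt
rewards = torch.tensor([[0.2, 1.0, 0.5, -0.3]])     # (num_prompts, k), illustrative values

# Leave-one-out baseline: for sample i, average the rewards of the other k-1 samples.
baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
advantages = rewards - baseline                     # weights the REINFORCE log-prob term

# The policy gradient loss would then look like: -(advantages.detach() * logprobs).mean()
print(advantages)
```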
@vwxyzjn
Costa Huang
2 years
Excited to share our @iclr_conf blog post "The 37 Implementation Details of Proximal Policy Optimization" πŸ“œ Blog + πŸŽ₯ Video tutorials + ⌨️ Code: 🀝 w/ @RousslanDossa , @araffin2 , Anssi Kanervisto, and @weixunwang . a thread 🧡 1/32
Tweet media one
7
53
211
@vwxyzjn
Costa Huang
2 years
πŸ”₯ CleanRL's paper has been accepted to @JmlrOrg ! Introducing @cleanrl_lib at v1.0.0! We have added reworked documentation, JAX support, hyperparameter tuning, and more. πŸ“œ Paper: πŸ’ΎRelease:
11
35
195
@vwxyzjn
Costa Huang
12 days
Happy to share that the online DPO work is now in TRL! I was shocked that it worked so well, matching RLOO / PPO's performance in TL;DR πŸ”₯ Inspired by @ShawnGuo13 , @mnoukhov prototyped an online DPO codebase, and @QGallouedec , @edwardbeeching , @_lewtun , and I brought it to TRL!
Tweet media one
3
23
180
@vwxyzjn
Costa Huang
2 years
Prototyping @cleanrl_lib 's DDPG with #jax , #flax , and #optax - I am seeing a 4x performance improvement 🀯🀯🀯🀯🀯 So yeah, JAX is coming to @cleanrl_lib πŸ˜ƒπŸ˜ƒ
Tweet media one
6
23
160
@vwxyzjn
Costa Huang
1 year
😁😁😁 I passed my Ph.D. defense! Grateful to my advisor @santiontanon , my committee members @edk208 , @v_gkatzelis , Shahin Jabbari, Simon Lucas, and my collaborators along the way. Personally, so many thanks to my parents, my partner, and my friends πŸ™
Tweet media one
31
3
155
@vwxyzjn
Costa Huang
7 months
Happy to share our Constitutional AI (CAI) recipe for Open LLMs! We leverage the @MistralAI model to self-improve on red-teaming prompts from @AnthropicAI ! Interestingly, we found CAI models to be more resilient to DAN! πŸ“– Blog:
5
29
149
@vwxyzjn
Costa Huang
2 years
🀯 85% median human normalized score, 57 Atari games, 3 seeds, finished in 6 GPU days! @cleanrl_lib now has one of the fastest PPO implementations in ALE w/ EnvPool and JAX. It could even rival SEED RL's R2D2 in the first 45 mins (*) πŸ“œ docs: A thread 🧡
Tweet media one
5
20
149
@vwxyzjn
Costa Huang
11 months
RL is such a full-stack approach tbh. If there is a bug, it could literally be anywhere in the codebase; there is not much you can do other than just 1) staring really really hard 2) thinking "step by step" again and again 3) running tons of end-to-end experiments 🫠
9
5
145
@vwxyzjn
Costa Huang
1 year
I am happy to present our work on @cleanrl_lib at #ICML2023 on Tuesday. I will talk about distilling complex RL implementations into ~400 lines of code. My DMs are open. Co-authors @RousslanDossa , @yooceii , @__dipam__ , @kinal_mehta , Jeff Braga, @_joaogui1 .
Tweet media one
2
25
130
@vwxyzjn
Costa Huang
2 years
Hi everyone - I am planning to graduate by May 2023 & now looking for a full-time position in ML research/engineering! I love reinforcement learning and open source ❀️. Please reach out and share! CV:
5
19
122
@vwxyzjn
Costa Huang
2 years
Thanks to @masud99r , @cleanrl_lib has a new algorithm called Robust Policy Optimization: a 5-line change to PPO that gets better performance in 57 out of 61 continuous-action envs πŸš€ (e.g., dm_control) πŸ“œdocs: πŸ’Ύcode: πŸ‘‡
Tweet media one
Tweet media two
5
25
123
@vwxyzjn
Costa Huang
2 years
Happy to share my @nvidia internship's work: @cleanrl_lib 's PPO now supports Isaac Gym! πŸ“œ docs: A short 🧡
2
15
117
@vwxyzjn
Costa Huang
7 months
I am excited to co-lead with @argilla_io on generating large-scale preference datasets with Mixtral and Nous-Hermes-2-Yi-34B, judged by PairRM. In particular, we used llm-swarm to spin up tens of 8xH100 inference endpoints on our Slurm cluster, finishing 1M synthetic generations
6
19
113
@vwxyzjn
Costa Huang
2 years
New year, new release πŸ”₯ Introducing the first beta of openrlbenchmark, a tool to grab tracked @weights_biases metrics from popular RL libraries, such as SB3, CleanRL, baselines, Tianshou, etc. πŸ’ΎColab: πŸ“œRelease note: ThreadπŸ§΅πŸ‘‡
Tweet media one
Tweet media two
3
27
115
@vwxyzjn
Costa Huang
2 months
Google's new Gemma 2 is out! It comes in 9B and 27B sizes, both of which are trained with RLHF and model merging. You can fine-tune the 27B model with πŸ€— TRL! With LoRA and 4-bit quantization, training takes ~30GB of GPU memory. πŸ”₯ Here is an example training run
Tweet media one
Tweet media two
Tweet media three
3
25
115
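A rough sketch of the kind of LoRA + 4-bit fine-tune described above, using TRL's `SFTTrainer` with a bitsandbytes 4-bit config and a PEFT LoRA config. The dataset name and all hyperparameters are placeholders, the memory footprint depends on sequence length and batch size, and the `SFTTrainer` signature varies across TRL versions.

```
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "google/gemma-2-27b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
dataset = load_dataset("your-org/your-sft-dataset", split="train")  # placeholder; expects a "text" column

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    args=TrainingArguments(
        output_dir="gemma2-27b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
    ),
)
trainer.train()
```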
@vwxyzjn
Costa Huang
3 years
CleanRL now has 600+ GitHub stars! Thank you πŸ™ Let's make single-file implementations of RL algorithms great! #SimplerCode Repo: Docs: Preprint: #starhistory #GitHub #OpenSource @StarHistoryHQ
Tweet media one
2
21
110
@vwxyzjn
Costa Huang
2 years
I am happy to share that @cleanrl_lib now has a benchmarked DDPG + JAX implementation that is roughly 2.5-4x faster than DDPG + @PyTorch . πŸ“œ docs:
Tweet media one
3
21
105
@vwxyzjn
Costa Huang
2 years
Thanks to @_joaogui1 's awesome contribution πŸ™, @cleanrl_lib now has a TD3 + JAX implementation that is 2-4x faster than the TD3 + @PyTorch equivalent πŸ”₯. Running on TPU is now possible, too πŸš€! πŸ“œ docs: πŸ’Ύ code: A short 🧡1/x
Tweet media one
3
20
106
@vwxyzjn
Costa Huang
1 year
I am excited to have contributed to enabling LLM + Tool + RL! Kudos to @lvwerra @younesbelkada for this fun journey πŸ€— I feel that I have learned more about LLMs in these two months than in the past year πŸ˜† Check out the @Gradio demo I helped prototype!
@lvwerra
Leandro von Werra
1 year
Introducing TextEnvironments in TRL 0.7.0! With TextEnvironments you can teach your language models to use tools to solve tasks more reliably. We trained models to use Wiki search and Python to answer trivia and math questions! Let's have a look how🧡
Tweet media one
5
35
162
3
21
103
@vwxyzjn
Costa Huang
3 years
We will be presenting our work on CleanRL at the @PyTorch Developer Day Event on Dec 2nd! Come to say hi to us :) #computervision #opensource #deeplearning #AI #PTD2
Tweet media one
5
24
104
@vwxyzjn
Costa Huang
3 years
Out of all the machine learning stuff, deep reinforcement learning requires the least tuning and works out of the box 😊
Tweet media one
561
203
4K
5
5
105
@vwxyzjn
Costa Huang
1 month
Releasing `costa_utils` to help better visualize HF datasets. No more squinting eyes at narrowly formatted texts!

```
pip install costa_utils
python -m costa_utils.hf_viz \
    --sft AI-MO/NuminaMath-TIR \
    --split train \
    --sft_messages_column_name messages \
```
Tweet media one
Tweet media two
3
19
95
@vwxyzjn
Costa Huang
3 years
Here is the second PPO video tutorial on training agents to play the Atari games with @PyTorch and @weights_biases . In this video, I go over 9 Atari-specific implementation details, matching SB3's performance with ~340 lines of code. 1/16
Tweet media one
1
15
91
@vwxyzjn
Costa Huang
11 months
So I tried running my first scaling-laws experiments with the same learning rate. The results are 🀑 Do people perform a hyperparam sweep **per model size**?
Tweet media one
7
2
91
@vwxyzjn
Costa Huang
8 months
Fun fact: DPO's reward modeling accuracy starts at 0, but there is a good reason: the policy and the ref policy are the same at initialization, so their initial logprobs are the same, and chosen_rewards = rejected_rewards = 0.
Tweet media one
Tweet media two
1
11
79
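To make the reasoning concrete: DPO's implicit reward is beta * (logprob under the policy minus logprob under the reference policy), so when the policy is initialized from the reference model both rewards are exactly zero, and the accuracy metric "chosen reward > rejected reward" starts at 0. A tiny numeric sketch with illustrative values:

```
import torch

beta = 0.1
# Summed log-probs of the chosen and rejected completions under the policy and the
# reference policy. At initialization the policy IS the reference policy, so the
# two sets of log-probs are identical.
policy_chosen_logps = torch.tensor([-42.0, -37.5])
policy_rejected_logps = torch.tensor([-45.0, -36.0])
ref_chosen_logps = policy_chosen_logps.clone()
ref_rejected_logps = policy_rejected_logps.clone()

chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # all zeros
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # all zeros
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()    # 0.0 at step 0
print(chosen_rewards, rejected_rewards, reward_accuracy)
```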
@vwxyzjn
Costa Huang
11 months
I am excited to share Cleanba, our new reproducible and efficient deep reinforcement learning platform! πŸ“œ Paper: πŸ’Ύ Repo: 🀝 w/ @Trinkle23897 @tan51616 @mavenlin @zhongwen2009 @santiontanon A thread πŸ§΅πŸ‘‡
Tweet media one
1
17
91
@vwxyzjn
Costa Huang
22 days
Yeah, @johnschulman2 's blog post on KL divergence is amazing. I tried it out with TRL at some point, but the k3 estimator exploded for some reason πŸ‘€. It's cool to see GRPO use it successfully, though; maybe it's time to revisit.
Tweet media one
@YouJiacheng
YouJiacheng
22 days
Just noticed the author of this great blog is John Schulman.🀯🀯🀯 I initially found this blog in CleanRL about 2 years ago (figure 1). And I just found it is written by John Schulman when I read DeepSeekMath paper to learn GRPO (figure 2).
Tweet media one
Tweet media two
0
21
124
2
6
88
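For context, the estimators from that blog post, in the form they usually show up in RLHF code: with r = p(x)/q(x) and samples drawn from q (the current policy, with p the reference policy), k1 = -log r is unbiased but can go negative, k2 = (log r)^2 / 2 is biased but low-variance, and k3 = (r - 1) - log r is unbiased and always non-negative (the one GRPO uses). A quick sketch with illustrative per-token log-probs:

```
import torch

# Per-token log-probs under the current policy (q) and the reference policy (p).
# Tokens are sampled from the current policy, so we estimate KL(q || p). Illustrative values.
policy_logprobs = torch.tensor([-1.2, -0.7, -2.3])
ref_logprobs = torch.tensor([-1.5, -0.9, -1.8])

logr = ref_logprobs - policy_logprobs   # log(p(x)/q(x))
k1 = -logr                              # unbiased, can be negative, higher variance
k2 = logr ** 2 / 2                      # biased, low variance
k3 = (logr.exp() - 1) - logr            # unbiased, always >= 0 (used by GRPO)
print(k1.mean(), k2.mean(), k3.mean())
```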
@vwxyzjn
Costa Huang
2 years
Working on a more explicit PPO pseudo-code that more accurately reflects the core implementation πŸ‘€ WDYT?
Tweet media one
1
7
86
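In the same spirit, here is my own condensed sketch of the core PPO update as it is typically implemented (several epochs of minibatched clipped-surrogate updates over a collected rollout). It is not the pseudo-code from the screenshot: the rollout tensors below are synthetic stand-ins, and rollout collection, GAE, and vectorized envs are omitted.

```
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
value_fn = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam([*policy.parameters(), *value_fn.parameters()], lr=3e-4, eps=1e-5)

# Synthetic stand-ins for a rollout; in a real implementation these come from
# stepping the envs and computing advantages/returns with GAE.
obs = torch.randn(256, 8)
actions = torch.randint(0, 4, (256,))
advantages = torch.randn(256)
returns = torch.randn(256)
with torch.no_grad():
    old_logprobs = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)

clip_coef, update_epochs, minibatch_size = 0.2, 4, 64
for epoch in range(update_epochs):
    for mb_inds in torch.randperm(len(obs)).split(minibatch_size):  # shuffled minibatches
        dist = torch.distributions.Categorical(logits=policy(obs[mb_inds]))
        new_logprobs = dist.log_prob(actions[mb_inds])
        ratio = (new_logprobs - old_logprobs[mb_inds]).exp()
        mb_adv = (advantages[mb_inds] - advantages[mb_inds].mean()) / (advantages[mb_inds].std() + 1e-8)
        pg_loss = torch.max(-mb_adv * ratio,
                            -mb_adv * ratio.clamp(1 - clip_coef, 1 + clip_coef)).mean()
        v_loss = 0.5 * ((value_fn(obs[mb_inds]).squeeze(-1) - returns[mb_inds]) ** 2).mean()
        loss = pg_loss + 0.5 * v_loss - 0.01 * dist.entropy().mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```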
@vwxyzjn
Costa Huang
5 months
so… prioritized experience replay?
@_akhaliq
AK
5 months
Microsoft presents Rho-1 Not All Tokens Are What You Need Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for
Tweet media one
10
79
455
6
4
86
@vwxyzjn
Costa Huang
2 months
So I woke up and my X exploded with notifications πŸ˜‚ Thanks @jsuarez5341 , and everyone should follow him, too! His work on @puffer_ai is extremely helpful for applying RL at scale and to more tasks!
0
3
76
@vwxyzjn
Costa Huang
3 years
I spent 68,002 hours training models in 2021 πŸ˜±πŸ˜±πŸ€―πŸ€―πŸ‘€πŸ‘€πŸ’ΈπŸ’Έ @weights_biases
Tweet media one
Tweet media two
5
3
73
@vwxyzjn
Costa Huang
2 years
Shout out to @kinal_mehta for contributing a JAX implementation of @marcgbellemare , @wwdabney , and Rémi Munos' **C51** to @cleanrl_lib , which is 25% faster than our @PyTorch C51 variant πŸ”₯ πŸ’Ύ code: πŸ“œ docs:
Tweet media one
1
11
73
@vwxyzjn
Costa Huang
2 years
Guess who will be running some multi-GPU experiments πŸ‘€ for @cleanrl_lib
Tweet media one
3
1
71
@vwxyzjn
Costa Huang
2 years
Spent quite some time writing documentation for multi-GPU PPO πŸ‘€
2
10
70
@vwxyzjn
Costa Huang
11 months
My PR to optax has just been merged by @matteohessel 😁! Did you know RMSprop implementation details matter? When set to align with @PyTorch 's RMSprop implementation details, Cleanba IMPALA gets a 13% increase in median HNS in Atari! What's going on? A thread 🧡
Tweet media one
5
6
68
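The detail in question, as I understand it, is largely about where epsilon enters the RMSprop denominator: adding it inside the square root versus outside gives noticeably different step sizes when the second-moment estimate is small (as it is early in training). A framework-agnostic illustration with made-up numbers, not the actual optax or PyTorch source:

```
import numpy as np

g, nu, lr, eps = 0.01, 1e-6, 2.5e-4, 1e-5  # gradient, running second moment, learning rate, epsilon

# Two common RMSprop conventions; which one a framework uses is an implementation detail
# that can visibly change learning curves when nu is small, as it is early in training.
step_eps_inside = lr * g / np.sqrt(nu + eps)     # epsilon added inside the sqrt
step_eps_outside = lr * g / (np.sqrt(nu) + eps)  # epsilon added outside the sqrt
print(step_eps_inside, step_eps_outside)         # the two steps differ by roughly 3x here
```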
@vwxyzjn
Costa Huang
1 year
Sharing some interesting ongoing benchmark efforts happening at , which help provide clearer expectations on model training! w/ different base models w/ PEFT
1
10
67
@vwxyzjn
Costa Huang
3 years
πŸŽ‰Glad to share the Gym-μRTS paper, where we train a SOTA agent in full-game μRTS with RL; detailed ablation studies are included. πŸ“œPaper: ✏️Blog & Videos: ⌨️Code: πŸ‘‰This is a thread 1/X
Tweet media one
1
22
64
@vwxyzjn
Costa Huang
9 months
Excited to present @cleanrl_lib ! Basically the nanoGPT equivalent in RL 😁
@jsuarez5341
Joseph Suarez (e/🐑)
9 months
Excited to have Costa @vwxyzjn as our third speaker at NeurIPS! It is rare to find brilliance as succinct as CleanRL, and this year's Neural MMO competition would not be possible without it.
Tweet media one
0
5
47
1
11
67
@vwxyzjn
Costa Huang
4 months
Welcome to RL πŸ˜‚. You can try to seed everything and compare `print(actor.fc1.weight.sum())` between your impl and the reference impl. Try to figure out when those sums differ. There is no magic in the world; different learning curves mean different impl details.
@KagiJournal
Kagi
4 months
I'm stunlocked, I need ML tech support. Been sitting on this for weeks now and Im probably in a cognitive local optimum and need fresh eyes on this. Is there anyone with RL experience, especially with SAC, that would be willing to help me out? Twitter pls πŸ₯Ί Details on pic
Tweet media one
4
4
43
3
5
64
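A minimal sketch of that debugging recipe: seed everything identically in both codebases, then print a cheap fingerprint of the parameters (e.g. a weight sum) at the same points in training, and bisect to the first place the numbers diverge. The `Actor` class here and names like `fc1` are placeholders for whatever layers the two implementations share.

```
import random
import numpy as np
import torch
import torch.nn as nn

def seed_everything(seed: int = 1):
    # Identical seeding in both implementations is what makes the comparison meaningful.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        return self.fc2(torch.tanh(self.fc1(x)))

seed_everything(1)
actor = Actor()
# Print the same fingerprint at the same points (after init, after each update, ...)
# in your implementation and in the reference implementation, and find the first
# place the numbers differ: that is where the implementation details diverge.
print("after init:", actor.fc1.weight.sum().item())
```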
@vwxyzjn
Costa Huang
2 years
πŸš€ @cleanrl_lib now has an experimental hyperparameter tuner that is tailored for publishing RL papers. docs: code: Here is whyπŸ‘‡
Tweet media one
2
13
65
@vwxyzjn
Costa Huang
3 years
Here is the third (and final) PPO video tutorial on training agents to perform robotics tasks with @PyTorch and @weights_biases . This video goes over eight implementation details specific to games with continuous action spaces such as Pybullet. 1/17
Tweet media one
2
17
63
@vwxyzjn
Costa Huang
9 months
Come check out my code. I will show you some beautiful code 😁πŸ”₯. And also talk about @cleanrl_lib 's hyperparameter tuning tool, RLops, Open RL Benchmark, running at scale using hundreds of GPUs with Cleanba and slurm!
@jsuarez5341
Joseph Suarez (e/🐑)
9 months
Excited to have Costa @vwxyzjn as our third speaker at NeurIPS! It is rare to find brilliance as succinct as CleanRL, and this year's Neural MMO competition would not be possible without it.
Tweet media one
0
5
47
0
3
61
@vwxyzjn
Costa Huang
2 years
Experimenting with *RLops* at @cleanrl_lib β€” we will soon be able to compare the library's performance at different versions 🀩! Rough idea: tag experiments in @weights_biases with a PR number and build tools to generate analysis! PR:
Tweet media one
Tweet media two
3
11
61
@vwxyzjn
Costa Huang
1 year
Sharing exciting updates πŸ˜€
@CohereForAI
Cohere For AI
1 year
Join us on Monday, Oct. 2nd as our Community-led Reinforcement Learning Group welcomes @vwxyzjn , ML Engineer at Hugging Face to present "Cleanba: A Reproducible Distributed Deep Reinforcement Learning Platform." Learn more:
Tweet media one
1
9
42
3
2
59
@vwxyzjn
Costa Huang
7 months
Today, we read @billyuchenlin et al.'s paper . I like their visualization techniques on what aligned models do differently ❀️. The idea is to sample a response from the aligned LLM and check if the base LLM would greedily sample the same tokens; if so, then color
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
9
57
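A rough sketch of that visualization idea, as I read the description (not the paper's code): run the base model over the aligned model's response and mark, at every position, whether the base model's greedy next-token prediction matches the token the aligned model actually produced. The model name and example response are placeholders.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper compares an aligned model against its base model.
base = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

prompt = "Question: what is reinforcement learning?\nAnswer:"
aligned_response = " Reinforcement learning trains an agent from reward signals."  # pretend this came from the aligned model

ids = tok(prompt + aligned_response, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logits = base(ids).logits
greedy_next = logits.argmax(-1)  # base model's greedy prediction for every next token

for pos in range(prompt_len, ids.shape[1]):
    actual = ids[0, pos]
    predicted = greedy_next[0, pos - 1]          # prediction made from the previous position
    agree = "=" if actual == predicted else "≠"  # "≠" tokens are where alignment changed behavior
    print(f"{agree} {tok.decode(actual)!r}")
```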
@vwxyzjn
Costa Huang
7 months
Hi πŸ™‚
@arankomatsuzaki
Aran Komatsuzaki
7 months
Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning
Tweet media one
2
37
159
1
6
58
@vwxyzjn
Costa Huang
4 months
Cleanba has been accepted at ICLR πŸ˜†! Presenting today (Wed) at 10:45 AM in Halle B, booth #285. Unfortunately, due to visa issues, I can't attend in person, but my collaborator @mavenlin will set up the poster and an iPad so I can join virtually 😊. Stop by if you'd like to chat
@vwxyzjn
Costa Huang
11 months
I am excited to share Cleanba, our new reproducible and efficient deep reinforcement learning platform! πŸ“œ Paper: πŸ’Ύ Repo: 🀝 w/ @Trinkle23897 @tan51616 @mavenlin @zhongwen2009 @santiontanon A thread πŸ§΅πŸ‘‡
Tweet media one
1
17
91
2
12
57
@vwxyzjn
Costa Huang
2 years
Played a bit with Reincarnating RL by @agarwl_ @max_a_schwarzer @pcastr β€” DQN student (IMPALA CNN) quickly surpasses its teacher (DQN Nature CNN) in 500k offline steps and 2M online steps 🀩 Looking forward to supporting reusing prior computation in RL in @cleanrl_lib !
Tweet media one
2
7
55
@vwxyzjn
Costa Huang
2 years
Shout out to @kinal_mehta and @yooceii for contributing a DQN + JAX implementation to @cleanrl_lib ! It's about 25% faster than the @PyTorch variant :D πŸ“œ docs: πŸ’Ύ code: πŸŽ™PR: Short 🧡on JAX gotchas
Tweet media one
2
13
52
@vwxyzjn
Costa Huang
2 years
@Bam4d On the other hand, RL is more alive than ever. ChatGPT is *trained with RL* and uses PPO 😁. It's the first RL application at an incredible scale, going beyond AlphaStar and OpenAI Five. A shameless plug: check out our PPO blog post πŸ˜‰
@vwxyzjn
Costa Huang
2 years
Excited to share our @iclr_conf blog post "The 37 Implementation Details of Proximal Policy Optimization" πŸ“œ Blog + πŸŽ₯ Video tutorials + ⌨️ Code: 🀝 w/ @RousslanDossa , @araffin2 , Anssi Kanervisto, and @weixunwang . a thread 🧡 1/32
Tweet media one
7
53
211
0
4
51
@vwxyzjn
Costa Huang
3 months
This is really amazing work from @simon_zhai et al. The direction is exciting because VLMs could empower a new generation of RL codebases that learn embodied tasks with much greater sample efficiency. Imagine having an LLM + RL with the human normalized score learning curve in
@simon_zhai
Simon Zhai
4 months
(1/n) Can we improve the decision-making capabilities of VLM / MLLM with end-to-end RL training? ❌reward models learned from human feedback βœ…task rewards from the environment Yes, we can! Website: A thread πŸ‘‡
7
44
318
1
9
50
@vwxyzjn
Costa Huang
10 months
I personally found the PhD quite rewarding. It’s also beyond building a product and more about things like scientific methods. My way of thinking about things has changed quite a bit 😊
@ThomasScialom
Thomas Scialom
10 months
I strongly disagree. There are many paths to success, and doing a PhD is never a suboptimal choice. Both professionally and personally.
5
1
69
0
2
45
@vwxyzjn
Costa Huang
2 years
To date, almost all deep RL libraries have implemented A2C and PPO as distinct algorithms. Our work suggests you can just implement PPO and run A2C via configurations πŸ˜€
Tweet media one
@araffin2
Antonin Raffin
2 years
Did you know that Advantage Actor Critic (A2C) was a special case of PPO? This might seem trivial to some, but it has some interesting implications. One of them is that PPO does not clip its surrogate objective in the first iteration, which is useful for debugging.
Tweet media one
1
9
38
2
4
42
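Roughly, the mapping as I recall it from the paper: run PPO for a single update epoch on a single full-batch minibatch (so the probability ratio stays at 1 and clipping never triggers), turn off advantage normalization, and match the remaining A2C defaults such as the optimizer. A sketch of the configuration with loosely CleanRL-style argument names, not an exact reproduction of the paper's settings:

```
# Approximate "A2C via PPO" configuration; argument names are CleanRL-style, used loosely.
a2c_as_ppo = dict(
    update_epochs=1,     # a single pass over the rollout...
    num_minibatches=1,   # ...on the full batch, so the probability ratio is always 1
    clip_coef=0.2,       # with ratio == 1 the clipped and unclipped objectives coincide
    norm_adv=False,      # classic A2C does not normalize advantages
    ent_coef=0.01,       # entropy and value coefficients as in common A2C defaults
    vf_coef=0.5,
)
# Exact equivalence also requires matching the optimizer (e.g. RMSprop-style settings)
# and the advantage estimator, which are omitted here.
```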
@vwxyzjn
Costa Huang
2 years
Q: If you have an initial observation S_0 and execute an action A_0 to obtain state S_1, should the reward be R_0 or R_1? πŸ€” A: Apparently, you can do both. This is an interesting "off by one" notation difference between @RichardSSutton and @johnschulman2 , as shown in the figure
Tweet media one
5
3
40
@vwxyzjn
Costa Huang
11 months
@tianlinliu0121 then did this highly technical analysis showing the reason for it πŸ”₯🀯. Sometimes I wonder if OAI's almost magical models come down to many many seemingly small stuff like this.
Tweet media one
Tweet media two
2
7
38
@vwxyzjn
Costa Huang
4 months
Experimenting with some PPO / chat recipes. I noticed there is always a drop-off in RLHF reward initially (`(score.mean() - per_token_kl.sum(1).mean())`). Do people observe similar phenomena?
Tweet media one
4
5
39
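For reference, the quantity being plotted above: the reward-model score minus a KL penalty summed over the completion tokens, with the KL estimated per token from policy and reference log-probs (the k1 estimator). A small sketch with illustrative tensors; the tweet's expression folds the KL coefficient in, here it is explicit.

```
import torch

torch.manual_seed(0)
kl_coef = 0.05
# Per-token log-probs of sampled completions under the policy and the frozen reference
# model, plus a scalar reward-model score per completion. Values are illustrative.
logprobs = torch.randn(4, 12) - 2.0       # (batch, completion_length)
ref_logprobs = torch.randn(4, 12) - 2.0
score = torch.tensor([0.3, 1.1, -0.2, 0.7])

per_token_kl = logprobs - ref_logprobs    # k1 estimator of KL(policy || ref) per token
rlhf_reward = score.mean() - kl_coef * per_token_kl.sum(1).mean()
print(rlhf_reward)
```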
@vwxyzjn
Costa Huang
2 years
@Thom_Wolf Main paper: here are steps 1-5. Appendices: exactly how to go from step 4 to step 5.
Tweet media one
1
6
38
@vwxyzjn
Costa Huang
11 months
Presenting our distributed DRL platform at 2:30 PM ET 😁 Looking forward to seeing you there!
Tweet media one
@CohereForAI
Cohere For AI
1 year
Join us on Monday, Oct. 2nd as our Community-led Reinforcement Learning Group welcomes @vwxyzjn , ML Engineer at Hugging Face to present "Cleanba: A Reproducible Distributed Deep Reinforcement Learning Platform." Learn more:
Tweet media one
1
9
42
1
6
36
@vwxyzjn
Costa Huang
4 years
Did my first self-play agent project using `slimevolleygym` from @hardmaru ! The agent seems to have already converged at 30M timesteps. Check out the ongoing training at
1
3
36
@vwxyzjn
Costa Huang
20 days
❀️uv is really one of the best initiatives that has happened in the last five years. Amazing execution, focus, and dedication. Looking forward to using it to make builds more reproducible and take 1/100th of the time pip does!
@charliermarsh
Charlie Marsh
21 days
Today, we're shipping a series of features that move uv beyond a pip alternative, and into an end-to-end solution for managing Python projects, command-line tools, single-file scripts, and even Python itself. A single, unified tool. Like Cargo, for Python. It's very fast.
Tweet media one
115
498
3K
0
0
36
@vwxyzjn
Costa Huang
4 years
πŸ‘‹Happy to announce the release of our Open RL Benchmark @ 0.3.0 , which benchmarks 34+ games with an unprecedented level of transparency, openness, and reproducibility. See for a demo. A thread ⬇️.
3
9
33
@vwxyzjn
Costa Huang
2 years
πŸ‘€ CleanRL's PPO now supports 48 `dm_control` envs and the mujoco v2 and mujoco v4 envs thanks to @DanielCHTan97 and @FaramaFound ! Windows, macOS, and Linux support! πŸ“œ docs: Details below πŸ‘‡
1
5
31
@vwxyzjn
Costa Huang
2 years
Coming to New Orleans for @NeurIPSConf today! This is gonna be my second in-person conference πŸ˜†. I will be here until 12/2. DM me if you want to meet up and chat.
2
1
30
@vwxyzjn
Costa Huang
2 months
Wow this is pretty cool!
@noahgsolomon
noah πŸ”
3 months
might work to finish my proximal policy optimization interactive 3D blog. any thoughts?
15
6
109
1
0
28
@vwxyzjn
Costa Huang
7 months
Thanks @QGallouedec for co-leading this work! Community-empowered project at its best :) Could never have done this without you all πŸ”₯
@QGallouedec
Quentin Gallouédec
7 months
πŸš€ The Open RL Benchmark paper is out! πŸ“ Open RL Benchmark is everything I love about open research: a global, decentralized and open collaboration pooling vast training data for universal benefit.
Tweet media one
3
12
66
1
8
29
@vwxyzjn
Costa Huang
2 years
@cHHillee Maybe not `torch.functional` because there is already `torch.nn.functional`, which may be confusing?
1
0
29
@vwxyzjn
Costa Huang
3 years
Wooo I wonder what’s in the box 🎁 @weights_biases Happy holiday to all πŸŽ†
Tweet media one
2
0
27
@vwxyzjn
Costa Huang
8 months
Happy New Year! Starting a Costa's helpful snippets series 😁 #1: how do you repeatedly shuffle small datasets with HF accelerate? You can use `DataLoader(shuffle=True)`, but you need to call `torch.manual_seed` before `accelerator.prepare(dataloader)`
Tweet media one
Tweet media two
0
3
24
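A minimal, self-contained sketch of that snippet as I read it: seed torch before handing the shuffling `DataLoader` to `accelerator.prepare`, so each pass over a small dataset gets a fresh but reproducible order. The dataset and batch size are stand-ins.

```
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()
dataset = list(range(10))  # stand-in for a small HF dataset

torch.manual_seed(0)  # seed BEFORE prepare() so the shuffle order is reproducible
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)
dataloader = accelerator.prepare(dataloader)

for epoch in range(3):
    # shuffle=True re-shuffles every epoch; looping over epochs "repeats" the small dataset
    for batch in dataloader:
        accelerator.print(epoch, batch)
```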
@vwxyzjn
Costa Huang
2 years
Congrats on this incredible release! Love the extensive benchmark and model curation effort!
@edwardbeeching
Edward Beeching
2 years
Announcing the release of Sample Factory 2.0. A lightning fast production grade Deep RL library. Sample Factory 2.0 is a collaboration between @petrenko_ai from @uscresl and πŸ€— @huggingface . πŸ‘‰ Find out more on this 🧡
5
42
149
1
2
25
@vwxyzjn
Costa Huang
2 years
I am so emotional 🫠... After like 20 GPU quota request rejections from @awscloud , my request finally got approved. The trick is to say "I am flexible with all instance types and regions" and really elaborate on your use case.
Tweet media one
1
3
24
@vwxyzjn
Costa Huang
3 months
GPU πŸ”₯πŸ”₯πŸ”₯πŸ”₯
@JarekLiesen
Jarek Liesen
3 months
πŸ₯³ I'm releasing Rejax, a lightweight library of fully vectorizable RL algorithms! ⚑ Enjoy lightning-fast speed using jax.jit on the training function 🧬Use vmap and pmap on hyperparameters πŸ”™ Log using flexible callbacks 🌐 Available @ πŸ“Έ Take a tour!
4
29
170
1
3
24
@vwxyzjn
Costa Huang
2 years
openrlbenchmark gives you access to the wall-clock times out-of-the-box πŸ˜› Totally agree with your point. Runtime and hardware requirements are very important information for reproducibility purposes.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@EugeneVinitsky
Eugene Vinitsky πŸ’
2 years
RL papers really need wall-clock times vs. reward curves in addition to sample complexity curves; I don't really know if a method is accessible on an academic budget without it
6
12
160
1
2
23
@vwxyzjn
Costa Huang
4 years
Great thanks to @Ivangrov for creating this amazing trailer illustrating my work co-authored with @santiontanon ! I spotted @weights_biases on @GitHub Trending about two years ago, and have been using it to manage our experiments ever since πŸš€ #ReinforcementLearning
@weights_biases
Weights & Biases
4 years
Our featured report today by @vwxyzjn illustrates their latest work "Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-time Strategy Games" in an interactive manner! Report: #MachineLearning #100DaysOfMLCode
0
3
18
2
8
21
@vwxyzjn
Costa Huang
5 months
@EugeneVinitsky Haha, there is a follow-up work on this haha. "N+ implementation details of RLHF" 😁
@vwxyzjn
Costa Huang
6 months
Happy to share our work on reproducing RLHF scaling behaviors in @OpenAI 's work in summarizing from feedback. We built an RLHF pipeline from scratch and enumerated over 20+ implementation details πŸš€ Fun collab with @mnoukhov , @arianTBD , @krasul , @weixunwang , and @_lewtun πŸ“œ
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
68
347
0
1
22
@vwxyzjn
Costa Huang
2 months
Whoa!! Super cool work using online RLHF to train SOTA multilingual LLMs! Your RLOO algorithm is amazing, and I am glad our RLOOTrainer was helpful in this research πŸ”₯πŸš€πŸš€
@aahmadian_
Arash Ahmadian
2 months
@johnamqdang @cheeesio @KreutzerJulia @ahmetustun89 @sarahookr Gotta give a massive shout-out to @vwxyzjn @_lewtun for the massive lift on getting the RLOO trainer onto TRL which we used for the training πŸ”₯ would have been a major pain w/o it :)
0
1
19
0
3
21
@vwxyzjn
Costa Huang
3 years
πŸŽ‰πŸš€Happy to announce Open RL Benchmark 0.5.0, which is an interactive, reproducible, and comprehensive benchmark of Deep Reinforcement Learning algorithms. Benchmark: CleanRL: Demo:
1
6
22
@vwxyzjn
Costa Huang
2 years
Incredible contribution by @yooceii ! The RND implementation was ~500 lines of code and easy to read! Together, we can make DRL learning infrastructure more transparent, reproducible, and efficient πŸ”₯ πŸ“œ docs: πŸ’Ύ code:
@yooceii
Chang Ye
2 years
Happy to share that @cleanrl_lib now supports Random Network Distillation + envpool. It's 3× faster than our first version without envpool and still has comparable performance to the original implementation. Say πŸ‘‹ to the long training time on hard-exploration games! DetailsπŸ‘‡
Tweet media one
1
2
15
0
2
18
@vwxyzjn
Costa Huang
4 years
Reproduction of the Categorical DQN algorithm, including the visualization of the return distribution from Bellemare's lecture! @weights_biases
1
7
18
@vwxyzjn
Costa Huang
1 year
Making some cool CIs at and @huggingface TRL. With a `/deploy` command, the CI automated benchmarking and plotting stuff 😁. This reduces a lot of the manual work, which is super nice! Great for checking regressions and figuring out when PR goes wrong πŸ”₯
Tweet media one
Tweet media two
0
0
18
@vwxyzjn
Costa Huang
10 months
And with some persistence (e.g., bug fixing) πŸ™‚. Still not great accuracy though. I also found it helpful to understand the data more deeply. For example, the policies used in the validation set are more diverse than the ones used in the training set.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@vwxyzjn
Costa Huang
11 months
So tried running my first scaling laws experiments with the same learning rate. The results are 🀑 Do people perform a hyperparam sweep **per model size**?
Tweet media one
7
2
91
1
0
18
@vwxyzjn
Costa Huang
4 years
Got the first version of self-play with microrts working πŸ˜†. Apparently the agents just learn to do a worker rush. Hoping to tune it to learn more complex behaviors!
4
4
18
@vwxyzjn
Costa Huang
9 months
Will be at NeurIPS! Feel free to DM me if you'd like to chat πŸ™‚
0
0
18
@vwxyzjn
Costa Huang
1 year
Can someone build an @overleaf copilot / AI assistant please? 🫠 I am sure it will save a lot of hand strain for Ph.D. students. I am doing a lot of AI things to improve my writing
Tweet media one
2
1
18
@vwxyzjn
Costa Huang
4 years
Oh wow probably using too much AWS! In 2020, I trained 15,035 models over 54,387 hours, with 1,387 hours of GPU training time. Get insights on your model training with @weights_biases ! #myyearwrapped
2
2
17
@vwxyzjn
Costa Huang
8 months
😈 got to love slurm
Tweet media one
1
1
17
@vwxyzjn
Costa Huang
3 months
Love the interactive charts that allow you to see the individual metrics! The difference on MMLU is huge πŸ”₯
Tweet media one
@gui_penedo
Guilherme Penedo
3 months
We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: πŸ“š FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link:
Tweet media one
38
316
1K
1
1
17
@vwxyzjn
Costa Huang
9 months
😁 fun NeurIPS!
@lvwerra
Leandro von Werra
9 months
Finally made the @huggingface x @kaggle meet happen at #NeurIPS2023 !β™₯️ (the colour scheme was totally orchestrated!)
Tweet media one
1
0
17
0
1
17
@vwxyzjn
Costa Huang
1 year
Sharing some exciting previews 😬. Thanks @jsuarez5341 for organizing!
@jsuarez5341
Joseph Suarez (e/🐑)
1 year
Costa @vwxyzjn is going to talk about Cleanba, the new distributed PPO implementation he is working on that can scale to hundreds of A100 GPUs Chris Lu: Discovery with Accelerated Evolutionary Meta-RL - How to Train RL Agents over a thousand times faster on a Single GPU
0
0
11
0
1
17
@vwxyzjn
Costa Huang
3 years
Adding some PPO documentation today and @vercel 's per-PR deployment previews are sooo good!🀯 It's much easier to share an edited draft like this with collaborators πŸ˜ƒ
Tweet media one
0
2
17
@vwxyzjn
Costa Huang
4 years
My latest work, Action Guidance, co-authored with @santiontanon , just got featured in @weights_biases 's Gallery! Big thanks to @lavanyaai for editing and featuring my report! Our use of wandb has really helped us discover insights and has since become part of my daily workflow.
@lavanyaai
Lavanya 🐝
4 years
Our first featured report today is an original work by @vwxyzjn ! He presents a novel technique, action guidance, that helps an RL agent learn the optimal policy in a game with sparse rewards, while sampling efficiently. #deeplearning #machinelearning
1
3
11
0
1
15
@vwxyzjn
Costa Huang
1 year
@EugeneVinitsky @m_wulfmeier Scaling is definitely something we want to improve for @cleanrl_lib . We are going to release *Cleanba* soon πŸ™‚. Cleanba is CleanRL-style distributed DRL with IMPALA and PPO. Implemented in JAX, runs on TPU. ~800 lines of code.
Tweet media one
2
1
16