Xuechen Li Profile
Xuechen Li

@lxuechen

3,152 Followers · 858 Following · 3 Media · 261 Statuses

AI @xai. PhD @Stanford. Undergrad @UofT. Worked at @GoogleAI @MSFTResearch @Vectorinst. I go by Chen.

Joined June 2015
@lxuechen
Xuechen Li
3 years
I asked Codex to write a neural net in PyTorch. It then generated the exact code from an official PyTorch tutorial verbatim. Large LMs generalize surprisingly well sometimes, but training data memorization is both real and not uncommon.
9
27
298
@lxuechen
Xuechen Li
7 months
Belatedly, I had a chance to update the AlpacaFarm paper with DPO results. TL;DR: DPO performs similarly to RLHF+PPO but is much more memory-friendly. Previously, PPO fine-tuning took ~2 hours on 8 A100 GPUs. Our DPO runs take about the same time on 4 GPUs. DPO with LoRA
4
28
236
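For reference, the DPO objective the tweet compares against RLHF+PPO reduces to a simple classification-style loss. A minimal sketch (the standard loss from Rafailov et al. 2023, not the AlpacaFarm code itself), assuming per-sequence log-probabilities under the policy and a frozen reference model are already computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-example sequence log-probabilities
    (summed over tokens) for the preferred ("chosen") and dispreferred
    ("rejected") responses, under the trained policy and a frozen
    reference model respectively.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer chosen over rejected by a larger margin
    # than the reference model does; beta controls the strength.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

The reference model only needs forward passes (no value model, reward model, or PPO optimizer state), which is where the memory savings relative to RLHF+PPO come from.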
@lxuechen
Xuechen Li
1 year
Last week, I kinda went from being an ML researcher to an ML engineer, then to a backend developer, then to a frontend developer, then to a site reliability engineer. Glad to be back to research now.
12
9
166
@lxuechen
Xuechen Li
1 year
Really grateful to be named a Meta PhD Fellow. Will keep up the work on addressing challenges in privacy, security, and safety in ML.
4
11
148
@lxuechen
Xuechen Li
6 months
i hope in 2024 there will be more work discussing how capabilities emerge from specific patterns of data, rather than just model scale
7
2
51
@lxuechen
Xuechen Li
4 years
This paper from 1988 by LeCun is probably one of the first to introduce Pontryagin's maximum principle to ML. Notably, the adjoint sensitivity method used to train neural ODEs is also based on this principle. The adjoint (co-state) variable is basically a Lagrange multiplier.
@Moustapha_6C
Moustapha Cisse
4 years
@__lao__ @jerofad brought to my attention this old gem by @ylecun. It explains backprop using Lagrangian formalism, which (or rather its continuous version) is common in optimal control theory (as discussed in class).
2
6
61
1
3
37
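To spell out the connection the tweet points at: for dynamics dx/dt = f(x, θ, t) with a terminal loss L(x(T)), the costate (adjoint) variable acts as a Lagrange multiplier on the dynamics constraint, and the standard adjoint-sensitivity equations (the same ones used for neural ODEs, not quoted from the 1988 paper) are:

```latex
\begin{aligned}
a(T) &= \frac{\partial L}{\partial x(T)},
  && \text{terminal condition for the costate } a(t) = \partial L / \partial x(t), \\
\frac{\mathrm{d}a(t)}{\mathrm{d}t} &= -\,a(t)^{\top}\,\frac{\partial f(x(t),\theta,t)}{\partial x},
  && \text{costate ODE, solved backward in time}, \\
\frac{\mathrm{d}L}{\mathrm{d}\theta} &= \int_{0}^{T} a(t)^{\top}\,\frac{\partial f(x(t),\theta,t)}{\partial \theta}\,\mathrm{d}t,
  && \text{parameter gradient accumulated along the backward pass}.
\end{aligned}
```

Discretizing these equations over layers recovers ordinary backpropagation, which is the Lagrangian view of backprop in the quoted tweet.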
@lxuechen
Xuechen Li
1 year
✅Training code is out. It's only 200+ lines! ✅Improved capacity for our demo. Things should now run much more smoothly.
@rtaori13
Rohan Taori
1 year
🔥🔥 Training code (and data) for Alpaca is now RELEASED! 🔥🔥 Incredibly quick work by @lxuechen @Tianyi_Zh. If you have access to LLaMA, you can now train your own Alpacas!! We also added more capacity to the demo, try it out!
3
32
151
1
0
19
@lxuechen
Xuechen Li
7 months
Heading to NeurIPS now. I'll be there Mon-Fri (Dec 11-15). Feel free to reach out to chat about research on instruction-following, RLHF, data collection, security, and privacy.
1
0
17
@lxuechen
Xuechen Li
1 year
It's been more than a month since we trained our first batch of RLHF models, but we wanted to do more and enable others to do this type of research as well. That's why we built a simulator where RLHF research can be done without incurring the high time and dollar cost.
@tatsu_hashimoto
Tatsunori Hashimoto
1 year
We are releasing AlpacaFarm, a simulator enabling everyone to run and study the full RLHF pipeline at a fraction of the time (<24h) and cost (<$200) w/ LLM-simulated annotators. Starting w/ Alpaca, we show RLHF gives big 10+% winrate gains vs davinci003 ()
7
134
649
3
0
16
@lxuechen
Xuechen Li
1 year
Researching large language models in academia is hard because open models usually perform much worse than closed models. This makes it difficult to study interesting phenomena and safety aspects.
@tatsu_hashimoto
Tatsunori Hashimoto
1 year
Instruction-following models are now ubiquitous, but API-only access limits research. Today, we’re releasing info on Alpaca (solely for research use), a small but capable 7B model based on LLaMA that often behaves like OpenAI’s text-davinci-003. Demo:
43
340
1K
1
2
14
@lxuechen
Xuechen Li
7 months
Interesting application of the Jacobi method with the extra lookahead. Reminds me of a paper I read several years ago that applied the same idea to sampling from the autoregressive PixelCNN++.
@lmsysorg
lmsys.org
7 months
Introducing lookahead decoding:
- a parallel decoding algo to accelerate LLM inference
- w/o the need for a draft model or a data store
- linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.
Blog: Code:
23
247
1K
0
0
11
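To make the Jacobi connection concrete, here is a toy sketch of the plain fixed-point view (ignoring the lookahead branch and n-gram cache that the lmsys work adds on top); `model` is a hypothetical stand-in that returns raw logits of shape [1, T, vocab]:

```python
import torch

@torch.no_grad()
def jacobi_greedy_decode(model, prompt_ids, gen_len, pad_id=0):
    """Toy Jacobi-style greedy decoding (no lookahead, no n-gram cache).

    Treats the greedy continuation as a fixed point of
        y[t] = argmax_v p(v | prompt, y[:t])
    and refreshes every position of the current guess in parallel on each
    iteration; it matches token-by-token greedy decoding in <= gen_len steps.
    """
    guess = torch.full((1, gen_len), pad_id, dtype=torch.long,
                       device=prompt_ids.device)
    for _ in range(gen_len):
        logits = model(torch.cat([prompt_ids, guess], dim=1))
        # Re-predict all positions of the guess at once, each conditioned on
        # the prompt plus the current guess tokens that precede it.
        new_guess = logits[:, prompt_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):
            break  # fixed point reached: identical to sequential greedy decoding
        guess = new_guess
    return guess
```

Each iteration costs one parallel forward pass, and multiple positions can stabilize per iteration, which is where the potential speedup over sequential decoding comes from.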
@lxuechen
Xuechen Li
1 year
Evaluation is hard because human annotators don't always agree. There are also tricky tradeoffs like helpfulness vs harmlessness.
@jeremyphoward
Jeremy Howard
1 year
I've been looking more closely into the evaluation based on human preferences in the draft Open Assistant (OA) paper, and I'm finding it's actually a really interesting case study in how tricky evaluation is... 🧵
8
91
652
0
0
9
@lxuechen
Xuechen Li
7 months
@katherine1ee Most of the results on ChatGPT in the paper seem to be based on 3.5. Have you guys measured the extraction rate for GPT-4?
1
0
8
@lxuechen
Xuechen Li
2 years
Lots of progress!
@alewkowycz
alewkowycz
2 years
Very excited to present Minerva🦉: a language model capable of solving mathematical questions using step-by-step natural language reasoning. Combining scale, data and others dramatically improves performance on the STEM benchmarks MATH and MMLU-STEM.
108
1K
8K
0
0
7
@lxuechen
Xuechen Li
1 year
Ever wondered about the legal implications of machine learning models generating content similar to copyrighted material? Our draft might provide some clarifying thoughts.
@PeterHndrsn
Peter Henderson
1 year
Wondering about the latest copyright issues related to foundation models? Check out the draft of our working paper: Foundation Models and Fair Use Link: With wonderful co-authors @lxuechen @jurafsky @tatsu_hashimoto @marklemley @percyliang 🧵👇
1
39
118
0
1
7
@lxuechen
Xuechen Li
6 months
@QuanquanGu @_zxchen_ @Yihe__Deng @HuizhuoY Cool work! Though w/o external feedback, do the self-play models plateau at the quality of the SFT data?
2
0
7
@lxuechen
Xuechen Li
7 months
Friends built some really cool stuff!
@LinzhiQ
Qi Linzhi
7 months
We built a lil toy to give GPTs access to your Mac: Me: complain to my landlord for me and my roommates pls 🥺 GPT: (reads iMessage chat with roomies, summarizes complaints, finds our landlord’s email, pulls up a drafted email)
11
16
117
0
0
6
@lxuechen
Xuechen Li
2 years
@DocSparse @github Wasn't able to extract your code with that particular prompt, but by prompting with the first line of the function, I was able to get the full function body.
1
0
6
@lxuechen
Xuechen Li
1 year
Will be around. Love to talk about open-source ML research!
@ClementDelangue
clem 🤗
1 year
I feel bad fueling the FOMO more but I can't restrain myself from sharing that it sounds like we might have both the alpaca author and the alpacas themself at the meetup tomorrow! I'm not sure which one I'm most excited about
11
12
198
0
1
5
@lxuechen
Xuechen Li
7 months
@EdwardSun0909 Great point and thanks for the question! We can train models on human preferences, but running human eval with the same pool of annotators would be tricky since we've already shut down the human annotation pipeline. Aside, we were very careful with how we constructed our
1
1
4
@lxuechen
Xuechen Li
3 years
@Massastrello @MichaelPoli6 @Diffeq_ml Really glad that you guys implemented parareal, as this has been on my todo list for ages, and yet I never got the time to do the work. Would love to try out the version you guys have at some point!
1
0
4
@lxuechen
Xuechen Li
1 year
@_mohansolo @tatsu_hashimoto Thanks for the kind words! In our automated and preliminary human evaluations, Alpaca is on par with 003 for the test suite we tried. That said, we're aware that our test suite is still relatively small and could be expanded.
0
0
4
@lxuechen
Xuechen Li
2 years
Nice work from researchers at DeepMind on getting differentially private learning to work much better on ImageNet classification.
3
0
4
@lxuechen
Xuechen Li
7 months
@srush_nlp We need scalable oversight techniques :)
0
0
4
@lxuechen
Xuechen Li
4 years
@diegojavierzea @DavidDuvenaud @SciML_Org Hi Diego, the main goal of this release is to provide a reference implementation of our stochastic adjoint method, a new memory-efficient method we proposed for computing gradients through SDEs, and to show how we used it for variational inference.
1
0
4
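For anyone looking at the release, a minimal usage sketch under the torchsde interface (an SDE module exposing drift `f` and diffusion `g`, integrated with `sdeint_adjoint` so gradients flow through the memory-efficient stochastic adjoint); the toy SDE below is illustrative only:

```python
import torch
import torchsde

class GeometricBrownianMotion(torch.nn.Module):
    """Toy SDE dy = theta * y dt + sigma * y dW with learnable scales."""
    noise_type = "diagonal"
    sde_type = "ito"

    def __init__(self):
        super().__init__()
        self.theta = torch.nn.Parameter(torch.tensor(0.5))
        self.sigma = torch.nn.Parameter(torch.tensor(0.2))

    def f(self, t, y):  # drift
        return self.theta * y

    def g(self, t, y):  # diffusion (diagonal noise)
        return self.sigma * y

sde = GeometricBrownianMotion()
y0 = torch.full((64, 1), 1.0)          # 64 sample paths, 1-d state
ts = torch.linspace(0.0, 1.0, 32)
# sdeint_adjoint reconstructs the trajectory during the backward pass instead
# of storing it, which is what makes the gradient computation memory-efficient.
ys = torchsde.sdeint_adjoint(sde, y0, ts, method="euler")
ys[-1].mean().backward()               # gradients w.r.t. theta and sigma
```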
@lxuechen
Xuechen Li
2 years
Had similar experiences reviewing for ICLR and ICML. It's not uncommon to see the technical bits of papers being glossed over.
@shortstein
Thomas Steinke
2 years
I reviewed 3 papers for ICML 2022 @icmlconf . All 3 had fundamental flaws (like theorems that are demonstrably false). None of the other 6 reviewers noticed these flaws and gave 👍. This is the sorry state of ML conferences...
46
122
1K
0
0
3
@lxuechen
Xuechen Li
1 year
@tengyuma @HongLiu9903 @zhiyuanli_ @dlwh @percyliang @StanfordAILab @stanfordnlp @StanfordCRFM @Stanford Is this natural gradient with a diagonal approximation of the Fisher? FWIW, some implementations of KFAC do EMA estimates of the preconditioner with infrequent updates.
1
0
3
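To illustrate the pattern mentioned in the reply (a hypothetical toy, not KFAC and not the optimizer being discussed): keep an exponential moving average of a diagonal Fisher proxy, refresh it only every few steps, and divide gradients by the damped estimate:

```python
import torch

class DiagonalFisherPreconditioner:
    """Toy diagonal-Fisher preconditioner with an infrequently refreshed EMA."""

    def __init__(self, params, decay=0.95, damping=1e-3, update_every=10):
        self.params = list(params)
        self.decay, self.damping, self.update_every = decay, damping, update_every
        self.fisher = [torch.zeros_like(p) for p in self.params]
        self.step = 0

    def maybe_update(self):
        # Refresh the EMA of squared gradients (a diagonal Fisher proxy)
        # only every `update_every` steps to amortize the cost.
        if self.step % self.update_every == 0:
            for fish, p in zip(self.fisher, self.params):
                if p.grad is not None:
                    fish.mul_(self.decay).add_(p.grad.detach() ** 2,
                                               alpha=1 - self.decay)
        self.step += 1

    def precondition(self):
        # Scale gradients by the inverse of the damped diagonal estimate,
        # i.e. an approximate natural-gradient step.
        for fish, p in zip(self.fisher, self.params):
            if p.grad is not None:
                p.grad.div_(fish + self.damping)
```

Call `maybe_update()` and then `precondition()` after `loss.backward()` and before `optimizer.step()`.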
@lxuechen
Xuechen Li
4 years
@stuartrfarmer @DavidDuvenaud Funny you mention this: a proof of concept was actually done in JAX during the early stages, and we still use that codebase for our own experimentation from time to time. The JAX version probably has a less stable API, though, so we decided to release the torch version.
2
0
3
@lxuechen
Xuechen Li
4 years
@kchonyc Is there a preprint of Chandel, Joseph and Ranganath (2019)?
0
0
2
@lxuechen
Xuechen Li
4 years
@diegojavierzea @DavidDuvenaud @SciML_Org We don't think this is fully supported yet by the Julia libraries. We know that Julia has better handling for stiff systems. We would love to discuss with people on the Julia end how we can make both sides better.
1
0
3
@lxuechen
Xuechen Li
7 months
@TheGregYang @sama @MicrosoftTeams quick call on teams, and the app just wouldn't load...
0
0
3
@lxuechen
Xuechen Li
4 years
@ArthurGretton @MichaelArbel Nice work! On a side note, what's the relationship between your paper and on the algorithmic side?
1
0
3
@lxuechen
Xuechen Li
3 years
@MichaelPoli6 @Massastrello @Diffeq_ml @PatrickKidger @ChrisRackauckas @shaojieb @YSongStanford @jacobjinkelly Thanks for the list! BTW, we have an implementation of score-matching SDEs for a toy problem here.
0
0
3
@lxuechen
Xuechen Li
1 year
@florian_tramer Are the GPT-4 responses here generated afresh? I seem to get responses that don't fully align with the explanation.
1
0
3
@lxuechen
Xuechen Li
4 years
@victorveitch Congrats!
0
0
2
@lxuechen
Xuechen Li
1 year
@zzhhhhhhhzz @tatsu_hashimoto Thanks for giving the demo a try! You're right. We did notice that the model wasn't very capable at translation. This is most likely because our fine-tuning data contained few examples of translation.
2
0
2
@lxuechen
Xuechen Li
3 years
@seo_wala Yeah, I think it's definitely capable of being a smart search engine.
0
0
2
@lxuechen
Xuechen Li
2 years
@PhilippHennig5 @maosbot @HansKersting Congrats on the great book!!! It's also so nice to see a book with solutions to the exercises!
1
0
2
@lxuechen
Xuechen Li
3 years
@_arohan_ Really nice work, Rohan! Thought I might bring to your attention our recent work on DP language model fine-tuning, which you may find interesting.
2
0
2
@lxuechen
Xuechen Li
1 year
Just to be clear, this was teamwork. But I genuinely did a lot of engineering.
0
0
2
@lxuechen
Xuechen Li
1 year
We’re excited about the potential to use these models to study safety issues and further improve the trustworthiness of instruction-following models.
0
0
2
@lxuechen
Xuechen Li
3 years
@apmotapinto I joined the waitlist for their beta trial early on and was fortunate to hear back soon. Link to register for beta trial:
1
0
2
@lxuechen
Xuechen Li
7 months
@Euclaise_ Haven't gotten to this yet and can't promise much in the short term. We'd love to work with the community to get this done though.
0
0
2
@lxuechen
Xuechen Li
3 years
@atg_abhishek I think that's a likely explanation. I've seen parts of this snippet floating around in places like Stack Overflow -- the entire chunk appearing together seems less common.
0
0
2
@lxuechen
Xuechen Li
1 year
@IntuitMachine @tatsu_hashimoto @MetaAI @yizhongwyz @OpenAI @rtaori13 @__ishaan @Tianyi_Zh @yanndubs @guestrin @percyliang For inference in bf16, a single A100 40G is sufficient. 8-bit inference would further reduce the memory requirement but may reduce speed or quality.
0
0
2
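A minimal loading sketch with the Hugging Face transformers options of that era (the checkpoint path is hypothetical, and 8-bit loading assumes bitsandbytes is installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-7b"  # hypothetical local checkpoint

# bf16 inference: roughly 2 bytes per parameter, so a 7B model fits on one A100 40G.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 8-bit inference: roughly 1 byte per parameter, at some cost in speed/quality.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("What is instruction tuning?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```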
@lxuechen
Xuechen Li
1 year
@LechMazur @tatsu_hashimoto Hi Lech, we have fine-tuned larger LLaMA variants. But due to safety concerns and our inability to serve very large models, we decided not to use them for the demo.
1
0
2
@lxuechen
Xuechen Li
2 years
@_arohan_ Congrats!
0
0
1
@lxuechen
Xuechen Li
1 year
@yizhongwyz Thanks for the kind words! We're working to uncover the failure modes, improve on the current version, and share the knowledge.
0
0
1
@lxuechen
Xuechen Li
3 years
@PatrickKidger Congrats on the paper again! It was fun working together, and I learned a lot!
0
0
1
@lxuechen
Xuechen Li
3 years
@limits_stop Haha, yeah, I think that's especially likely if the human had access to wifi. Though the fact that the exact variable names are used suggests the memorization is pretty strong.
0
0
1
@lxuechen
Xuechen Li
3 years
@yass_ouali Yes, I have no doubt that this could be a good use case.
0
0
1
@lxuechen
Xuechen Li
1 year
We release Alpaca, a powerful 7B instruction-tuned model which often behaves like OpenAI’s text-davinci-003 in preliminary human evaluation. We hope this release will help academic researchers who want to study instruction-tuned models without being limited to only API access.
1
0
1
@lxuechen
Xuechen Li
4 years
@stuartrfarmer @DavidDuvenaud For writing general programs with differentiable operations, I'd recommend JAX. I don't work on graphics (at least not atm), but taichi might make more sense in that realm.
1
0
1
@lxuechen
Xuechen Li
2 years
@thegautamkamath Thanks for the DP course! I learned a lot from reading the notes. Perhaps one minor comment is that there's no solution set for the HW problems, so I never got to check my answers for those :)
0
0
1
@lxuechen
Xuechen Li
4 years
@timudk Thanks for putting the derivation in modern notation. Though what you have derived there is essentially Pontryagin's maximum principle, which dates back to the 1950s-60s.
0
0
1
@lxuechen
Xuechen Li
2 years
@agarwl_ Sorry to hear. Hope you feel better soon!
0
0
1
@lxuechen
Xuechen Li
1 year
@Haoxiang__Wang @tatsu_hashimoto @yizhongwyz @rtaori13 @__ishaan @Tianyi_Zh @yanndubs @guestrin @percyliang Hi Haoxiang, thanks for the question. Indeed, LLaMA-7B fits on a single A100, and depending on the context length, training could be done with DDP. The FSDP point in the blog post was meant to be general, as we also fine-tuned larger variants of LLaMA.
1
0
1
@lxuechen
Xuechen Li
3 years
@wgrathwohl Congrats, Will!
0
0
1
@lxuechen
Xuechen Li
6 months
@vince62s @maximelabonne @erhartford @mrm8488 good point, thanks for sharing. this shouldn't be a problem for the existing models I trained, given the model definition was frozen in the spaces repo (see this)
2
0
1
@lxuechen
Xuechen Li
4 years
@JoshKoomz @DavidDuvenaud @ylecun @jarthurgross Thanks! We had some demos with heavy-tailed likelihood models before, and posterior samples there were usually quite diffuse. The demo in this thread is based on Gaussian likelihoods, so the samples are much more concentrated.
0
0
1
@lxuechen
Xuechen Li
4 years
@ChrisRackauckas @norberg_jon @DavidDuvenaud @rtqichen @wongtkleonard The ODEs we consider usually aren't stiff. When they do become stiff, we might still have the option of tweaking the architecture or regularizing the dynamics without sacrificing the ability to complete the task at hand.
2
0
1
@lxuechen
Xuechen Li
6 months
@SullyOmarr Probably true, which is why it takes a few retries to get it in my case.
1
0
1