Xuechen Li @lxuechen Twitter profile

Last Seen Profiles

@Flocavenia

@jisraperez

@Scarlotte766262

@februalovee

@WALLA_MAJIC

@NEXTGolfTour

@dawggirl27

@0kTRAND

@MinesAthletics

@yumelodii

@xeta

@LexanNark

@AMARIEBL

@nabilLaveriT

@childersmixtape

@Canope_68

@unmeshdesai

@paavola_lauri

@NorTi89388

@FierceHealth

@notyourevelia

@w3gpt_ai

@CoriBarnard1

@Davehale32

@sarahmei

@eldaajvazii

@pengen_stw

@AiyaHamrouni

@HinojosaSh37811

@ofulover

@rotinober

@OpenFPL_DAO

@highIyfavored

@caaannsu

@Kari_Squared

@7bp__

Xuechen Li

@lxuechen

3 years

I asked Codex to write a neural net in PyTorch. It then generated the exact code from an official PyTorch tutorial verbatim. Large LMs generalize surprisingly well sometimes, but training data memorization is both real and not uncommon.

9

27

298

Xuechen Li

@lxuechen

7 months

Belatedly, I finally had a chance to update the AlpacaFarm paper with DPO results. TL;DR: DPO performs similarly to RLHF+PPO but is much more memory-friendly. Previously, PPO fine-tuning took ~2 hours on 8 A100 GPUs. Our DPO runs take about the same time on 4 GPUs. DPO with LoRA

4

28

236

Xuechen Li

@lxuechen

1 year

Last week, I kinda went from being an ML researcher to an ML engineer, then to a backend developer, then to a front end developer, then to a site reliability engineer. Glad to be back to research now.

12

9

166

Xuechen Li

@lxuechen

1 year

Really grateful to be named a Meta PhD Fellow. Will keep up the work on addressing challenges in privacy, security, and safety in ML.

Announcing the 2023 Meta Research PhD Fellowship award winners | Meta Research

...

research.facebook.com

4

11

148

Xuechen Li

@lxuechen

6 months

i hope in 2024 there'd be more works discussing how capabilities emerge from specific patterns of data, rather than just model scale

7

2

51

Xuechen Li

@lxuechen

4 years

This paper from 1988 by LeCun is probably one of the first to introduce Pontryagin's maximum principle to ML. Notably, the adjoint sensitivity method used to train neural ODEs is also based on this principle. The adjoint (co-state) variable is basically a Lagrange multiplier.

Moustapha Cisse

@Moustapha_6C

4 years

@__lao__ @jerofad brought to my attention this old gem by @ylecun It explains backprop using lagrangian formalism, which (more its continuous version) is common in optimal control theory (as discussed in class).

2

6

61

1

3

37

Xuechen Li

@lxuechen

1 year

✅Training code is out. It's only 200+ lines! ✅Improved capacity for our demo. Things should now run much more smoothly.

Rohan Taori

@rtaori13

1 year

🔥🔥 Training code (and data) for Alpaca is now RELEASED! 🔥🔥 Incredibly quick work by @lxuechen @Tianyi_Zh . If you have access to LLaMA, you can now train your own Alpacas!! We also added more capacity to the demo, try it out!

3

32

151

1

0

19

Xuechen Li

@lxuechen

7 months

Here’s the updated paper and code for full context and reproducibility: paper: code:

AlpacaFarm: A Simulation Framework for Methods that Learn from...

Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood...

arxiv.org

1

3

17

Xuechen Li

@lxuechen

7 months

Heading to neurips now. Be there Mon-Fri (Dec 11-15). Feel free to reach out to chat about research on instruction-following, RLHF, data collection, security and privacy.

1

0

17

Xuechen Li

@lxuechen

1 year

It's been more than a month since we trained our first batch of RLHF models, but we wanted to do more and enable others to do this type of research as well. That's why we built a simulator where RLHF research can be done without incurring the high time and dollar cost.

Tatsunori Hashimoto

@tatsu_hashimoto

1 year

We are releasing AlpacaFarm, a simulator enabling everyone to run and study the full RLHF pipeline at a fraction of the time (<24h) and cost (<$200) w/ LLM-simulated annotators. Starting w/ Alpaca, we show RLHF gives big 10+% winrate gains vs davinci003 ()

7

134

649

3

0

16

Xuechen Li

@lxuechen

1 year

Researching large language models in academia is hard because open models usually perform much worse than closed models. This makes studying interesting phenomena and safety aspects hard.

Tatsunori Hashimoto

@tatsu_hashimoto

1 year

Instruction-following models are now ubiquitous, but API-only access limits research. Today, we’re releasing info on Alpaca (solely for research use), a small but capable 7B model based on LLaMA that often behaves like OpenAI’s text-davinci-003. Demo:

43

340

1K

1

2

14

Xuechen Li

@lxuechen

7 months

Interesting application of the Jacobi method with the extra lookahead. Reminds me of a paper I read several years ago that applied this to sampling autoregressive PixelCNN++.

lmsys.org

@lmsysorg

7 months

Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step. Blog: Code:

23

247

1K

0

11

Xuechen Li

@lxuechen

1 year

Evaluation is hard because human annotators don't always agree. There are also tricky tradeoffs like helpfulness vs harmlessness.

Jeremy Howard

@jeremyphoward

1 year

I've been looking more closely into the evaluation based on human preferences in the draft Open Assistent (OA) paper, and I'm finding it's actually a really interesting case study in how tricky evaluation is... 🧵

8

91

652

0

9

Xuechen Li

@lxuechen

7 months

@katherine1ee Most of the results on ChatGPT in the paper seem to be based on 3.5. Have you guys measured the extraction rate for GPT4?

1

0

8

Xuechen Li

@lxuechen

2 years

Lots of progress!

alewkowycz

@alewkowycz

2 years

Very excited to present Minerva🦉: a language model capable of solving mathematical questions using step-by-step natural language reasoning. Combining scale, data and others dramatically improves performance on the STEM benchmarks MATH and MMLU-STEM.

108

1K

8K

0

7

Xuechen Li

@lxuechen

1 year

Ever wondered about the legal implications of machine learning models generating content similar to copyrighted material? Our draft might provide some clarifying thoughts.

Peter Henderson

@PeterHndrsn

1 year

Wondering about the latest copyright issues related to foundation models? Check out the draft of our working paper: Foundation Models and Fair Use Link: With wonderful co-authors @lxuechen @jurafsky @tatsu_hashimoto @marklemley @percyliang 🧵👇

1

39

118

0

1

7

Xuechen Li

@lxuechen

6 months

@QuanquanGu @_zxchen_ @Yihe__Deng @HuizhuoY Cool work! Though w/o external feedback, do the self-play models plateau at the quality of the SFT data?

2

0

7

Xuechen Li

@lxuechen

7 months

Friends built some really cool stuff!

Qi Linzhi

@LinzhiQ

7 months

We built a lil toy to give GPTs access to your Mac: Me: complain to my landlord for me and my roommates pls 🥺 GPT: (reads iMessage chat with roomies, summarizes complaints, finds our landlord’s email, pulls up a drafted email)

11

16

117

0

6

Xuechen Li

@lxuechen

2 years

@DocSparse @github Wasn't able to extract your code with that particular prompt, but by prompting with the first line of the function, I was able to get the full function body.

1

0

6

Xuechen Li

@lxuechen

1 year

Will be around. Love to talk about open-source ML research!

clem 🤗

@ClementDelangue

1 year

I feel bad fueling the FOMO more but I can't restrain myself from sharing that it sounds like we might have both the alpaca author and the alpacas themself at the meetup tomorrow! I'm not sure which one I'm most excited about

11

12

198

0

1

5

Xuechen Li

@lxuechen

7 months

@EdwardSun0909 Great point and thanks for the question! We can train models on human preferences, but running human eval with the same pool of annotators would be tricky since we've already shut down the human annotation pipeline. Aside, we were very careful with how we constructed our

1

4

Xuechen Li

@lxuechen

3 years

@Massastrello @MichaelPoli6 @Diffeq_ml Really glad that you guys implemented parareal, as this has been on my todo list for ages, and yet I never got the time to do the work. Would love to try out the version you guys have at some point!

1

0

4

Xuechen Li

@lxuechen

1 year

@_mohansolo @tatsu_hashimoto Thanks for the kind words! In our automated and preliminary human evaluations, Alpaca is on par with 003 for the test suite we tried. Though, we're aware that our test suite is still relatively small, and could be expanded.

0

4

Xuechen Li

@lxuechen

2 years

Nice work from researchers at DeepMind on getting differentially private learning to work much better on ImageNet classification.

3

0

4

Xuechen Li

@lxuechen

7 months

@srush_nlp We need scalable oversight techniques :)

0

4

Xuechen Li

@lxuechen

4 years

@diegojavierzea @DavidDuvenaud @SciML_Org Hi Diego, the main goal of this release is to provide a reference implementation of our stochastic adjoint method, a new memory-efficient method we proposed for computing gradients through SDEs, and how we used it for variational inference.

1

0

4

Xuechen Li

@lxuechen

2 years

Had similar experiences reviewing for ICLR and ICML. It's not uncommon to see the technical bits of papers being glossed over.

Thomas Steinke

@shortstein

2 years

I reviewed 3 papers for ICML 2022 @icmlconf . All 3 had fundamental flaws (like theorems that are demonstrably false). None of the other 6 reviewers noticed these flaws and gave 👍. This is the sorry state of ML conferences...

46

122

1K

0

3

Xuechen Li

@lxuechen

1 year

@tengyuma @HongLiu9903 @zhiyuanli_ @dlwh @percyliang @StanfordAILab @stanfordnlp @StanfordCRFM @Stanford Is this natural gradient with diagonal approximation to Fisher? FWIW some implementations of KFAC do EMA estimates of the preconditioner with infrequent updates.

1

0

3

Xuechen Li

@lxuechen

4 years

@stuartrfarmer @DavidDuvenaud Funny you mentioned this, a proof of concept was actually done in JAX during the early stages, and we still use that codebase for our own experimentation from time to time. Though the JAX version probably have a less stable API, so we decided to release the torch version.

2

0

3

Xuechen Li

@lxuechen

4 years

@kchonyc Is there a preprint of Chandel, Joseph and Ranganath (2019)?

0

2

Xuechen Li

@lxuechen

4 years

@diegojavierzea @DavidDuvenaud @SciML_Org We don't think this is fully supported yet by the Julia libraries. We know that Julia has better handling for stiff systems. We would love to discuss with people on the Julia end about how we can make both sides better.

1

0

3

Xuechen Li

@lxuechen

3 years

@_arohan_ An earlier paper showed that DP fine-tuned text classifiers are resilient against generic membership inference attacks.

Large Scale Private Learning via Low-rank Reparametrization

We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks, which are 1) the huge memory cost of storing individual gradients,...

arxiv.org

0

3

Xuechen Li

@lxuechen

7 months

@TheGregYang @sama @MicrosoftTeams quick call on teams, and the app just wouldn't load...

0

3

Xuechen Li

@lxuechen

4 years

@ArthurGretton @MichaelArbel Nice work! On a side note, what's the relationship between your paper and on the algorithmic side?

1

0

3

Xuechen Li

@lxuechen

3 years

@MichaelPoli6 @Massastrello @Diffeq_ml @PatrickKidger @ChrisRackauckas @shaojieb @YSongStanford @jacobjinkelly Thanks for the list! BTW, we have an implementation of score-matching SDEs for a toy problem implemented here .

0

3

Xuechen Li

@lxuechen

1 year

@florian_tramer GPT4 responses here are generated afresh? I seem to get responses that don’t fully align with the explanation.

1

0

3

Xuechen Li

@lxuechen

4 years

@victorveitch Congrats!

0

2

Xuechen Li

@lxuechen

1 year

@zzhhhhhhhzz @tatsu_hashimoto Thanks for giving the demo a try! You're right. We did notice that the model wasn't generally capable in translation. This is most likely because our fine-tuning data contained few examples of translation.

2

0

2

Xuechen Li

@lxuechen

3 years

@seo_wala Yeah, I think it's definitely capable of being a smart search engine.

0

2

Xuechen Li

@lxuechen

2 years

@PhilippHennig5 @maosbot @HansKersting Congrats for the great book!!! It’s also so nice to see a book with solutions to exercises!

1

0

2

Xuechen Li

@lxuechen

3 years

@_arohan_ Really, nice work, Rohan! Thought I might bring to your attention our recent work on DP language model fine-tuning, which you may find interesting.

2

0

2

Xuechen Li

@lxuechen

1 year

Just to be clear, this was team work. But I genuinely did a lot of engineering.

0

2

Xuechen Li

@lxuechen

1 year

We’re excited about the potential to use these models to study safety issues and further improve the trustworthiness of instruction-following models.

0

2

Xuechen Li

@lxuechen

3 years

@apmotapinto I joined the waitlist for their beta trial early on and was fortunate to hear back soon. Link to register for beta trial:

1

0

2

Xuechen Li

@lxuechen

7 months

@Euclaise_ Haven't gotten to this yet and can't promise much in the short-term. We'd love to work with the community to get this done though.

0

2

Xuechen Li

@lxuechen

3 years

@atg_abhishek I think that's a likely explanation. I've seen parts of this snippet floating around at places like Stack Overflow -- the entire chunk appearing altogether seems less common.

0

2

Xuechen Li

@lxuechen

1 year

@IntuitMachine @tatsu_hashimoto @MetaAI @yizhongwyz @OpenAI @rtaori13 @__ishaan @Tianyi_Zh @yanndubs @guestrin @percyliang For inference in bf16, a single A100 40G is sufficient. 8-bit inference would further reduce the memory requirement but may reduce speed or quality.

0

2

Xuechen Li

@lxuechen

1 year

@LechMazur @tatsu_hashimoto Hi Lech, we have fine-tuned larger LLaMA variants. But due to safety concerns and our inability to serve very large models, we decided not to use them for the demo.

1

0

2

Xuechen Li

@lxuechen

2 years

@_arohan_ Congrats!

0

1

Xuechen Li

@lxuechen

1 year

@yizhongwyz Thanks for the kind words! We’re working to uncover the failure modes, improve on the current version, and sharing the knowledge.

0

1

Xuechen Li

@lxuechen

3 years

@PatrickKidger Congrats on the paper again! It was fun working together, and I learned a lot!

0

1

Xuechen Li

@lxuechen

3 years

@limits_stop Haha, yeah, I think especially likely if the human had access to wifi. Though the fact that the exact variable names are used suggests the type of memorization is pretty strong.

0

1

Xuechen Li

@lxuechen

3 years

@yass_ouali Yes, I have no doubt that this could be a good use case.

0

1

Xuechen Li

@lxuechen

3 years

@florian_tramer @Stanford @danboneh @CSatETH @GoogleAI @aterzis Congrats, Florian!

0

1

Xuechen Li

@lxuechen

1 year

We release Alpaca, a powerful 7B instruction-tuned model which often behaves like OpenAI’s text-davinci-003 in preliminary human evaluation. We hope this release will help academic researchers who want to study instruction-tuned models without being limited to only API access.

1

0

1

Xuechen Li

@lxuechen

7 months

@TheGregYang @sama @MicrosoftTeams hilarious

0

1

Xuechen Li

@lxuechen

2 years

@_sam_sinha_ @UofT @igilitschenski @DaveLindell @UofTCompSci Nice! Congrats!

1

0

1

Xuechen Li

@lxuechen

3 years

@kcjacksonwang @WuTsaiAlliance @Stanford Congrats!

0

1

Xuechen Li

@lxuechen

4 years

@stuartrfarmer @DavidDuvenaud For writing general programs with differentiable operations, I'd recommend JAX. I don't work on graphics (at least not atm), but taichi might make more sense in that realm.

1

0

1

Xuechen Li

@lxuechen

2 years

@thegautamkamath Thanks for the DP course! I learned a lot from reading the notes. Perhaps one minor comment is that there's no solution set for the HW problems, so I never got to check my answers for those :)

0

1

Xuechen Li

@lxuechen

4 years

@timudk Thanks for putting the derivation in modern notation. Though what you have derived there is essentially Pontryagin's maximum principle, which was discovered in the 1960~70s, if not even earlier.

0

1

Xuechen Li

@lxuechen

2 years

@agarwl_ Sorry to hear. Hope you feel better soon!

0

1

Xuechen Li

@lxuechen

1 year

@Haoxiang__Wang @tatsu_hashimoto @yizhongwyz @rtaori13 @__ishaan @Tianyi_Zh @yanndubs @guestrin @percyliang Hi Haoxiang, thanks for the question. Indeed, LLaMA-7B fits on a single A100, and depending on the context length, training could be done with DDP. The FSDP point in the blog post was meant to be a general point, as we also fine-tuned larger variants of LLaMA.

1

0

1

Xuechen Li

@lxuechen

3 years

@wgrathwohl Congrats, Will!

0

1

Xuechen Li

@lxuechen

6 months

@vince62s @maximelabonne @erhartford @mrm8488 good point. thanks for sharing. this shouldn't be a problem for the existing models I trained, given the model defn was frozen in the spaces repo (see this )

config.json · lxuechen/phi-2-dpo at 4077ba921e62995042b5241977c119c70e5f2e27

huggingface.co

2

0

1

Xuechen Li

@lxuechen

3 years

@mengyer @nyuniversity @CILVRatNYU @NYUDataScience @zemelgroup @RaquelUrtasun congrats!

1

0

1

Xuechen Li

@lxuechen

4 years

@JoshKoomz @DavidDuvenaud @ylecun @jarthurgross Thanks! We had some demos with heavy-tailed likelihood models before, and posterior samples there were usually quite diffuse. The demo in this thread is based on Gaussian likelihoods, so the samples are much more concentrated.

0

1

Xuechen Li

@lxuechen

28 days

@lateinteraction Congrats!

0

1

Xuechen Li

@lxuechen

3 years

@yanii @UCalgary @SchulichENGG Congrats, Yani!

0

1

Xuechen Li

@lxuechen

4 years

@ChrisRackauckas @norberg_jon @DavidDuvenaud @rtqichen @wongtkleonard The ODEs we consider usually aren't stiff. When they really become so, we might still have the option of tweaking the architecture or regularizing the dynamics, without sacrificing the ability to complete our task at hand.

2

0

1

Xuechen Li

@lxuechen

3 years

@PatrickKidger @DavidDuvenaud Congrats, Dr Kidger!

0

1

Xuechen Li

@lxuechen

4 years