Edward Hu @edwardjhu Twitter profile

Last Seen Profiles

@MargretWatson69

@Dappatron

@kig_Elle

@aljufi2012

@zcyouo

@OM_rayan_17

@nouvelordrecc

@cukienaknikmati

@Corner_Fcb

@drkkr

@Losblancos_sora

@ASCLibrary

@klau_deluca

@bntlry870663

@Kyle_probably

@last_inks

@bnt_s56246

@Salycurvymom

@RomeRamirez

@Clover_0315

@bntlry870663

@Blau_sin_Grana

@waxi_betico

@UpstateGoombah

@JJohnson_RES

@ToonIndySmile

@fouxk2

@LenaPenelope_

@gatadchalet

@HuismaDeean

@Zkabadu

@yxm84

@AnyhYmnyh40403

@yellowstune

@kairiutada

Edward Hu

@edwardjhu

3 years

After two wonderful years at Microsoft, I’m happy to share with you that I’ll join Mila in Jan 2022 as a PhD student advised by Yoshua Bengio!🚀

11

4

547

Edward Hu

@edwardjhu

3 years

Better hyperparameters🚀= bigger bang for the buck💸! What if the next trillion-parameter model could be tuned by running a tiny one w millions of params instead? Our technique, μTransfer, enables that by aligning the optimal HPs across model sizes.

3

53

398

Edward Hu

@edwardjhu

3 years

GPT-3 175B is powerful but too expensive to serve many finetuned copies. We use low-rank adaptation (LoRA) to learn task modules that are 10,000x smaller and can be swapped while the main model is frozen. No extra inference latency or quality drop! Paper:

6

73

369

Edward Hu

@edwardjhu

1 year

μTransfer allowed us to find better hyperparameters for GPT-3 on a single GPU. It has since been used by Cerebras and Google to train huge models. But how does it actually work? What is the enabling insight? Get your answers in 7 minutes👇

μTransfer: Tuning GPT-3 hyperparameters on one GPU | Explained by the...

How can one tune the hyperparameters of an enormous neural network like GPT-3 on a single GPU?**Like, subscribe, and share if you find this video valuable!**...

www.youtube.com

2

34

234

Edward Hu

@edwardjhu

3 years

Sampling from energy functions is fundamental to machine learning, but mostly done by expensive MCMC. GFlowNet is a deep learning way to amortize that cost and can produce diverse samples for RL, NLP, etc. Our latest work builds its theoretical foundation

3

37

212

Edward Hu

@edwardjhu

3 years

In June we released LoRA which adapts NNs as big as GPT-3 with few parameters yet stays performant. Our new result beats finetuned RoBERTa on GLUE with 1/8 of total parameters! Try "pip install loralib" and add LoRA to your fav model in 10 lines of code!

2

13

88

Edward Hu

@edwardjhu

11 months

OpenAI is nothing without its people.

4

3

74

Edward Hu

@edwardjhu

1 month

proud to see what i worked on at OpenAI finally shipped! go 🐢!!

Sam Altman

@sama

1 month

here is o1, a series of our most capable and aligned models yet: o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.

737

3K

18K

3

1

73

Edward Hu

@edwardjhu

9 months

Low-rank adaptation (LoRA) is one of the most popular methods for customizing large AI models.🤖 🤔What's the story behind its invention? 💡How did we come up with the idea? 🔧Should you use it? Get your answers in my new video🙌📽️

2

12

64

Edward Hu

@edwardjhu

7 months

🤨Should you care about GFlowNets? What are they anyway?🧐 Learn about how GFlowNets speed up drug discovery and help large language models reason better in my new video!🔬📚

Are GFlowNets the future of AI?

Should you care about GFlowNets? What are they anyway? Learn about how GFlowNets are aiding drug discovery and reasoning in large language models!**Like, sub...

www.youtube.com

5

17

64

Edward Hu

@edwardjhu

1 year

🤔Should you few-shot prompt or fine-tune an LLM when you have limited training data? 💡Here is a better option! 🆕Our paper uses amortized inference to extract hard-to-access knowledge from LLMs and boost data efficiency for reasoning. More in 🧵👇

1

17

63

Edward Hu

@edwardjhu

4 years

Finally an ♾-width limit that describes *practical* NNs we use. My great pleasure working on this project. More to come - stay tuned!

Greg Yang

@TheGregYang

4 years

1/ Existing theories of neural networks (NN) like NTK don't learn features so can't explain success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrained such an ∞-width word2vec model: it learned semantics!

4

57

384

2

3

44

Edward Hu

@edwardjhu

5 months

How can we keep scaling if compute grows exponentially but data does not?🤔 Find the answer in my guest lecture at Stanford CS25 this afternoon on our ICLR2024 Honorable Mention paper "Amortizing intractable inference in large language models" 🙌

CS25: Tranformers United!

Disussing the latest breakthroughs with Transformers in diverse domains

web.stanford.edu

3

41

Edward Hu

@edwardjhu

2 years

Can AI explain data with complex latent structures, like graphs? Classic EM algo fits only simple latent variable models like Gaussian mixture & HMM. Our GFlowNet-EM uses a big NN to do the hard work & explains data with complex compositional latents. 👉

4

5

36

Edward Hu

@edwardjhu

3 years

Wide Neural Net = NTK = Kernel Machine? We beg to differ in my latest blogpost on how a new parametrization enables feature learning even in the infinite-width limit. Thanks Yoshua Bengio and @ilyasut for cosigning our work! Paper:

Microsoft Research

@MSFTResearch

3 years

The infinite-width limit is key to overparametrized neural networks, yet existing theories don’t allow feature learning in practical networks. Learn how @TheGregYang & @EdwardJHu use the Maximal Update Parameterization to unlock feature learning: #ICML2021

2

29

108

3

4

30

Edward Hu

@edwardjhu

4 years

Img rotation/translation are natural but huge wrt Lp dist. Thus we consider adversarial attack in Wasserstein dist, after @RICEric22 .But under current threat model, defenses can hilariously be broken by dimming the img. We devote to a fix &stronger attacks

1

4

23

Edward Hu

@edwardjhu

1 year

Really flattered that this cited 2 of my creations! (LoRA & μTransfer) - The OSS progress is awesome, but can we turn demos into great products? The moat might be invisible: infra, feedback - Next breakthrough in robustness and reasoning 👉 potential moat

0

4

22

Edward Hu

@edwardjhu

2 years

Are there fundamental limits to models like #ChatGPT ?🤔 How can we leverage scaling to fix them? In this blog post, Yoshua and I discuss long-term research directions that complement the magic of large neural networks.🚀

Scaling in the service of reasoning & model-based ML - Yoshua Bengio

Co-written with my PhD student Edward J. Hu. Scaling seems to work really well. We must be cautious in pursuing research directions that build knowledge…

yoshuabengio.org

1

3

20

Edward Hu

@edwardjhu

3 years

In collab with @OpenAI , we used μTransfer to tune GPT-3 6.7B with only 7% of its pretraining compute by tuning a 40M-param model instead. Even better — as long as we underfit, wider is always better with the same HPs. Try it w/ your model!

1

2

20

Edward Hu

@edwardjhu

1 year

Excited about sharing my story!

Cohere For AI

@CohereForAI

1 year

Join our Beginners in Research-Driven Studies (BIRDS) Group on Oct. 5th as they welcome @edwardjhu to present "Insights into LoRA, μTransfer, and the Art of Reasoning, Plus Valuable Advice for Beginners and More." Learn more:

1

8

35

2

1

18

Edward Hu

@edwardjhu

3 years

We describe μTransfer in more detail and dive into its origin in deep learning theory in a blog post I co-wrote with @TheGregYang and @JianfengGao0217 🙌.

µTransfer: A technique for hyperparameter tuning of enormous neural networks - Microsoft Research

Great scientific achievements cannot be made by trial and error alone. Every launch in the space program is underpinned by centuries of fundamental research in aerodynamics, propulsion, and celestial...

www.microsoft.com

1

2

16

Edward Hu

@edwardjhu

1 year

I get to share this today thanks to my amazing coauthors @TheGregYang @ibab_ml @sidorszymon David Farhi, Nick Ryder, @merettm @AllenLao @WeizhuChen @JianfengGao0217 . Special thanks to @isafulf @motiwari2 and @karpathy for feedback on my first video!🙏

0

13

Edward Hu

@edwardjhu

3 years

We learn the updates to attn weight matrices using pairs of rank decomposition matrices with rank r. Turns out r=4 suffices even when the full rank is 12288! We verify LoRA on GPT-2/3 and outperform baselines including adapter and prefix tuning. Code:

1

14

Edward Hu

@edwardjhu

4 years

We are honored to receive this award for our paper! Kudos to my awesome co-authors @hadisalmanX @adith387 @TheGregYang . Please tune in for our talk: and poster session: !

Adith Swaminathan, Edward Hu, Greg Yang, Hadi Salman · Improved Wasserstein Attacks and Defenses ·...

Professional Conference Recording

slideslive.com

Nicolas Papernot

@NicolasPapernot

4 years

Improved Wasserstein Attacks and Defenses by J. Edward Hu (Microsoft Research AI); Greg Yang (Microsoft Research AI); Adith Swaminathan (Microsoft Research); Hadi Salman (Microsoft Research)

1

2

27

2

10

Edward Hu

@edwardjhu

1 year

Wanna hear the story behind LoRA and muTransfer, how they democratized generative AI, and new research on reasoning capabilities? Come join us next Monday 7/17 @ 1pm PT for a virtual fireside chat with @apbhatnagar from @pearvc !

From LLMs to Reasoning: The Next Frontier in AI

gatsby.events

0

1

10

Edward Hu

@edwardjhu

3 years

As always, kudos to my amazing mentor @TheGregYang , wonderful folks at OpenAI @ibab_ml , @sidorszymon , David Farhi, Nick Ryder, @merettm , and our awesome team at MSFT @AllenLao , @WeizhuChen , @JianfengGao0217 !

0

9

Edward Hu

@edwardjhu

1 year

- Great example of better reasoning through separating knowledge and inference in LLMs - System 2 reasoning = giving structure to System 1 computation (here it's MCTS) - LLMs are versatile knowledge model: just ask how confident / helpful an action is!

Reasoning with Language Model is Planning with World Model

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still...

arxiv.org

0

1

9

Edward Hu

@edwardjhu

4 years

Code is available at: . This is a collaboration with my friends and mentors @HadiSalmanX @adith387 @thegregyang . Special thanks to @tonyduan_ @akapoor_av8r @RICEric22 !

GitHub - edwardjhu/improved_wasserstein: Code for our ICLR Trustworthy ML 2020 workshop paper...

Code for our ICLR Trustworthy ML 2020 workshop paper "Improved Image Wasserstein Attacks and Defenses" - edwardjhu/improved_wasserstein

github.com

0

3

9

Edward Hu

@edwardjhu

2 years

LoRA is now officially supported in HuggingFace!🤗 so excited to see it being used for diffusion models😍

1

2

9

Edward Hu

@edwardjhu

1 year

Excited to see muP being extended to infinite depth as well!

Greg Yang

@TheGregYang

1 year

Nontrivial ∞width neural nets are either kernel machines or feature learners. Latter's scaling makes optimal hyperparams invariant to width What if depth→∞as well? 🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth! But GPT flawed 🧵

162

338

2K

0

1

8

Edward Hu

@edwardjhu

3 years

Big shout out to my amazing collaborators who made this happen: Yelong Shen, Phil Wallis, @ZeyuanAllenZhu , Yuanzhi Li, Shean Wang, and @WeizhuChen .💪🎉

2

0

7

Edward Hu

@edwardjhu

3 years

We also show how GFlowNets can estimate free energies and be driven by outcomes. We believe this will unlock new capabilities for DL. Joint work with Yoshua Bengio, @TristanDeleu , Salem Lahlou, @mo_tiwari , and @folinoid . Figures are from

0

7

Edward Hu

@edwardjhu

4 years

Clamping to [0, 1] is natural for Lp PGD attacks, but can produce unsound perturbations for Wasserstein attacks. We modify the projection algorithm to eliminate the need for clamping, which allows us to explore much stronger attacks and adversarially train against them.

1

6

Edward Hu

@edwardjhu

3 years

Why does such a low rank work for adaptation? We train LoRA modules with r=64 using 2 seeds -- the few top singular directions they learn largely overlap. Adaptation also amplifies snglr directions already important but not emphasized in the original weights. More in our paper.

1

0

6

Edward Hu

@edwardjhu

1 year

👩‍💻The code to reproduce our experiments: Super grateful to my collaborators: @JainMoksh @EricElmoznino @you_kad @g_lajoie_ Yoshua Bengio, and Kolya Malkin!

GitHub - GFNOrg/gfn-lm-tuning

Contribute to GFNOrg/gfn-lm-tuning development by creating an account on GitHub.

github.com

0

2

6

Edward Hu

@edwardjhu

1 year

1

0

5

Edward Hu

@edwardjhu

3 years

Many thanks to @JianfengGao0217 , Jade Huang, Jiayuan Huang, @XiangLisaLi2 , @AllenLao , Yabin Liu, @ben_vandurme , Luis Vargas, Haoran Wei, @npew , and @TheGregYang who gave us valuable feedback on the draft!🙏

0

5

Edward Hu

@edwardjhu

1 year

Super lucky to have you as my mentor while you were there! 🙏💯

Greg Yang

@TheGregYang

1 year

What an incredible journey it's been over these 5+ yrs @MSFTResearch . I still remember the eureka moments, in the serenity of building 99 past midnight, leading to Tensor Programs & μP. Forever grateful for MSR taking a chance on a kid straight out of undergrad.

4

227

1

0

5

Edward Hu

@edwardjhu

3 years

We release all the LoRA checkpoints from our result as well as the scripts to reproduce them. The new result will be included in a revision of our pre-print soon. Check out the current draft here to learn about LoRA:

LoRA: Low-Rank Adaptation of Large Language Models

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full...

arxiv.org

1

0

4

Edward Hu

@edwardjhu

1 year

🛠️We use GFlowNets, diversity-seeking RL algorithms, to fine-tune LLMs to match the reward distribution. Reward fn➕NN➕compute➡️sampler of the (intractable) reward distribution! 🥂 Trivial example: we train an LLM to generate random numbers when the base model and PPO fail.

1

3

Edward Hu

@edwardjhu

3 years

Big shout-out to my great collaborator and mentor @TheGregYang , and Salem Lahlou and @megha_byte who provided helpful feedback on our post!

0

3

Edward Hu

@edwardjhu

1 year

🚧 Consider tasks such as infilling or finding good chains of thought (CoT). It’s easy to evaluate if a sample is good, but sampling from the (potentially multimodal) posterior distribution? Computational nightmare🙅‍♀️

1

0

1

Edward Hu

@edwardjhu

2 years

4 / One day, AI can learn to explain the world with even more complex compositional structures, e.g., graphs of causal relationships, like humans do. This will hopefully make AI much more robust in unseen situations because the underlying explanations often stay invariant.

2

0

2

Edward Hu

@edwardjhu

3 years

Shout-out to Lu Wang, Phil Wallis, and Yelong Shen for spearheading the NLU and NLG experiments!🚀 Once again thanks all who gave us valuable feedback on LoRA. @JianfengGao0217 @XiangLisaLi2 @AllenLao @ben_vandurme @npew @TheGregYang @SebastienBubeck @ZeyuanAllenZhu @WeizhuChen

0

2

Edward Hu

@edwardjhu

1 year

✅Solution: Amortized inference allows us to train neural networks to approximate those tricky distributions! Basically, we teach the model to “think” by sampling and aggregating CoT from the Bayesian posterior p(z|x, y) How is it done?😃

1

0

2

Edward Hu

@edwardjhu

1 year

In addition, we can better infill short stories and extract more diverse and more likely next sentences from an LLM!

1

2

Edward Hu

@edwardjhu

1 year

🧠LLMs compress knowledge via next-token prediction. Therefore, tractable inference over this knowledge is limited to generating from start to end 📖➡️ What's the problem?🤷‍♀️

1

0

1

Edward Hu

@edwardjhu

1 year

📈 How is this useful? By learning to sample from the posterior over CoT, our method beats few-shot learning and PPO on simple arithmetic while generalizing better to harder questions! 🤖