Edward Hu

@edwardjhu

3,473
Followers
36
Following
21
Media
88
Statuses

building something new | ex-OpenAI

Woodside, CA
Joined December 2019
@edwardjhu
Edward Hu
3 years
After two wonderful years at Microsoft, I’m happy to share with you that I’ll join Mila in Jan 2022 as a PhD student advised by Yoshua Bengio!🚀
11
4
547
@edwardjhu
Edward Hu
3 years
Better hyperparameters🚀= bigger bang for the buck💸! What if the next trillion-parameter model could be tuned by running a tiny one w millions of params instead? Our technique, μTransfer, enables that by aligning the optimal HPs across model sizes.
Tweet media one
3
53
398
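To make the "aligning optimal HPs across model sizes" idea above concrete, here is a deliberately simplified sketch (hypothetical function names, PyTorch): weight matrices get an Adam learning rate scaled by base_width/width, so a learning rate tuned on a narrow proxy can be reused at larger width. The full μP/μTransfer rules (initialization scales, special treatment of input/output layers) are in the paper and the mup package; this is only an illustration of the flavor.

```python
import torch
import torch.nn as nn

def mup_like_adam(model: nn.Module, base_width: int, width: int, lr: float):
    """Toy width-aware Adam grouping in the spirit of muP / muTransfer.

    Weight matrices get lr * base_width / width, so a learning rate tuned on a
    narrow proxy transfers (approximately) to a wider model. Real muP also
    prescribes init scales and treats input/output layers specially; this is
    only an illustration of the "align HPs across widths" idea.
    """
    matrix_like, other = [], []
    for p in model.parameters():
        (matrix_like if p.ndim == 2 else other).append(p)
    return torch.optim.Adam([
        {"params": matrix_like, "lr": lr * base_width / width},
        {"params": other, "lr": lr},  # biases etc. keep the base learning rate
    ])

def mlp(width: int) -> nn.Module:
    return nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, width))

# Hypothetical workflow: sweep lr on the width-128 proxy, reuse the winner at width 1024.
opt_proxy = mup_like_adam(mlp(128), base_width=128, width=128, lr=1e-3)
opt_big = mup_like_adam(mlp(1024), base_width=128, width=1024, lr=1e-3)  # same base lr
```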
@edwardjhu
Edward Hu
3 years
GPT-3 175B is powerful but too expensive to serve many finetuned copies. We use low-rank adaptation (LoRA) to learn task modules that are 10,000x smaller and can be swapped while the main model is frozen. No extra inference latency or quality drop! Paper:
Tweet media one
Tweet media two
6
73
369
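To illustrate the mechanism in the LoRA tweet above, here is a minimal, self-contained sketch (illustrative names, not the released implementation): the pretrained weight stays frozen, only a rank-r pair (A, B) is trained, and the update can be folded into the weight at deployment, which is why there is no extra inference latency. Swapping tasks then means swapping just the tiny (A, B) pair.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: y = x W^T + (alpha/r) * x A^T B^T.

    The pretrained weight W stays frozen; only the rank-r pair (A, B) is trained,
    which is what makes each task module thousands of times smaller than the model.
    """
    def __init__(self, in_features: int, out_features: int, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features),
                                   requires_grad=False)        # stands in for pretrained W
        nn.init.normal_(self.weight, std=0.02)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_features, r))        # trainable, zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self):
        """Fold B @ A into the frozen weight before serving: no extra inference latency."""
        self.weight += self.scale * self.B @ self.A
```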
@edwardjhu
Edward Hu
1 year
μTransfer allowed us to find better hyperparameters for GPT-3 on a single GPU. It has since been used by Cerebras and Google to train huge models. But how does it actually work? What is the enabling insight? Get your answers in 7 minutes👇
2
34
234
@edwardjhu
Edward Hu
3 years
Sampling from energy functions is fundamental to machine learning, but mostly done by expensive MCMC. GFlowNet is a deep learning way to amortize that cost and can produce diverse samples for RL, NLP, etc. Our latest work builds its theoretical foundation
Tweet media one
Tweet media two
3
37
212
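For readers unfamiliar with how a GFlowNet "amortizes" sampling from a reward or energy function, here is a sketch of one common training objective, trajectory balance (not necessarily the objective used in the paper above): the sampler is trained so that complete trajectories terminate at objects with probability proportional to their reward, replacing per-sample MCMC with a learned policy.

```python
import torch

def trajectory_balance_loss(log_Z: torch.Tensor,
                            log_pf: torch.Tensor,
                            log_pb: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Squared trajectory-balance residual for a batch of complete trajectories.

    log_Z:      learned scalar, the log partition function
    log_pf:     (batch, T) forward policy log-probs along each trajectory
    log_pb:     (batch, T) backward policy log-probs (zero-padded for short trajectories)
    log_reward: (batch,)   log R(x) of each terminal object

    Driving this residual to zero makes terminal objects x sampled with
    probability proportional to R(x), amortizing what MCMC would do per sample.
    """
    residual = log_Z + log_pf.sum(-1) - log_reward - log_pb.sum(-1)
    return (residual ** 2).mean()
```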
@edwardjhu
Edward Hu
3 years
In June we released LoRA which adapts NNs as big as GPT-3 with few parameters yet stays performant. Our new result beats finetuned RoBERTa on GLUE with 1/8 of total parameters! Try "pip install loralib" and add LoRA to your fav model in 10 lines of code!
Tweet media one
Tweet media two
2
13
88
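A rough sketch of the "pip install loralib and add LoRA in a few lines" workflow, assuming loralib's documented interface (lora.Linear, mark_only_lora_as_trainable, lora_state_dict); exact arguments may differ across versions, and in practice you would load pretrained weights into the swapped layer.

```python
import torch
import loralib as lora

# Swap an nn.Linear for lora.Linear wherever you want adaptation (here, rank 8).
layer = lora.Linear(768, 768, r=8)           # pretrained weight would be loaded here
model = torch.nn.Sequential(layer)

lora.mark_only_lora_as_trainable(model)      # freeze everything except the LoRA A/B
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# ...train on the downstream task...

# Persist only the tiny LoRA delta; the frozen base model is shared across tasks.
torch.save(lora.lora_state_dict(model), "task_A_lora.pt")
```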
@edwardjhu
Edward Hu
11 months
OpenAI is nothing without its people.
4
3
74
@edwardjhu
Edward Hu
1 month
proud to see what i worked on at OpenAI finally shipped! go 🐢!!
@sama
Sam Altman
1 month
here is o1, a series of our most capable and aligned models yet: o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.
Tweet media one
737
3K
18K
3
1
73
@edwardjhu
Edward Hu
9 months
Low-rank adaptation (LoRA) is one of the most popular methods for customizing large AI models.🤖 🤔What's the story behind its invention? 💡How did we come up with the idea? 🔧Should you use it? Get your answers in my new video🙌📽️
2
12
64
@edwardjhu
Edward Hu
7 months
🤨Should you care about GFlowNets? What are they anyway?🧐 Learn about how GFlowNets speed up drug discovery and help large language models reason better in my new video!🔬📚
5
17
64
@edwardjhu
Edward Hu
1 year
🤔Should you few-shot prompt or fine-tune an LLM when you have limited training data? 💡Here is a better option! 🆕Our paper uses amortized inference to extract hard-to-access knowledge from LLMs and boost data efficiency for reasoning. More in 🧵👇
Tweet media one
Tweet media two
1
17
63
@edwardjhu
Edward Hu
4 years
Finally an ♾-width limit that describes *practical* NNs we use. My great pleasure working on this project. More to come - stay tuned!
@TheGregYang
Greg Yang
4 years
1/ Existing theories of neural networks (NN) like NTK don't learn features so can't explain success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrained such an ∞-width word2vec model: it learned semantics!
Tweet media one
4
57
384
2
3
44
@edwardjhu
Edward Hu
5 months
How can we keep scaling if compute grows exponentially but data does not?🤔 Find the answer in my guest lecture at Stanford CS25 this afternoon on our ICLR2024 Honorable Mention paper "Amortizing intractable inference in large language models" 🙌
3
3
41
@edwardjhu
Edward Hu
2 years
Can AI explain data with complex latent structures, like graphs? Classic EM algo fits only simple latent variable models like Gaussian mixture & HMM. Our GFlowNet-EM uses a big NN to do the hard work & explains data with complex compositional latents. 👉
Tweet media one
4
5
36
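As a point of reference for the "classic EM" that the tweet contrasts against, here is a compact, illustrative EM loop for a diagonal-covariance Gaussian mixture. The exact E-step posterior is tractable only because the latent is a single categorical variable; lifting that restriction to complex compositional latents is what GFlowNet-EM targets.

```python
import numpy as np

def em_gmm(X, K, iters=50):
    """Classic EM for a K-component Gaussian mixture with diagonal covariances.
    The E-step computes the exact posterior over the latent component, which is
    only tractable because the latent is a single categorical variable."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    mu = X[rng.choice(n, K, replace=False)]          # component means
    var = np.ones((K, d))                            # per-dimension variances
    pi = np.full(K, 1.0 / K)                         # mixing weights
    for _ in range(iters):
        # E-step: responsibilities p(z = k | x) for every point
        log_p = (-0.5 * (((X[:, None] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        var = (resp.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
        pi = Nk / n
    return pi, mu, var
```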
@edwardjhu
Edward Hu
3 years
Wide Neural Net = NTK = Kernel Machine? We beg to differ in my latest blogpost on how a new parametrization enables feature learning even in the infinite-width limit. Thanks Yoshua Bengio and @ilyasut for cosigning our work! Paper:
Tweet media one
@MSFTResearch
Microsoft Research
3 years
The infinite-width limit is key to overparametrized neural networks, yet existing theories don’t allow feature learning in practical networks. Learn how @TheGregYang & @EdwardJHu use the Maximal Update Parameterization to unlock feature learning: #ICML2021
2
29
108
3
4
30
@edwardjhu
Edward Hu
4 years
Image rotation/translation are natural perturbations but huge w.r.t. Lp distance. Thus we consider adversarial attacks in Wasserstein distance, following @RICEric22 . But under the current threat model, defenses can hilariously be broken by simply dimming the image. We devise a fix & stronger attacks
Tweet media one
Tweet media two
1
4
23
@edwardjhu
Edward Hu
1 year
Really flattered that this cited 2 of my creations! (LoRA & μTransfer)
- The OSS progress is awesome, but can we turn demos into great products? The moat might be invisible: infra, feedback
- Next breakthrough in robustness and reasoning 👉 potential moat
0
4
22
@edwardjhu
Edward Hu
2 years
Are there fundamental limits to models like #ChatGPT ?🤔 How can we leverage scaling to fix them? In this blog post, Yoshua and I discuss long-term research directions that complement the magic of large neural networks.🚀
1
3
20
@edwardjhu
Edward Hu
3 years
In collab with @OpenAI , we used μTransfer to tune GPT-3 6.7B with only 7% of its pretraining compute by tuning a 40M-param model instead. Even better — as long as we underfit, wider is always better with the same HPs. Try it w/ your model!
Tweet media one
1
2
20
@edwardjhu
Edward Hu
1 year
Excited about sharing my story!
@CohereForAI
Cohere For AI
1 year
Join our Beginners in Research-Driven Studies (BIRDS) Group on Oct. 5th as they welcome @edwardjhu to present "Insights into LoRA, μTransfer, and the Art of Reasoning, Plus Valuable Advice for Beginners and More." Learn more:
Tweet media one
1
8
35
2
1
18
@edwardjhu
Edward Hu
1 year
I get to share this today thanks to my amazing coauthors @TheGregYang @ibab_ml @sidorszymon David Farhi, Nick Ryder, @merettm @AllenLao @WeizhuChen @JianfengGao0217 . Special thanks to @isafulf @motiwari2 and @karpathy for feedback on my first video!🙏
0
0
13
@edwardjhu
Edward Hu
3 years
We learn the updates to attn weight matrices using pairs of rank decomposition matrices with rank r. Turns out r=4 suffices even when the full rank is 12288! We verify LoRA on GPT-2/3 and outperform baselines including adapter and prefix tuning. Code:
Tweet media one
1
1
14
@edwardjhu
Edward Hu
4 years
We are honored to receive this award for our paper! Kudos to my awesome co-authors @hadisalmanX @adith387 @TheGregYang . Please tune in for our talk: and poster session: !
@NicolasPapernot
Nicolas Papernot
4 years
Improved Wasserstein Attacks and Defenses by J. Edward Hu (Microsoft Research AI); Greg Yang (Microsoft Research AI); Adith Swaminathan (Microsoft Research); Hadi Salman (Microsoft Research)
1
2
27
2
2
10
@edwardjhu
Edward Hu
1 year
Wanna hear the story behind LoRA and muTransfer, how they democratized generative AI, and new research on reasoning capabilities? Come join us next Monday 7/17 @ 1pm PT for a virtual fireside chat with @apbhatnagar from @pearvc !
0
1
10
@edwardjhu
Edward Hu
3 years
As always, kudos to my amazing mentor @TheGregYang , wonderful folks at OpenAI @ibab_ml , @sidorszymon , David Farhi, Nick Ryder, @merettm , and our awesome team at MSFT @AllenLao , @WeizhuChen , @JianfengGao0217 !
0
0
9
@edwardjhu
Edward Hu
1 year
- Great example of better reasoning through separating knowledge and inference in LLMs
- System 2 reasoning = giving structure to System 1 computation (here it's MCTS)
- LLMs are versatile knowledge models: just ask how confident / helpful an action is!
0
1
9
@edwardjhu
Edward Hu
2 years
LoRA is now officially supported in HuggingFace!🤗 so excited to see it being used for diffusion models😍
1
2
9
@edwardjhu
Edward Hu
1 year
Excited to see muP being extended to infinite depth as well!
@TheGregYang
Greg Yang
1 year
Nontrivial ∞-width neural nets are either kernel machines or feature learners. The latter's scaling makes optimal hyperparams invariant to width. What if depth→∞ as well? 🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth! But GPT flawed 🧵
162
338
2K
0
1
8
@edwardjhu
Edward Hu
3 years
Big shout out to my amazing collaborators who made this happen: Yelong Shen, Phil Wallis, @ZeyuanAllenZhu , Yuanzhi Li, Shean Wang, and @WeizhuChen .💪🎉
2
0
7
@edwardjhu
Edward Hu
3 years
We also show how GFlowNets can estimate free energies and be driven by outcomes. We believe this will unlock new capabilities for DL. Joint work with Yoshua Bengio, @TristanDeleu , Salem Lahlou, @mo_tiwari , and @folinoid . Figures are from
0
0
7
@edwardjhu
Edward Hu
4 years
Clamping to [0, 1] is natural for Lp PGD attacks, but can produce unsound perturbations for Wasserstein attacks. We modify the projection algorithm to eliminate the need for clamping, which allows us to explore much stronger attacks and adversarially train against them.
Tweet media one
Tweet media two
1
1
6
@edwardjhu
Edward Hu
3 years
Why does such a low rank work for adaptation? We train LoRA modules with r=64 using 2 seeds -- the few top singular directions they learn largely overlap. Adaptation also amplifies singular directions that are already important but not emphasized in the original weights. More in our paper.
Tweet media one
1
0
6
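One simple way to quantify the "overlapping top singular directions" observation (a sketch, not necessarily the paper's exact metric): compare the top-k left-singular subspaces of two LoRA updates trained from different seeds.

```python
import torch

def subspace_overlap(dW1: torch.Tensor, dW2: torch.Tensor, k: int = 8) -> float:
    """Overlap in [0, 1] between the top-k left-singular subspaces of two updates.

    1.0 means the two adapters learned the same top singular directions; two
    random k-dim subspaces in ambient dimension d give roughly k / d.
    """
    U1, _, _ = torch.linalg.svd(dW1, full_matrices=False)
    U2, _, _ = torch.linalg.svd(dW2, full_matrices=False)
    return (U1[:, :k].T @ U2[:, :k]).pow(2).sum().item() / k

# Hypothetical usage with two LoRA updates trained from different seeds:
# overlap = subspace_overlap(B1 @ A1, B2 @ A2, k=8)
```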
@edwardjhu
Edward Hu
1 year
👩‍💻The code to reproduce our experiments: Super grateful to my collaborators: @JainMoksh @EricElmoznino @you_kad @g_lajoie_ Yoshua Bengio, and Kolya Malkin!
0
2
6
@edwardjhu
Edward Hu
1 year
1
0
5
@edwardjhu
Edward Hu
3 years
Many thanks to @JianfengGao0217 , Jade Huang, Jiayuan Huang, @XiangLisaLi2 , @AllenLao , Yabin Liu, @ben_vandurme , Luis Vargas, Haoran Wei, @npew , and @TheGregYang who gave us valuable feedback on the draft!🙏
0
0
5
@edwardjhu
Edward Hu
1 year
Super lucky to have you as my mentor while you were there! 🙏💯
@TheGregYang
Greg Yang
1 year
What an incredible journey it's been over these 5+ yrs @MSFTResearch . I still remember the eureka moments, in the serenity of building 99 past midnight, leading to Tensor Programs & μP. Forever grateful for MSR taking a chance on a kid straight out of undergrad.
4
4
227
1
0
5
@edwardjhu
Edward Hu
3 years
We release all the LoRA checkpoints from our result as well as the scripts to reproduce them. The new result will be included in a revision of our pre-print soon. Check out the current draft here to learn about LoRA:
1
0
4
@edwardjhu
Edward Hu
1 year
🛠️We use GFlowNets, diversity-seeking RL algorithms, to fine-tune LLMs to match the reward distribution. Reward fn➕NN➕compute➡️sampler of the (intractable) reward distribution! 🥂 Trivial example: we train an LLM to generate random numbers when the base model and PPO fail.
Tweet media one
1
1
3
@edwardjhu
Edward Hu
3 years
Big shout-out to my great collaborator and mentor @TheGregYang , and Salem Lahlou and @megha_byte who provided helpful feedback on our post!
0
0
3
@edwardjhu
Edward Hu
1 year
🚧 Consider tasks such as infilling or finding good chains of thought (CoT). It’s easy to evaluate if a sample is good, but sampling from the (potentially multimodal) posterior distribution? Computational nightmare🙅‍♀️
1
0
1
@edwardjhu
Edward Hu
2 years
4 / One day, AI can learn to explain the world with even more complex compositional structures, e.g., graphs of causal relationships, like humans do. This will hopefully make AI much more robust in unseen situations because the underlying explanations often stay invariant.
2
0
2
@edwardjhu
Edward Hu
3 years
Shout-out to Lu Wang, Phil Wallis, and Yelong Shen for spearheading the NLU and NLG experiments!🚀 Once again thanks all who gave us valuable feedback on LoRA. @JianfengGao0217 @XiangLisaLi2 @AllenLao @ben_vandurme @npew @TheGregYang @SebastienBubeck @ZeyuanAllenZhu @WeizhuChen
0
0
2
@edwardjhu
Edward Hu
1 year
✅Solution: Amortized inference allows us to train neural networks to approximate those tricky distributions! Basically, we teach the model to “think” by sampling and aggregating CoT from the Bayesian posterior p(z|x, y) How is it done?😃
Tweet media one
1
0
2
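A rough sketch of the quantities involved, under my reading of this thread (hypothetical function names): the unnormalized posterior over a chain of thought z combines the frozen base LLM's prior and likelihood terms, and predictions can be aggregated over sampled chains. The actual training objective in the paper (GFlowNet fine-tuning, as in the trajectory-balance sketch earlier in this feed) is more involved.

```python
import torch

def log_posterior_reward(log_p_z_given_x: torch.Tensor,
                         log_p_y_given_xz: torch.Tensor) -> torch.Tensor:
    """Unnormalized log p(z | x, y) = log p(z | x) + log p(y | x, z) + const.
    Both terms are sequence-level log-likelihoods under the frozen base LLM;
    this is the reward an amortized sampler of chains of thought is trained to match."""
    return log_p_z_given_x + log_p_y_given_xz

def aggregate_over_cot(log_p_y_given_xz_samples: torch.Tensor) -> torch.Tensor:
    """Monte Carlo aggregation over sampled chains of thought z_1..z_n:
    log p(y | x) ≈ log (1/n) Σ_i p(y | x, z_i).
    Assumes z_i ~ p(z | x); samples from a trained posterior sampler would need
    reweighting or a different aggregation (e.g., voting over answers)."""
    n = log_p_y_given_xz_samples.shape[0]
    return torch.logsumexp(log_p_y_given_xz_samples, dim=0) - torch.log(
        torch.tensor(float(n))
    )
```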
@edwardjhu
Edward Hu
1 year
In addition, we can better infill short stories and extract more diverse and more likely next sentences from an LLM!
Tweet media one
Tweet media two
1
1
2
@edwardjhu
Edward Hu
1 year
🧠LLMs compress knowledge via next-token prediction. Therefore, tractable inference over this knowledge is limited to generating from start to end 📖➡️ What's the problem?🤷‍♀️
1
0
1
@edwardjhu
Edward Hu
1 year
📈 How is this useful? By learning to sample from the posterior over CoT, our method beats few-shot learning and PPO on simple arithmetic while generalizing better to harder questions! 🤖
Tweet media one
1
0
1
@edwardjhu
Edward Hu
3 years
@mattierialgirl Thanks! I've been on Slack for a while now :)
0
0
1
@edwardjhu
Edward Hu
4 years
@TheGregYang Learning from the best! 😊
0
0
1