I gave a lecture at
@Stanford
CS 25.
Lecture video:
AI is moving so fast that it's hard to keep up. Instead of spending all our energy catching up with the latest development, we should study the change itself.
The first step is to identify and understand
I gave a talk at Seoul National University.
I titled the talk “Large Language Models (in 2023)”. This was an ambitious attempt to summarize our exploding field.
Video:
Slides:
Trying to summarize the field forced me to think
Here is my talk at
@MIT
(after some delay😅)
I made this talk last year when I was thinking about a paradigm shift. This delayed posting is timely as we just released o1, which I believe is a new paradigm.
It's a good time to zoom out for high level thinking.
(1/11)
I gave an invited lecture on Instruction finetuning and RLHF for
@hhexiy
's class at NYU.
One unique perspective of my lecture is that I introduce RLHF as an instance of using a learned objective function.
Video:
Slides:
We are hiring in the ChatGPT team! Happy to chat about this position. DMs are open.
Instead of your papers, I’d love to learn about the most difficult technical problem you worked on and your lessons. It doesn’t have to be ML.
I value exceptional technical skill a lot more than
Our team at OpenAI is hiring! We're looking for engineers/researchers who do rigorous and thoughtful work understanding and evaluating LLMs like ChatGPT.
If you're interested, please apply online and DM me with work that you've done!
An interesting confounding factor in comparing these models is that training details really matter.
For Flan-T5, resetting the Adafactor optimizer states during instruction finetuning was the biggest factor. It increased MMLU by almost double digits, from 43 to 52. This was
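For intuition on what "resetting the optimizer states" means, here is a toy sketch: an Adafactor-like optimizer that keeps a running second-moment estimate per parameter, and a reset that discards the statistics accumulated during pretraining. Everything here (the class, the numbers, the update rule) is an illustration, not the Flan training code.

```python
class ToyAdafactor:
    """Toy optimizer with Adafactor-style running second-moment statistics."""

    def __init__(self, decay=0.8):
        self.decay = decay
        self.second_moment = {}  # per-parameter running statistics

    def step(self, grads, params, lr=0.1):
        for name, g in grads.items():
            v = self.second_moment.get(name, 0.0)
            # exponential moving average of squared gradients
            v = self.decay * v + (1 - self.decay) * g * g
            self.second_moment[name] = v
            params[name] -= lr * g / (v ** 0.5 + 1e-8)

    def reset_state(self):
        # discard statistics accumulated during "pretraining" so that
        # "finetuning" starts with fresh optimizer state
        self.second_moment = {}


params = {"w": 1.0}
opt = ToyAdafactor()
opt.step({"w": 0.5}, params)   # accumulates state during "pretraining"
assert "w" in opt.second_moment
opt.reset_state()              # reset before "finetuning"
assert opt.second_moment == {}
```

The point is only that "resetting" means throwing away the accumulated statistics, not changing the weights themselves.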
Hot take 🔥: Lots of buzz these days about new foundation open-source models but what if I told you there have been no real advances since 2019's T5 models 😀
Take a look at this table from this new InstructEval paper: . Some thoughts/observations:
1.
Many visionaries talk about the future. But talking to
@sama
is another level. It feels like he is already in 2030 and talking back at me.
Then thinking about the future becomes an "interpolation" between where I am and where he is, as opposed to an extrapolation into the wild
Can't think
Research code that doesn’t make readers feel dumb is great. Too often, code is written to showcase the author's advanced knowledge of the language/framework, which overwhelms the reader.
Researchers come to the code with other thoughts/hypotheses in mind. The mental bandwidth is
New paper + models!
We extend instruction finetuning by
1. scaling to 540B model
2. scaling to 1.8K finetuning tasks
3. finetuning on chain-of-thought (CoT) data
With these, our Flan-PaLM model achieves a new SoTA of 75.2% on MMLU.
New open-source language model from Google AI: Flan-T5 🍮
Flan-T5 is instruction-finetuned on 1,800+ language tasks, leading to dramatically improved prompting and multi-step reasoning abilities.
Public models:
Paper:
I've been experimenting with Test Driven Development with GPT-4.
I first write test cases to formalize the desired behavior, then ask GPT-4 to write a function and suggest additional tests if needed.
I've found this method more efficient than writing the function first and then
2020 at Google
Michelle (manager): Do you want to be a mentor for an incoming resident?
Me: Hmm not sure if I am qualified
Michelle: Yes you are
Me: Ok I will try
A month later
mentee: Hi I am Jason Wei. I just joined
2023 me: "This year I am especially thankful for Michelle"
This year I am especially thankful for
@hwchung27
, who has been my closest collaborator for more than a year now.
I have many good things to say about Hyung Won, but to me his most salient trait is original thinking. I would describe his thinking style as highly logical, based
Random life hack to fall asleep quickly.
Pick a recursion or backtracking problem that is non-trivial. Run through test cases in your head.
This will quickly max out your working memory. Your brain will beg to be shut off. And you will fall asleep.
TLDR; make your brain OOM
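For a concrete problem to trace, a classic backtracking candidate is subset-sum. The sketch below is just one example of the kind of recursion to run through mentally; tracing the include/exclude branches on a few inputs saturates working memory fast.

```python
def subset_sum(nums, target):
    """Return True if some subset of nums sums exactly to target."""
    if target == 0:
        return True
    if not nums:
        return False
    head, rest = nums[0], nums[1:]
    # branch 1: include head; branch 2: exclude it
    return subset_sum(rest, target - head) or subset_sum(rest, target)


assert subset_sum([3, 34, 4, 12, 5, 2], 9)       # 4 + 5
assert not subset_sum([3, 34, 4, 12, 5, 2], 30)  # no subset works
```

Try tracing the second call in your head, branch by branch. Good night.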
I see many people self-impose imaginary rules that hold them back from achieving more.
When I first started working on deep learning, I admired Noam’s work on scaling and wanted his advice. But I imposed an imaginary rule: “I have to be good enough not to waste his time”. So I
Machine unlearning is important but human unlearning is equally so, especially for LLM researchers.
Without a strong theoretical framework to guide us, LLM researchers heavily rely on intuitions formed from empirical observations.
The emergent abilities of LLMs, however, mean
Many find this crazy but I use a single screen workflow. WHY? Fingers🖐️ are faster than head/eyes 👀.
You move your head and/or eyes to switch between monitors. With keyboard shortcuts + multiple desktops, I always stare at the same thing and only my fingers move. Much faster!
A counterintuitive implication of scale: trying to solve a more general version of the problem is an easier way to solve the original problem than directly tackling it.
Attempting a more general problem encourages you to come up with a more general and simpler approach. This
Not having a strong ego is pretty useful.
- I don't fear becoming a beginner again.
- In fact, I like being below-average in the room as my rate of learning is likely above-average.
- I am fine with working on ideas that I didn't come up with. I just want to work on the most
@tszzl
I don’t think this is specific to AI. People have a tendency to underestimate changes in the future despite having witnessed substantial changes in the past
Happy to release:
1. upgraded mT5 checkpoints:
2. refreshed mC4, a multilingual pre-training dataset:
The new mC4 covers CommonCrawls in 101 languages up to Aug. 2022
3. And a new ICLR paper:
A Korean character is formed by combining consonants and vowels in various ways. So one way to corrupt a character is to add an unnecessary consonant (e.g. ㅅ). The resulting combination is so unnatural to Koreans that they can automatically undo this change.
This is
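The combination rule above has a direct Unicode counterpart: precomposed Hangul syllables are laid out so that a syllable's code point is 0xAC00 + (lead × 21 + vowel) × 28 + tail, with 19 lead consonants, 21 vowels, and 28 tail codes (0 meaning no tail). A small sketch of the corruption; the example characters '바'/'밧' and the helper names are my own illustration.

```python
HANGUL_BASE = 0xAC00  # code point of '가', the first precomposed syllable


def decompose(ch):
    """Split a Hangul syllable into (lead, vowel, tail) jamo indices."""
    code = ord(ch) - HANGUL_BASE
    lead, rest = divmod(code, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return lead, vowel, tail


def add_tail(ch, tail_index):
    """Corrupt a tail-less syllable by appending an unnecessary consonant."""
    lead, vowel, tail = decompose(ch)
    assert tail == 0, "syllable already has a tail consonant"
    return chr(HANGUL_BASE + (lead * 21 + vowel) * 28 + tail_index)


# e.g. adding the consonant ㅅ (tail index 19) to '바' yields '밧'
assert add_tail("바", 19) == "밧"
```

Undoing the corruption is just dropping the tail index back to 0, which is exactly the operation Korean readers perform automatically.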
The biggest surprise from working on the Flan project was how good Flan-T5-XXL was for its size.
However, this model was less accessible because it required some knowledge of model parallelism.
Happy to see tutorials like this, which make the XXL model more accessible!
🚨Attention
#NLP
enthusiasts!
We just published a new blog post on how to fine-tune FLAN-T5-XXL using DeepSpeed & Hugging Face Transformers! 🚀
👉
We ran a series of experiments to help you choose the right hardware setup.🤖💻
Last week marked 1 year at OpenAI. Reflecting back, I think the most unique aspect of OpenAI is the importance of mission, which seems to be less emphasized elsewhere. To be honest, I didn’t realize this either when I first joined. Now I believe mission is critical because:
1)
For research, it is more important to deeply understand the basics and have the right perspective than to dive into fancy ideas.
In this lecture, Jason shares how he thinks about language models. I find it so unique and insightful that I sneaked into Stanford to listen 🥷
It was an honor to give a guest lecture yesterday at Stanford’s CS330 class, "Deep Multi-Task and Meta-Learning"!
I discussed a few very simple intuitions for how I personally think about large language models.
Slides:
Here are the six intuitions:
(1)
Being brutally honest with oneself is difficult, especially when it requires facing harsh reality. Here is how I strive for self-honesty. I observe myself as if I were a ghost floating above, and in doing so, I replace the subject "I" with "this monkey". For example
Inner
towards intelligence too cheap to meter:
15 cents per million input tokens, 60 cents per million output tokens, MMLU of 82%, and fast.
most importantly, we think people will really, really like using the new model.
An extended version of the 3min video from last week!
If you're interested in
@OpenAI
's research but weren't sure how it feels to work here, this is the closest thing.
It shares what researchers value (e.g. challenges involved in scaling), what they
Recently the level of stress has been creeping up quite a bit. In general, I love scaling beyond measure but this is not the thing I want to scale. So I did some introspection.
As I am working in a field that is advancing exponentially, the range of outcomes is getting larger
When working on intellectually challenging problems, I often notice that I have subconsciously closed my eyes. It's as if my mental capacity is reaching its limit, and my brain is desperately freeing up cognitive resources by eliminating unrelated signals like visual stimuli.
“Is what I am working on irrelevant?” has been one of the most useful questions for my career.
Being extremely honest in answering that requires courage but it increases the chance of working on the right thing, which matters more than how good I am
And I ask this very often
Excited to present a new ICLR paper from Google Research and DeepMind:
Our key contributions:
- New insights on creating more parameter-efficient and transferable models via embedding decoupling
- RemBERT, which outperforms XLM-R and mT5-Large
Learning, if defined from first principles, shouldn't need to assume that a student is of a particular type (human, monkeys, machines).
I believe that machines are now capable enough that the education for humans and machines is converging!
“Don’t teach, incentivize” is a great concept that applies to both machines and humans. Huge credit to
@hwchung27
for being able to convey a lot of wisdom in very few words.
Working in the field, positive surprises are pretty rare. But this one surprised me. Wow.
Having a hard time thinking about the implications of text-to-video when it improves 100x from this point 🤯
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
Prompt: “Beautiful, snowy
One reason why encoder-decoder/decoder-only is still confusing is that not many people need to implement Transformers from scratch these days.
There is a level of understanding that can only be achieved by struggling to implement something from scratch. I highly recommend this!
Wow this is a great technical lecture by
@hwchung27
. 😄
Really glad someone finally dived deep into that encoder-decoder / decoder discussion! 😄
I think not many people understand the intricacies of this topic, and these days many people don't even know what "input" and
Compute + data + transformer doesn’t automatically lead to a good model.
It needs people who have systematically suffered through debugging these models at various scales.
@YiTayML
has suffered enough. So I expect some good models.
We’re coming out of stealth with $58M in funding to build generative models and advance AI research at
@RekaAILabs
🔥🚀
Language models and their multimodal counterparts are already ubiquitous and massively impactful everywhere.
That said, we are still at the beginning of this
In 2013, I took a class taught by Prof Strang. At the time he had been teaching at MIT for 52 years.
He continued teaching for another 10 years. Yesterday he gave his final lecture
He taught me how to like Linear Algebra and how invaluable teaching is.
“how many r’s in strawberry?”
I had to ask this to demo our new model o1-preview 😎
LLMs process text at a subword level. A question that requires understanding the notion of both character and word confuses them.
OpenAI o1-preview "thinks harder" to avoid mistakes.
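A few lines of Python make the contrast concrete: at the character level the question is trivial, while a subword view has to reason across token boundaries. The token split below is only a stand-in; real tokenizers segment differently.

```python
word = "strawberry"
# At the character level, the question is one library call:
assert word.count("r") == 3

# A crude stand-in for a subword tokenization (illustrative pieces only;
# the actual split depends on the tokenizer):
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == word

# Counting characters now requires reasoning across token boundaries,
# which is roughly the position an LLM is in.
assert sum(t.count("r") for t in tokens) == 3
```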
Long term thinking is surprisingly rare even in places like Silicon Valley. One of the causes is that even if you are working towards long term impact, the day-to-day often feels incremental and mundane.
I find it really useful to practice zooming out of the incremental progress
People don’t like to repeat because it doesn’t feel like making progress. But repetition is necessary for deeper understanding. E.g.
- Re-reading books
- Repeating the thought process of understanding a new concept
Unfortunate side effect of over-reliance on quantitative metrics
Jason walked into the classroom without anything (no laptop, no notes) and gave a lecture out of memory.
I felt so glad that I refused to also give a blackboard lecture.
As a kid I loved whiteboard lectures way more than slides, so for Stanford’s CS25 class I gave a whiteboard lecture!
My goal was to simply and clearly explain why language models work so well, purely via intuitions.
Youtube video: (w/
@hwchung27
)
“[9:45 am] Recite OpenAI charter. Pray to optimization Gods. Learn the Bitter Lesson”
This has it all. Think about AGI, drop the “scientist ego” and seek divine benevolence.
This is AI research at its core
My typical day as a Member of Technical Staff at OpenAI:
[9:00am] Wake up
[9:30am] Commute to Mission SF via Waymo. Grab avocado toast from Tartine
[9:45 am] Recite OpenAI charter. Pray to optimization Gods. Learn the Bitter Lesson
[10:00am] Meetings (Google Meet). Discuss how to
Finally found time to read this blog post.
For researchers, fellow researchers are like customers. Learning that my research affected other researchers in such a positive way is the best customer feedback.
This made my day!
A lot of AI research has shifted from “building” models to “using” models. Creativity and curiosity play much bigger roles in this new era.
Not sure about creativity but you can complement curiosity to some extent. Think what your curious friend would have done in a given
A saturated benchmark gives a false impression that the underlying progress is slowing down.
Benchmarks are proxies for what we care about, which is often hard to measure. When they are saturated, they are useless and even misleading.
A good model satisfies users’ prompts. A great model changes the types of prompts by expanding what is possible.
Benchmarks like LMSYS provide good insight but they can't measure the latter. We should at least be aware of it. Otherwise, we incentivize incremental progress
I am very excited that the MedPaLM paper is now published in
@Nature
It is a great way to invite the broader scientific community to LLMs. I feel like LLMs are more adopted in the general public than in the scientific community. There are just so many
I'd like to clarify a few points on this slide from my previous talk to avoid potential confusion ()
1) As cited in the slides, this function is adapted from Noam's multiquery paper, which I highly recommend. This is the best resource to learn about
In an empirical research field such as deep learning, willingness to discard one’s own hard work is crucial. Try out a bunch of approaches, ruthlessly prioritize and trim less promising directions.
But in practice it is hard every time. Good bye my dear code 😞
I strive not to be too organized because doing so misses a lot of deep lessons that tend to compound in the long run.
I sometimes work on things that don’t generate output for some time. From a highly organized person’s perspective, I am not being “productive” and this is a
Compression begets clarity.
- Kill 90% of Chrome tabs at the end of each day. What do you leave open?
- Summarize the entire field of LLMs into a 50-min talk.
- What is one foundational principle behind every major AI breakthrough?
- If you could recommend only one book, what
Just like some books have an audiobook version, I'd love to see an LLM version of a book.
Books represent a unidirectional mode of knowledge transfer. With "LLMs for books"—maybe achieved through fine-tuning or in-context learning—the knowledge transfer could become
I'm getting used to AI surpassing me in more areas, much like how I trust Google Maps over my own sense of direction.
Even two years ago, it was so easy to look at the model generation and grade it myself. Now it is quite difficult for some domains (e.g. GPQA eval). Such a humbling experience.
Such an honor to have an opportunity to work together and learn from these researchers!
This video doesn't show all the great people who worked on this project. Please check out
We have reached an agreement in principle for Sam Altman to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo.
We are collaborating to figure out the details. Thank you so much for your patience through this.
Flan2 paper is now on JMLR, 1.5 years after the initial arXiv release. It already feels quite dated, reflecting how fast the field is moving.
That said, the Flan-T5 series is still going strong, with an astonishing 52M cumulative downloads 🤯
How are people using these models?
I will be on the panel for the Instruction following workshop at
#NeurIPS2023
. Of course I will interpret everything from a scaling perspective 😎
Today 10:45-11:30am, Room 220-222
Great talk by
@hwchung27
. I really like the interesting analogies here (he's great at that!).
My favourite one is "no amount of bananas will incentivize monkeys to do mathematical reasoning" 🤣
Many people learn about the tools just enough to get the job done. I prefer to dive deeper; understanding my tools in detail makes my work much more fun.
Not sure if it's good or bad. Just more fun! Perhaps that’s what truly matters in the end.
I titled the talk “Don’t teach. Incentivize”.
We can’t enumerate every single skill we want from an AGI system because there are just too many of them.
In my view, the only feasible way is to incentivize the model such that general skills emerge.
(3/11)
Additional benefits of pair programming:
1. When I think deeply about a problem, in my head I make logical jumps and stitch thoughts together in an incoherent manner. I am very generous to myself when it comes to such logical flaws. Often the implication of this is uncovered
Pair programming isn’t standard at most companies and basically non-existent in academia, but I’ve been doing it with
@hwchung27
for almost a year now. While it naively seems slower than coding individually, I’ve realized that there are many benefits:
(1) In AI, what you work on can
Many people fear reading because if they fail to understand what they are reading, it doesn't feel good and can even hurt their ego. If this unpleasant experience happens repeatedly, they avoid reading, as it becomes associated with negative rewards.
Take "Googling" as an
Here is how AI can revolutionize education.
1. AI estimates the capability of a student (human, AI, etc)
2. It consistently provides materials, say, 0.1% beyond the current capability. Consistency is the key; learning compounds exponentially.
3. Scale to all
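A toy model of why step 2 compounds, assuming the student simply absorbs whatever is served. The 0.1% margin and the growth model are illustrative, not a claim about real learners.

```python
def train(capability, margin=0.001, steps=10_000):
    """Serve material a fixed fraction beyond current capability, repeatedly."""
    for _ in range(steps):
        material = capability * (1 + margin)  # slightly beyond current level
        capability = material                  # student absorbs it fully
    return capability


# Consistent small steps compound exponentially:
# (1.001)^10000 is roughly a 21,000x gain over the starting capability.
final = train(1.0)
assert final > 20_000
```

The key assumption is the *consistency*: the same 0.1% applied every single step. Miss steps, or serve material too far beyond the student, and the exponent shrinks.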
Can’t fathom the impact
Today I am pleased to announce the new board of directors for my relationship.
The new board of directors will be:
1. My mom
2. My girlfriend’s sister
3.
@hwchung27
, who I pair program with frequently
4. Bret Taylor (we’ve only met once, but every board should have Bret Taylor)
I intentionally did not fix the broken Copilot for a few days because that makes me more grateful for what I take for granted.
Remove what I use all the time, and when I later put it back, I realize how great that thing has been.
@_jasonwei
i invite you to drop cursor and use
One time I was pair programming with
@hwchung27
, and his github co-pilot extension was broken so he was manually typing every word. What an awful experience, it was like watching my granddad typing on apple notes on his iphone 7
Nice reminder for how quickly we have used AI to
A few times I found myself questioning my own judgment when disagreeing with GPT-4. This reminded me of Google Maps; I began to trust its guidance more than my instincts once it crossed a certain threshold.
GPT-4 is poised to usher in significant shifts in our perception of AI
One of the most important aspects of Flan is its generality. This paper extends that further; instruction finetuning benefits "single-task" finetuning as well.
You can further finetune Flan-T5 on your custom tasks and that is likely better than finetuning T5!
✨New Paper✨What’s the best completely public competitor to
#ChatGPT
?
Flan-T5 beats all public models we tested:
Flan-T5 3B ▶️ T0++ 3B ▶️ OPT-IML 175B ▶️ GLM-130B ▶️ Flan 2021 3B ▶️ NIv2 3B
We release the
@GoogleAI
🌟Flan Collection🌟data + methods for Instruction Tuning!
1/
With macOS, I use multiple desktops each with a shortcut
- option-1 to get to desktop 1 for project 1
- option-5 to get email/calendar
I also use
@apptivateapp
to set
- option-t for iTerm (Vim ftw!)
- option-s for Slack
- option-c for Chrome
As the field matures, it becomes rarer to build something from scratch. So the difficulties associated with such an endeavor are often overlooked.
Huge congrats to
@YiTayML
and the team for achieving this milestone so quickly!
We are excited to share Reka Flash ✨, a new state-of-the-art 21B multimodal model that rivals Gemini Pro and GPT 3.5 on key language & vision benchmarks 📈.
We've trained this model from scratch and ground zero with a small (but amazingly capable 🧙♂️) team and relatively finite
Manually examining the data and model output is a great way to deeply understand the problem.
It is like lubricating the brain. It reduces the friction in thinking within the domain; I can think faster and make deeper reasoning steps.
This could mean the difference between
One pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the
In Dragon Ball, there is “Room of spirit and time”. You train one year inside the room and it is only a day outside. The multiplier is 365.
For machines it is a lot higher. So a strong generalist with more compute is often better at special domains than specialists.
(10/11)
I hope this lecture sparks interest in high level thinking, which will be useful in building better perspectives. This in turn will lead to finding more impactful problems to solve.
Thanks
@hjterrysuh
and MIT EI Seminar for hosting me!
(11/11)
Great to see such detailed descriptions of challenges training large models from scratch. Such knowledge is extremely valuable and scarce. Hope more people share their unique experience!
Long overdue but here's a new blogpost on training LLMs in the wilderness from the ground up 😄🧐
In this blog post, I discuss:
1. Experiences in procuring compute & variance in different compute providers. Our biggest finding/surprise is that variance is super high and it's
Thanks, this totally justifies my struggle getting the optimal amount of squiggles 😅
I am obviously biased but pretty much all learning-based AI should be understood with this plot as the unified perspective!
I really liked that this podcast is unfiltered. Feels raw but when a lot of videos on internet are highly polished, this raw feeling stands out to me.
As always Yi is very transparent about sharing his experience, which is really helpful for those who want to learn about AI
Recently, I went on my first podcast hosted by
@swyx
. 😄
It was a fun unfiltered 2 hour long conversation. Could have gone on longer but we got chased out of the studio.. 😅
Talked about a lot of stuff, e.g., reminiscing about old stuff at
@Google
and newer stuff at
@RekaAILabs
.
An analogy I used is extending the old saying:
"Give a man a fish, you feed him for a day. Teach him how to fish, you feed him for a lifetime."
I go one step further and solve this task with an incentive-based method:
"Teach him the taste of fish and make him hungry."
(6/11)
Try Code Interpreter! One use case for me is data visualization.
This figure took me 3+ hours to manually plot with matplotlib. It was a pain.
With Code Interpreter, I can probably get it done in 10 min.
But this is just a simple use case. I am excited to see how people
Code Interpreter will be available to all ChatGPT Plus users over the next week.
It lets ChatGPT run code, optionally with access to files you've uploaded. You can ask ChatGPT to analyze data, create charts, edit files, perform math, etc.
Plus users can opt in via settings.
We’re hosting an AMA for developers from 10–11 AM PT today. Reply to this thread with any questions and the OpenAI o1 team will answer as many as they can.
Congrats to
@YiTayML
and the reka team on this launch!
In the tech report i see this huge spike in the loss curve. Hope you did not lose much sleep when that happened
@YiTayML
Our
@RekaAILabs
Tech Report / Paper is out! 🔥
Tech reports with completely no information are kinda boring so we’re revealing some interesting information on how we train our series of Reka models including tokens, architecture, data & human evaluation workflows. 😃
We tried
New talk from
@hwchung27
about how to think "meta-level" in AI research. I have been impressed by Hyung Won's ability to identify new paradigms and totally give up any sunk cost. In late 2022 he realized the power of RL and has been preaching it ever since
A fun story: when
@YiTayML
Flan-UL2 is trained with prefix LM objective much more than the Flan-T5. The benefit might not be well-captured by the academic benchmarks (they don't require long-form generation) but the "model usability" of Flan-UL2 will probably be better
@zhansheng
It doesn't help for all cases. I have seen a few cases where this actually hurts mildly.
Here is my (very unscientific) intuition. Not resetting the states is good if you are finetuning on a task that is "similar" to pretraining. For example, SuperGLUE tasks have at least
When the petition started, the google doc exploded due to traffic. I felt pretty anxious not being able to sign. Being alone without peers in Korea certainly did not help. I’d say it was more of FOMO than "peer pressure" for me.
not to longpost, and I can only speak for myself, but this is a very inaccurate representation of the mood from an employee perspective
- “employees felt pressured” -> at some point hundreds of us were in a backyard learning about the petition. people were so upset at the
If you try to solve tens of tasks with as little effort as possible, then pattern-recognizing each task separately might be easiest
If you try to solve trillions of tasks, it might be easier to solve them by learning generalizable skills, e.g. language, reasoning, etc.
(5/11)
I believe that
1. the energy to desire is finite
2. the more you desire something, the higher the chance of achieving it
Corollary: ruthlessly reduce the number of desires in order to increase the chances of achieving what truly matters to you.
It’s been a short 6 months since I left Google Brain and it has been a uniquely challenging yet interesting experience to build everything from the ground up in an entirely new environment (e.g., the wilderness)
Today, we’re excited to announce the first version of the
You might think that it takes too long to teach via the incentive instead of direct teaching. That is true for humans, but for machines, we can give more compute to shorten the time.
In fact, I'd say this "slower" method allows us to put in more compute.
(8/11)
Leverage dilemma: if you are truly leveraged, you benefit greatly even if you don't work hard. But if you do work hard, the additional benefit will be so significant that it is too costly not to work hard
This blog explains pretraining objectives and Transformer architectures. Studying these old ideas tells us the long-term consequences of research decisions.
I believe such lessons are more important than knowing a lot of recent advances whose long-term consequences we don't know yet
Decided to start a new blog series about model architectures in the era of LLMs. 😀
Here's part 1 on broader architectures like Transformer Encoders/Encoder-Decoders, PrefixLM and denoising objectives. 😄
A frequently asked question: "The people who worked on language and NLP
Received an overwhelming number of DMs and emails, so the processing has been slow. We are going to read all of them today and this weekend. Thanks for your interest.