Long overdue, but here's a new blog post on training LLMs in the wilderness from the ground up.
In this blog post, I discuss:
1. Experiences procuring compute & the variance across different compute providers. Our biggest finding/surprise is that variance is super high and it's
New open source Flan-UL2 20B checkpoints :)
- Truly open source. No forms! Apache license.
- Best open-source model on MMLU/Big-Bench Hard.
- Better than Flan-T5 XXL & competitive with Flan-PaLM 62B.
- The size ceiling of the Flan family just got higher!
Blog:
It's been a short 6 months since I left Google Brain, and it has been a uniquely challenging yet interesting experience to build everything from the ground up in an entirely new environment (i.e., the wilderness).
Today, we're excited to announce the first version of the
We are excited to announce the first version of our multimodal assistant, Yasa-1, a language assistant with visual and auditory sensors that can take actions via code execution.
Yasa-1 can understand text, images, videos, sounds & more!
Check out more details below.
Hot take: Lots of buzz these days about new open-source foundation models, but what if I told you there have been no real advances since 2019's T5 models?
Take a look at this table from this new InstructEval paper: . Some thoughts/observations:
1.
Over the past 3.3 years at Google, I have been blessed with so many wonderful friendships and experiences.
I have grown so much. However, itโs time to move on to a new adventure!
I wrote a blog post about my wonderful experience here:
"Scaling laws vs Model Architectures" from
@GoogleAI
.
Lessons:
- Not all architectures scale the same way.
- The vanilla Transformer does pretty well.
- Touching the attention too much is "dangerous".
- Performance at base size may not translate to large+ scales.
pdf:
It's been a wild ride. Just 20 of us burning through thousands of H100s over the past months, and we're glad to finally share this with the world!
One of the goals we've had when starting Reka was to build cool innovative models at the frontier. Reaching GPT-4/Opus level was a
Meet Reka Core, our best and most capable multimodal language model yet.
It's been a busy few months training this model and we are glad to finally ship it!
Core has a lot of capabilities, and one of them is understanding video --- let's see what Core thinks of the 3 body
We're coming out of stealth with $58M in funding to build generative models and advance AI research at
@RekaAILabs
Language models and their multimodal counterparts are already ubiquitous and massively impactful everywhere.
That said, we are still at the beginning of this
Inspired by the dizzying number of efficient Transformers ("x-formers") models that are coming out lately, we wrote a survey paper to organize all this information. Check it out at .
Joint work with
@m__dehghani
@dara_bahri
and
@metzlerd
.
@GoogleAI
Excited to share our latest work at
@GoogleAI
on "Transformer Memory as a Differentiable Search Index"!
TL;DR? We parameterize a search system with only a single Transformer model. Everything in the corpus is encoded in the model!
Paper:
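For readers who haven't seen DSI, here's a tiny illustrative sketch (my own toy framing, not the paper's code) of what "everything in the corpus is encoded in the model" means: one seq2seq Transformer is trained on both indexing examples (document text -> docid) and retrieval examples (query -> docid), so retrieval at inference time is just decoding a docid string, typically constrained to valid docids.

```python
# Toy illustration of DSI-style supervision (hypothetical docids and text).
corpus = {
    "doc-017": "the eiffel tower is a wrought-iron tower in paris",
    "doc-042": "python is a programming language created by guido van rossum",
}
labeled_queries = [
    ("where is the eiffel tower", "doc-017"),
    ("who created python", "doc-042"),
]

indexing_examples = [(text, docid) for docid, text in corpus.items()]      # memorize the corpus
retrieval_examples = [(query, docid) for query, docid in labeled_queries]  # map queries to docids

for src, tgt in indexing_examples + retrieval_examples:
    print(f"input: {src!r}  ->  target: {tgt!r}")
```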
not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you feed it, you'll still be lagging behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens
Decided to start a new blog series about model architectures in the era of LLMs.
Here's part 1 on broader architectures like Transformer Encoders/Encoder-Decoders, PrefixLM and denoising objectives.
A frequently asked question: "The people who worked on language and NLP
So many misconceptions about architectures (esp encoder-decoder vs decoder) partially due to nomenclature being confusing.
- EncDec, PrefixLMs, Causal Dec-onlys are *all* autoregressive. Even T5/UL2's objective is autoregressive.
- All 3 archs are not that different. People
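To make the "not that different" point concrete, here's a minimal sketch (illustrative, not any particular codebase) of the attention masks involved: a causal decoder-only model uses a lower-triangular mask, a PrefixLM additionally allows bidirectional attention over the input prefix, and an encoder-decoder factors the same pattern into two stacks.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Position i may attend to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    # Bidirectional attention over the prefix (the "inputs"),
    # causal attention over the remaining positions (the "targets").
    mask = causal_mask(n)
    mask[:, :prefix_len] = True
    return mask

print(causal_mask(6).astype(int))
print(prefix_lm_mask(6, prefix_len=3).astype(int))
```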
A survey of LLMs with a practical guide and evolutionary tree.
Number of LLMs from Meta = 7
Number of open source LLMs from Meta = 7
The architecture nomenclature for LLMs is somewhat confusing and unfortunate.
What's called "encoder only" actually has an encoder and a decoder
"We offer no explanation as to why these
architectures seem to work; we attribute their success, as all else, to divine benevolence." - This has got to be my favorite line written in a paper ever (from ).
We are excited to share Reka Flash, a new state-of-the-art 21B multimodal model that rivals Gemini Pro and GPT-3.5 on key language & vision benchmarks.
We've trained this model from scratch and ground zero with a small (but amazingly capable) team and relatively finite
Introducing Reka Flash, our efficient and highly capable multimodal language model.
Try it at the Reka playground for free today.
Thread, blog & links below
New paper from
@RekaAILabs
(yes, an actual paper).
This time we're releasing part of our internal evals, which we call Vibe-Eval. This comprises a hard set which, imo, is pretty challenging for frontier models today.
The fun part here is that we constructed it by trying to
Efficient 64k context window!
A new long-range transformer that uses the UL2 objective for long-context few-shot!
Glad to advise on the UL2 training in this work.
CoLT5: Faster Long-Range Transformers with Conditional Computation
Achieves:
- stronger performance than LongT5 with much faster training and inference
- SOTA on the SCROLLS benchmark
- strong gains up to 64k input length
Happy to share our latest paper from
@GoogleAI
,
@DeepMind
"ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning" where we scaled multi-task learning to 107 NLP tasks!
link :
Excited to share our new work from
@GoogleAI
and
@DeepMind
. "Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (paper: )
My wife was planning our upcoming trip to Tokyo and was having a hard time with transport planning.
Jokingly, I asked her to ask ChatGPT to do it for us.
Within 1 min, ChatGPT found us an optimal & efficient route that we never came across while doing research.
Mind blown.
Introducing U-PaLM 540B!
@GoogleAI
Training PaLM with UL2's mixture-of-denoisers with only 0.1% more compute unlocks:
- Much better scaling
- Emergent abilities on BIG-Bench
- Saving 2x compute (4.4 million TPU hours!)
- New prompting ability
link:
When comparing two models, a common reference point of compute is often used.
If you trained a 7B model with 3x the number of tokens/compute to beat a 13B model, did you really beat it? Probably not.
Here's a paper we wrote in 2021 () that I still
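For intuition, a quick back-of-the-envelope using the common training-compute approximation C ≈ 6·N·D (N = parameters, D = tokens); the numbers below are made up for illustration:

```python
def train_flops(params: float, tokens: float) -> float:
    # Widely used rule of thumb for dense Transformer training compute.
    return 6 * params * tokens

small = train_flops(7e9, 3 * 1e12)   # hypothetical 7B model trained on 3x the tokens
big = train_flops(13e9, 1e12)        # hypothetical 13B model trained on 1x the tokens
print(f"7B on 3T tokens : {small:.2e} FLOPs")
print(f"13B on 1T tokens: {big:.2e} FLOPs")
print(f"ratio: {small / big:.2f}x")  # the 'smaller' model actually spent ~1.6x the compute
```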
For many years I have eagerly camped at the "gates of arxiv" to check out cool new stuff/papers. But this has somehow stopped being the case.
- If there's a paper important enough to read, it will somehow "intrusively" appear in my face on twitter anyway.
- AI research
Bard announcement!
We are working hard to bring the best large language models to the world!
Stoked and excited to be part of this, i.e., the Bard team.
1/ In 2021, we shared next-gen language + conversation capabilities powered by our Language Model for Dialogue Applications (LaMDA). Coming soon: Bard, a new experimental conversational
#GoogleAI
service powered by LaMDA.
Our
@RekaAILabs
Tech Report / Paper is out!
Tech reports with no information at all are kinda boring, so we're revealing some interesting information on how we train our series of Reka models, including tokens, architecture, data & human evaluation workflows.
We tried
The False Promise of Imitating Proprietary LLMs
Open-sourced LLMs are adept at mimicking ChatGPT's style but not its factuality. There exists a substantial capabilities gap, which requires better base LMs.
Working idea, but I've noticed a bunch of archetypes of AI researchers & engineers in my career. Here are some of them:
1. Carry: Hero-level person capable of doing unprecedented things (alone or in a small group). Either in terms of modeling, infra or making impact in general. Very
Community: Evals for LLMs are broken! Academic benchmarks are not representative of real-world performance! We need better evals!
Also the same community: Let's make definitive rankings & leaderboards based on just four zero-shot "LM harness" tasks!
Not wanting to single
Just a few years ago, research was mostly sorted by "applications". When folks asked what research you were working on, you were expected to say something like "oh I work in question answering" or "sentiment analysis" or something. In fact, all the conference tracks were sorted as
Happy to share that we have updated and published v2 of our "efficient transformer" survey!
Major updates:
- Expanded our scope to sparse models and added a ton of new models!
- Wrote a retrospective post about the advances in the past year.
Link:
New blog post about my favorite language/NLP AI papers of 2022.
10 awesome best papers + 22 interesting papers to read. Check it out!
Hope it's helpful for everyone!
Singaporean AI researchers are kind of rare. Over the years, many people have asked me if I know other Singaporean AI researchers working on the bleeding edge.
Here's a thread introducing some of these Singaporean AI researchers who I know of that are doing amazing work!
1)
New UL2 model/paper from
@GoogleAI
!
"Unifying Language Learning Paradigms"
- SOTA on 50-ish diverse NLP tasks
- Outperforms GPT-3 175B on 0-shot SuperGLUE
- 3x performance vs T5 XXL (+LM) on 1-shot XSUM
- Open code & 20B Flax checkpoints.
Paper:
Excited to share that we have released 170+ pretrained transformer checkpoints of many different shapes & sizes as part of our
#ICLR2022
paper on "Scaling Transformers Efficiently" ๐.
Checkpoints:
Paper:
There are two major things that have happened in my life since a year ago. The first is that I am now a co-founder of a globally distributed LLM startup. The second is that I had a baby recently. Here's a typical day of my life:
[3:00pm] Wake up officially.
[3:30pm] Stagger to desk and
Hey Singapore government, if you're interested in LLMs, instead of "Le Model", we can build you "Model La".
Just 200 million dollars will do, and it would be light years ahead of anything you'll be able to train by yourselves.
New PaLM API launched!
Feels amazing to be able to share your work with the world!
Glad to have contributed to this massive team effort by helping to lead architecture and objective improvements for this latest generation of PaLM!
Excited about PaLM API: an easy and safe way for developers to build on top of our language models, and MakerSuite, a tool to jumpstart prototyping - both in private preview today.
@googlecloud
customers can also access these models + more via Vertex AI.
i've gotten so many "how do u keep up with research" type of questions over the years. The answer is simple. You don't; you just sign up for a twitter account and let the algorithm do the work for you.
if you don't see the paper on twitter, maybe it's for a good reason.
Agreed. There's so many opportunities in AI now. It's a pretty suboptimal career choice to do a PhD at the moment.
Also, many outstanding AI researchers and hard carry engineers that I know of don't have an AI or CS PhD.
As PhD applications season draws closer, I have an alternative suggestion for people starting their careers in artificial intelligence/machine learning:
Don't Do A PhD in Machine Learning
(or, at least, not right now)
1/4
Glad to see Google crushing it! I've always maintained that Google is the best at LLMs & AI. It seems like this is not even their best model yet. Congrats to all my friends for this well-deserved victory.
Over the past year, I've heard so many bad and grossly wrong takes
Breaking News from Arena
Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to
@Google
for the remarkable achievement!
The race is heating up like never before! Super excited to see what's next for Bard + Gemini
We're beyond excited to share the first release of LegalBench, a collaboratively constructed open-source benchmark for evaluating legal reasoning in English large language models.
friendly reminder to everyone that there isn't yet a good & proper systematic blind eval/benchmark of LLMs, especially on real-world data/use-cases.
if i were in academia this is something i'd work on immediately.
Sharing a piece of work I contributed to while at
@GoogleAI
:
* a new improved mC4 corpus (29T char tokens and 107 languages) that gets language sampling right with UniMax sampling.
* open source pretrained uMT5 models trained on 1T tokens.
* UniMax sampling solves some
UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining
Proposes a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages
Releases:
- an improved mC4 consisting of
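For intuition, here's a hedged sketch of how I read the UniMax budgeting idea (a water-filling style allocation; my own illustration under assumed parameter names, not the paper's reference code): spread the total character budget as evenly as possible across languages, but never exceed a fixed number of epochs over any one language's corpus.

```python
def unimax_budgets(corpus_chars: dict, total_budget: float, max_epochs: int = 4) -> dict:
    """Assign a per-language character budget (sketch of a water-filling allocation)."""
    budgets = {}
    remaining = total_budget
    langs = sorted(corpus_chars, key=corpus_chars.get)  # smallest corpora first
    for i, lang in enumerate(langs):
        fair_share = remaining / (len(langs) - i)        # uniform split of what's left
        cap = max_epochs * corpus_chars[lang]            # don't over-repeat tail languages
        budgets[lang] = min(fair_share, cap)
        remaining -= budgets[lang]
    return budgets

# Hypothetical corpus sizes in characters: the tail language is capped at 4 epochs,
# and the leftover budget is shared by the head languages.
print(unimax_budgets({"sw": 1e9, "de": 200e9, "en": 3000e9}, total_budget=600e9))
```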
research is an immensely taxing endeavour. hours spent doing IC work, debugging and what not. a paper is a canvas for researchers to express themselves after all the hard work, at the end of the day.
it's my art. at least let me paint the way i want to paint. The reason why i am
In retrospect, UL2 was a wild paper. "Yeah so this just sort of cooked, turns out it's more optimal than Chinchilla, idk check it out". This almost farcically casual tone makes me suspect that Yi was thinking in detail about founding Reka at the moment.
Don't retrieve, recite!
Introducing Recitation-Augmented Language models "RECITE" from
@GoogleAI
by
@EdwardSun0909
.
RECITE is really powerful at knowledge-intensive NLP tasks with its new recite-answer paradigm.
Check it out here:
1/N
It's been slightly more than a year since the UL2 paper () was released.
Hereโs a summary thread of some notable models/research papers that use the UL2 objective for training (aside from the original UL2/Flan-UL2 of course).
Thread below
#1
-
Sharing "The Benchmark Lottery" from
@GoogleAI
&
@DeepMind
.
In this meta-paper (), we examine the challenges of ML benchmarking (e.g., model comparisons) and how it affects long-term progress. 1/
New paper alert! ๐ "Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers"
We study scaling laws of Transformers pertaining to both upstream & downstream transfer by pretraining over 200+ T5 models.
Paper:
@GoogleAI
@DeepMind
here's me serving as living proof & a live specimen that one can be 8,400 miles away from SFO, in a completely non-overlapping time zone, and still be right in the middle of all the action.
The notion that "if you do AI, you have to be in San Francisco" is narcissistic BS. Easily >90% of the people who are pushing AI forward aren't located in SF. In fact it's likely >95%.
Check out our recent
@GoogleAI
"HyperPrompt" paper at
#ICML2022
.
TL;DR: Hypernetwork-learned task prompts outperform prompt tuning, adapters & hyperformer.
pdf:
Work led by
@YunHeHe17
&
@HuaixiuZheng
at Google Brain.
UL2 20B is a language model that trains on multiple objectives and can perform both language modeling & infilling.
Nice stuff:
1. Public checkpoints
2. Great at both fine-tuning & few-shot!
3. Works well with chain-of-thought reasoning.
Paper:
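As a concrete illustration of what "both language modeling & infilling" looks like at the data level, here's a toy sketch using T5-style sentinel tokens (my own example; the actual UL2 mixture also varies span lengths and corruption rates and prepends paradigm tokens):

```python
text = "flan ul2 is trained with a mixture of denoising objectives".split()

# 1) Language-modeling / S-denoising style: predict the continuation of a prefix.
print("LM input :", " ".join(text[:6]))
print("LM target:", " ".join(text[6:]))

# 2) Infilling / span-corruption style: mask spans, predict them behind sentinels.
inputs = text[:2] + ["<extra_id_0>"] + text[4:8] + ["<extra_id_1>"]
targets = ["<extra_id_0>"] + text[2:4] + ["<extra_id_1>"] + text[8:]
print("Infill input :", " ".join(inputs))
print("Infill target:", " ".join(targets))
```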
Introducing UL2, a novel language pre-training paradigm that improves performance of language models across datasets and setups by using a mixture of training objectives, each with different configurations. Read more and grab model checkpoints at
There are not many human beings in the entire world who have as much holistic full-stack LLM experience as this man here. Tons of wisdom in these slides by
@hwchung27
.
I gave a talk at Seoul National University.
I titled the talk "Large Language Models (in 2023)". This was an ambitious attempt to summarize our exploding field.
Video:
Slides:
Trying to summarize the field forced me to think
I've been looking more closely into the evaluation based on human preferences in the draft Open Assistant (OA) paper, and I'm finding it's actually a really interesting case study in how tricky evaluation is...
Even though
@_jasonwei
left Brain for OpenAI, did you know that Brain still has
@JerryWeiAI
, Jason's exceptionally talented brother!
Research ability clearly runs in the family!
Check out this amazing thought-provoking paper led by Jerry Wei:
Recently, I went on my first podcast hosted by
@swyx
.
It was a fun, unfiltered 2-hour-long conversation. Could have gone on longer, but we got chased out of the studio..
Talked about a lot of stuff, i.e., reminiscing old stuff at
@Google
and newer stuff at
@RekaAILabs
.
pod: The Yolo Researcher Metagame with
@YiTayML
!
OpenAI (ca. GPT4): ~600 people
Google Gemini: ~950 coauthors
@RekaAILabs
: 20 people
@sama
once speculated on the qualities of "10,000x AI researchers", and more recently
@_jasonwei
described the "Yolo
"Then comes FLANv2 โ very important, I may have read it more than ten times and suggest just memorizing the entire content". Wow.
This is yet another great blogpost by
@Francis_YAO_
that is ultra-meta but insightful/useful.
More personal thoughts:
- Yes, +100 to "FLAN is
Bard knows Flan-UL2! Another model that was released just a few weeks ago.
It's fresh and up to date!
I can also tell it's conditioning on the blog post I wrote. It's also accurate.
I am super excited and honoured to join forces with
@artetxem
(along with other amazing people).
It's gonna be an amazing and incredible journey!
๐ข Life update ๐ข
After 2.5 wonderful years, I recently left FAIR to start a new adventure. It has been a privilege to be part of such an amazing team; I have learned a lot and had so much fun!
I am super excited about what is coming next, and I hope to share more details soon!
Pretty cool idea!
Great to see Flan-T5 (despite being the smallest model here) hold its ground pretty well. It even outperforms other LMs like Dolly or StableLM.
Another noteworthy point is that at "compute-match", Flan-T5 3B is equivalent to the cost of a 1.5B
Evaluating LLMs is notoriously difficult, and academic benchmarks may fail.
Inspired by chess and MOBA games, we are taking a new approach by calculating Elo ratings of models with crowdsourced battle data.
- Blog:
- Leaderboard:
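For readers unfamiliar with Elo, here's a minimal sketch of the rating update this kind of leaderboard is built on (standard Elo with K=32; not necessarily the exact implementation used for the arena):

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    # Expected score of A given the current rating gap, then a K-factor update.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["model_a", "model_a", "model_b"]:  # crowdsourced battle outcomes
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], winner == "model_a")
print(ratings)
```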
Happy to share this new work on Generative Retrieval for Recommender Systems in collaboration with YouTube!
This paper draws inspiration from our Differentiable Search Index (DSI) paper, which pioneered the generative retrieval paradigm for document retrieval.
Wow this is a great technical lecture by
@hwchung27
.
Really glad someone finally dived deep into that encoder-decoder / decoder discussion!
I think not many people understand the intricacies of this topic, and these days many people don't even know what "input" and
I gave a lecture at
@Stanford
CS 25.
Lecture video:
AI is moving so fast that it's hard to keep up. Instead of spending all our energy catching up with the latest development, we should study the change itself.
First step is to identify and understand
Introducing ViT-22B!
ViT-22B is the largest dense Vision Transformer ever trained.
It's time for vision to catch up to language in the scaling game!
I am excited to find out what other emergent abilities can be found by scaling up vision models!
1/ There is a huge headroom for improving capabilities of our vision models and given the lessons we've learned from LLMs, scaling is a promising bet. We are introducing ViT-22B, the largest vision backbone reported to date:
Wow! Not cool, Meta!
Gotta admit the first thing I looked for in the llama-2 paper was
@GuillaumeLample
on the author list.
PS: Google always retained authors on the papers even after they left the company. Don't be evil!
When you think you found a witty rebuttal to a popular paper, only to find out that your ideas have already been scooped by the original paper itself.
one step ahead bro.
Disclaimer: I have not read this fancy "mirage" paper in detail but here's an excerpt from the original
Are Emergent Abilities of Large Language Models a Mirage?
Presents an alternative explanation for emergent abilities: one can choose a metric which leads to the inference of an emergent ability or another metric which does not.
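A tiny numeric illustration of that argument (mine, not from the paper): if per-token accuracy improves smoothly with scale, an exact-match metric over a k-token answer behaves roughly like p^k and can look like a sudden, "emergent" jump.

```python
k = 10  # length of the target answer in tokens (illustrative)
for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    # Smooth per-token accuracy p vs. the all-or-nothing exact-match metric ~ p**k.
    print(f"per-token acc {p:.2f} -> exact-match (k={k}) ~ {p**k:.3f}")
```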
Happy New Year / New Year's Eve! (depending on where you are)
Here's a thread of me reflecting on 2022 and all the research I've contributed to this year and stuff I did.
Quite a long chronological thread below.
My first blog post on "emergence, scaling and inductive bias"!
This is a medley discussion piece of some of our recent works on Emergence, U-PaLM, scaling laws vs models, CoT vs inverse scaling and more!
Check it out:
As a companion to our recent efficient Transformer survey, we designed "Long Range Arena" a new challenging benchmark to help understand and analyze trade-offs between recent efficient Transformer models. Check out our paper at .
@GoogleAI
@DeepMind
I really enjoyed my casual Saturday morning coffee chat with
@hwchung27
. Tons of technical wisdom and fun.
With respect to life, he basically said "extreme experiences are way more valuable even if they are hard", just like how training on only easy examples doesn't produce
Someone did a vibe check comparison of GPT-4, Claude-3, Gemini Advanced, Mistral Large and Reka Flash.
I think Reka Flash did pretty well for a 21B model.
Damn, this is such a big deal!
MHA vs MQA has always been hotly debated, and I've always felt this was the right "de-risked" way to go about it.
Congrats to
@michielsdj
on the great & impactful work!
New paper! Multi-query attention trades quality for speed and requires training a new model. Instead, uptrain an improved MQ variant from an existing multi-head model!
Work with Joshua Ainslie, James Lee-Thorp,
@_theopompus
, Federico Lebron, Sumit Sanghai.
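For context, here's a hedged sketch of the conversion step as I understand it: the per-head key/value projections of an existing multi-head checkpoint are pooled into a single shared head (mean pooling is one natural choice), and the model is then "uptrained" for a small fraction of the original pretraining steps. Shapes and names below are illustrative.

```python
import numpy as np

n_heads, d_model, d_head = 8, 512, 64
W_k = np.random.randn(n_heads, d_model, d_head)  # per-head key projections of an MHA checkpoint
W_v = np.random.randn(n_heads, d_model, d_head)  # per-head value projections

# Multi-query initialization: collapse all key/value heads into one shared head.
W_k_mq = W_k.mean(axis=0)  # (d_model, d_head)
W_v_mq = W_v.mean(axis=0)
print(W_k_mq.shape, W_v_mq.shape)
# Query heads stay per-head; training then resumes ("uptraining") to recover quality.
```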
Now this is the type of excellent work that the community needs more of!
Everyone cranking out minute-of-fame "model distillation" papers should take a look at this fine exemplar of good science below:
great work
@EdwardSun0909
Move over Alpaca, IBM just changed the game for open-source LLMs.
Dromedary, their instruction-tuned LLaMA model, beats Alpaca in performance *without* distilling ChatGPT and *without* human feedback! How do they do it?
(1/4)
most solid architecture is the "Noam" architecture. stop calling it a llama or whatever. this is the Noam transformer. (you can call it PaLM architecture too!)
Just realized today that I have almost the same number of twitter followers and citations, but I missed the moment I had the exact balance. This should have been a few hours ago. Dang!
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it's the first step in our Gemini era of models. Gemini is optimized in three sizes - Ultra, Pro, and Nano.
Gemini Ultra's performance exceeds current state-of-the-art results on
In the spirit of being very meta here. Here's my personal meta-review of all the leaderboard-ing methodologies.
1. I like the elo ranking based on chatbot arena from
@lmsysorg
2. LM harness (e.g., zero-shot PIQA, Hellaswag etc) is the equivalent of "MNIST" for LLMs. Okay-ish
so many problems i don't know where to begin.
- yea, put sparse and dense models in the same plot with the # params. good job.
- i'm sure you know the size of palm-2 and gpt-4.
- fwiw, t5 is still one of the best LM models out there. it started way earlier than 2021.
-
Proliferation of LLMs. Some highlights:
1.
@Google
started early in 2021 with LaMDA and FLAN
2. Now
@Google
,
@OpenAI
, and Chinese players are actively competing on the top half of chart
3. The bottom of chart is dominated by the open source community, with impressive output speed
PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge
paper page:
probe ChatGPT for its conversational understanding and introduce a conversational framework (protocol) that can be adopted in future studies. The Pokémon universe serves as an
ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks
ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frames detection, w/ 20x less cost.
Why aren't any recent LLMs (OPT, PaLM, etc.) using "efficient" architectures (Reformer, Longformer, etc.)?
There's 20+ of them, and they've been around since 2020! Are they actually *not* more efficient?
Interesting paper from my ex-colleagues at
@GoogleAI
led by
@vqctran
. Generative retrieval (i.e., DSI) is one of the most fun works I've worked on (and pioneered) during my Google career.
Also,
@vqctran
is driving a lot of the agenda that we worked on together back then. He has
How Does Generative Retrieval Scale to Millions of Passages?
Finds that the use of synthetic queries as a document representation strategy is the only approach that remained effective as they scaled up the corpus size using MS MARCO passages.