Okay so I only discuss hot takes with Jason privately, but with this one I feel obligated to disagree publicly: almost no one cares about levels/seniority in Gemini. Much of the real work is done and the real decisions are made by ICs with experiment results & TensorBoards, not levels.
One liberating thing about OpenAI (and presumably other small companies) is that there are no expectations of project scope being tied to your level. What I mean is that an ambitious junior engineer could take on a big project and be judged purely on their execution, without any
Congrats to @996roma for defending her thesis, ascending to become the first ever @Brown_NLP PhD graduate! And as per Brown CS tradition, our lab presents her with a chicken that is our best representation of her!
Levels may be important at Google, but not in Gemini. Probably easiest to just consider Gemini as a separate company/culture from Google. This is the best time ever for ambitious and productive ICs to thrive.
My co-authors have already highlighted the technical side of T0, so I will just chime in on the human side of things behind the scenes. (But really there are no “scenes” since even our meeting videos are public.) 1/7
Yet, our goal was to train a model that generalizes well to many more new tasks, not just some variant of QA.
The core hypothesis is that massive multi-task training on a large and diverse set of prompted datasets would improve generalization to unseen tasks.
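To make the setup concrete, here is a minimal sketch of what multi-task prompted training data construction could look like; the tasks, examples, and templates below are hypothetical stand-ins, not the actual T0 mixture.

```python
# Minimal sketch of multi-task prompted training data construction.
# Dataset contents and template wordings are made-up placeholders.
import random

# Each task contributes raw examples plus several prompt templates.
tasks = {
    "sentiment": {
        "examples": [{"text": "great movie", "label": "positive"}],
        "templates": [
            lambda ex: (f"Is the following review positive or negative? {ex['text']}", ex["label"]),
            lambda ex: (f"{ex['text']} Did the reviewer like it?", ex["label"]),
        ],
    },
    "paraphrase": {
        "examples": [{"s1": "a cat sat", "s2": "a cat was sitting", "label": "yes"}],
        "templates": [
            lambda ex: (f"Do these mean the same thing? {ex['s1']} / {ex['s2']}", ex["label"]),
        ],
    },
}

def build_mixture(tasks, seed=0):
    """Cross every example with every template, then shuffle tasks together."""
    mixture = []
    for task in tasks.values():
        for ex in task["examples"]:
            for template in task["templates"]:
                source, target = template(ex)
                mixture.append({"input": source, "target": target})
    random.Random(seed).shuffle(mixture)
    return mixture

# A single text-to-text model is then trained on this mixture and evaluated
# with prompts from tasks that were held out of `tasks` entirely.
for pair in build_mixture(tasks):
    print(pair["input"], "->", pair["target"])
```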
In my first “research” meeting with @qinan_yu, we extensively discussed all the available evidence we had on how our PhD advisor @Brown_NLP met her husband (plus some history of LLMs). In just one short year, she has grown from my student/co-author/avant-garde friend to…
Case in point: the people who made our infra/models so good are not L8s/L9s (though some have been promoted to L8 now!). They are the people you hang out with at microkitchens and lunch and whose desks you swing by to ask questions every day, not people who are in meetings 9 to 5.
The teams working on model serving infrastructure at Google are really impressive. This is something I particularly enjoy about the Google 2.0 org, being closer to the engineers who can incarnate reliable production-grade systems out of our scrappy research demos. Building this
Also, you may be surprised that some L9s/L10s/VPs/SVPs/Google Fellows will spend hours with you every week looking at your code and TensorBoards and Colabs, even reviewing (like, actually reviewing, not stamping) your code and running experiments themselves!
I was wrong about something on the Internet! Embarrassing, but it’s only right that we wrote another paper to right the wrong. Turns out that LMs are smarter than I thought, and humans are weirder than I thought.
Last year, we criticized LMs for performing “too well” with pathological prompts, and many papers have now shown similar results with corrupted ICL or CoT. In our new work, we find that *humans* also perform surprisingly well with irrelevant prompts! (But not misleading ones.) 1/5
I thought a lot about how to word my tweets, but words cannot describe how grateful I am to @kelvin_guu. He is an exceptional combination of knowing the literature, getting real work done, and being a trustworthy mentor. 1/4
Which training examples taught my LLM to do that? 🤔 New from Google Research: Simfluence tracks how much "smarter" your model gets after consuming each example. It can then simulate scenarios like “What if I removed X dataset from my training corpus?” 🧵
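As a rough illustration of the trajectory-simulation idea (not Simfluence's actual model or API), here is a toy sketch: fit a per-example effect on the loss delta at each step, then replay a counterfactual run with one subset of examples removed.

```python
# Toy sketch of trajectory simulation: synthetic data, least-squares fit of
# per-example loss effects, then a counterfactual replay without "dataset X".
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_steps = 20, 200

# Hypothetical training log: which single example was consumed at each step,
# and the loss observed after that step.
true_effect = rng.normal(-0.008, 0.004, n_examples)  # per-example loss change
consumed = rng.integers(0, n_examples, n_steps)
loss = np.empty(n_steps + 1)
loss[0] = 2.0
for t in range(n_steps):
    loss[t + 1] = loss[t] + true_effect[consumed[t]] + rng.normal(0, 0.01)

# Fit per-example effects by least squares on the observed loss deltas.
X = np.zeros((n_steps, n_examples))
X[np.arange(n_steps), consumed] = 1.0
beta, *_ = np.linalg.lstsq(X, np.diff(loss), rcond=None)

# Counterfactual: "what if examples 0-9 (dataset X) were removed?"
keep = consumed >= 10
sim_loss = loss[0] + np.cumsum(np.where(keep, beta[consumed], 0.0))
print(f"actual final loss:   {loss[-1]:.3f}")
print(f"simulated without X: {sim_loss[-1]:.3f}")
```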
@tallinzen @AndrewLampinen (and I) are far from scaling maximalists, but his paper () is by far the most rigorous I know and shows that (1) the correctness of the few-shot demonstrations does matter and (2) the instructions matter way less.
Also, a thousand thanks to my co-first-authors @alyssamloo and @qinan_yu, who are undergraduate students with competence far exceeding mine when I was a junior PhD. Both will apply to grad school soon so please hire them!
first authoring her own paper on mech interp (), and now a PhD in her own right at @stanfordnlp!
Meanwhile, when I first taught @alyssamloo and graded her papers, I was astonished by how this first-year student was already producing PhD-level work, and immediately after class, I poached her into our lab; now, after her graduation, I poached her into Google DeepMind (Gemini pretraining team)!
It’s somewhat cliché that highly accomplished people sometimes say their kids are their highest accomplishment, but in research,
We’ve probably all had the experience of trying a thousand different prompt variations, maybe in a Colab with convoluted loops and string manipulations. Using @hen_str’s UI is much more sane, insightful, and refreshing!
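For reference, the "convoluted Colab loop" pattern being compared against might look something like this hypothetical sketch, with a stubbed-out model call standing in for a real LM:

```python
# Brute-force prompt sweep over template variations; everything here
# (instructions, separators, label words, the model stub) is hypothetical.
from itertools import product

instructions = ["Classify the sentiment:", "Is this review positive or negative?"]
separators = ["\n", " || "]
label_words = [("positive", "negative"), ("good", "bad")]

def query_model(prompt: str) -> str:
    """Stand-in for a real LM call (e.g., an API request)."""
    return "positive"  # stub

eval_set = [("I loved every minute of it.", "positive")]

results = {}
for inst, sep, (pos, neg) in product(instructions, separators, label_words):
    correct = 0
    for text, gold in eval_set:
        prompt = f"{inst}{sep}{text}{sep}Answer ({pos}/{neg}):"
        pred = query_model(prompt)
        # Map the template's label word back to the canonical label.
        correct += (gold == ("positive" if pred == pos else "negative"))
    results[(inst, sep, pos)] = correct / len(eval_set)

# Rank prompt variants by accuracy on the tiny eval set.
for config, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{acc:.2f}  {config}")
```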
🎉 NEW! Finding an LLM prompt for new tasks can be tough. PromptIDE allows users to experiment with prompt variations, visualize prompt performance, and iteratively optimize prompts -- to appear at @ieeevis 2022.
Demo, Arxiv, Code:
#NLProc ❤️ #visualization
This figure needs a second to digest, but it’s a really cool & useful finding: adding as little as 10% few-shot training improves 0-shot eval, while adding 10% 0-shot training improves few-shot too! A theme here is to go beyond multi-task training to multi-prompting-paradigm training.
Q: But why are the results strong?
Our breakdown of the Flan Collection shows *why* it works. The most important methods:
🌟Finding 1🌟 Fine-tuning on zero-shot and few-shot prompts together significantly improves both settings (not a trade-off)!
4/
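A minimal sketch of what mixing the two prompt formats during fine-tuning could look like, in the spirit of Finding 1; all names and data below are hypothetical:

```python
# Format ~10% of training examples as few-shot and the rest as zero-shot,
# so one fine-tuning run covers both prompting paradigms.
import random

def zero_shot(example):
    return f"Q: {example['q']}\nA:", example["a"]

def few_shot(example, demos):
    shots = "\n".join(f"Q: {d['q']}\nA: {d['a']}" for d in demos)
    return f"{shots}\nQ: {example['q']}\nA:", example["a"]

data = [{"q": f"question {i}", "a": f"answer {i}"} for i in range(100)]
rng = random.Random(0)

def mixed_batch(data, few_shot_fraction=0.1, k=2):
    """Build (input, target) pairs in a mix of zero- and few-shot formats."""
    batch = []
    for ex in data:
        if rng.random() < few_shot_fraction:
            demos = rng.sample([d for d in data if d is not ex], k)
            batch.append(few_shot(ex, demos))
        else:
            batch.append(zero_shot(ex))
    return batch

print(mixed_batch(data)[0])
```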
while I’m far from being accomplished, my highest accomplishment is not my models or papers, but my students. Many of my models, papers, and lectures were flawed & wrong in retrospect, but that’s why I teach: so that my students can do better research and correct me in the future.
So… I accidentally made @ShriramKMurthi’s day? I can die… I mean, finish grad school happy now.
Also, me before taking his class: I should write unit tests more frequently than I floss.
Me now: write tests before writing the actual code.
incredibly patient with my opinions even though he is obviously much, much more experienced. And of course so are my other co-authors. A huge thank you to all of you!
@tallinzen @AndrewLampinen Oh I trust Sewon et al. that there is nothing wrong with their paper per se. Could be as simple as the LMs and the datasets being different from Andrew’s. Plus, in our upcoming paper, humans also exhibit some similar behaviors, so best not to draw strong conclusions quite yet.
After an early experiment that naively trained on all prompts failed, we needed to design a non-naive training and eval mixture. No one else seemed to want to or have time to do it, so I picked it up. I read through 700-something prompts and 200-something datasets and picked them by hand.
Moral of the story: you don’t need fancy credentials at @BigScienceW as long as you are willing to do good and honest work (many other members are busy!) and cooperate with others. Truly amazing that the LHC vision has kinda worked out! Join us here! 7/7
I rarely take pictures of myself, but I felt a moral calling when I saw these irresistibly cute probability distributions at the Data to Actionable Knowledge Lab at Harvard.
Had an epiphany that typing in all lowercase is evolving into the English equivalent of タメ口 (casual speech), whereas proper case & punctuation = 敬語 (polite speech). Like タメ口, all lowercase & omitted punctuation are no longer just a personal preference but socially preferred/required in some contexts.
Yes I’m so sorry for the confusion! I started the T0++ name in experiment code as a joke ref to C++, but we ran out of time for ICLR so it accidentally became the official name. But we couldn’t have “+” in a model var name so we ended up with T0pp. 1/3
@_jasonwei @stanfordnlp Well actually, @albertwebson tells me that he gave the checkpoints some ice cream nicknames when he ported them to JAX at Google:
T0 -> vanilla
T0+ -> strawberry
T0++ -> chocolate
When I was teaching myself programming in high school, I thought about 𝚜𝚎𝚕𝚏 a lot. Now that I’m a CS & philosophy grad student, I think about self a lot.
「big king」—my new definitive favorite restaurant in all of New England. A truly eye-opening experience at only $55 that rivals any $200 omakase in NYC or any ¥20,000 kaiseki in Kyoto.
I have probably driven @SanhEstPasMoi crazy by repeatedly changing the dataset mixtures, @stevebach on the prompt reviewing guidelines, and @srush on paper writing and figure making. Amazingly, not only did they not mute me, they added me as a co-first-author!
I won't get into a rabbit hole on “meaning” here, but suffice it to say that T5 always had a special place in my heart and wow, what an honor to work with @colinraffel. Just an unbelievably nice person and…
Interlude: if some dataset decisions seem arbitrary, it's 90% likely simply because we didn't have the prompts or a particular dataset split ready at the time of training, or because they don't have GPT-3 baselines to compare to.
You, a normal person: There is a flying mammal in your apartment.
Me, who tries to keep a sleep schedule: Must submit my compute grid jobs before bed. The bat can stay for tonight while she waits for her asylum hearing.
Meanwhile, I myself keep referring to the non-plus one as “T0 Vanilla”, and the ++ one as, naturally, “T0 Chocolate”. When I first presented T0 to a Google team, they had just seen FLAN, and @iftenney joked that all LMs should be named after desserts. 3/4
Political euphemisms like “undocumented workers” and “illegal aliens” denote roughly the same group of people, but they carry extremely different connotations. Popular pretrained models, however, often conflate the two in pernicious ways. 1/4
The new mixture (what we called Experiment D3 then, now T0++) was a success, and after leading some error analyses (which @BrownNLP taught me well; much more on those analyses in a follow-up paper) and continuing to participate in discussions (aka being annoyingly opinionated)…
Didn’t realize just how Very Online I am until, speaking to an international audience at my #emnlp2020 Q&A, my on-the-fly examples of lexical semantics were all about the 2020 polling errors from treating Hispanics as a monolith and the legislative history of the 2013 grand compromise.
Tangentially related: teams in the DC Think Tank Softball League also have very clever names. There is no official website so here is one roster from 2010. Obviously my favorite is Dynamic Scorers, bit of a deep cut here…
At a @BayAreaNLP Q&A on the Octopus Test of Bender & Koller 2020:
Audience: What if A and B speak different languages, say Japanese and German?
@emilymbender: 行きましょう (“let’s go”)
@alkoller: oh shit you know too much German now
“…four years of continuous exposure to the New England elements has left parts of the sculpture—includes painted and lacquered cast bronze… and a stainless steel interior framework—in need of restoration. The extent of the weathering effect on the sculpture was unanticipated.”
@YiTayML @_jasonwei I almost did this for our lab pre-pandemic. Depending on the city, though, it can be super expensive upfront to book a large Airbnb (with enough separate beds), so we need to plan it super early.
While people are learning to cut their own hair, I’m learning to sharpen my own knives. Handy that IRS Notice 1444, definitely not a piece of electoral campaign promotion, served as a unit test.
I was kinda proud of myself for having been all caught up with NLP Twitter for a week and still not having been scooped yet, and then I realized wait, I have been looking at the wrong timeline. Anyway, now is a good time to declare tweet bankruptcy and see you at @albert@sigmoid.social!
@ShriramKMurthi @jtompkin @mind_realms I did not see this tweet when I took this screenshot. I was honestly just super impressed by James during the CS143 final presentations.
@ShriramKMurthi I agree that they’re often overdone in the literature, but how would you rephrase this abstract? (Not trying to argue but seriously asking for advice here.)
Amazing that both our paper (which finds that LMs cannot understand instructions; ) and this paper (which finds that instruction-tuning improves zero-shot abilities) were out on arXiv on the same day! Some clarifications: 1/4
New paper:
We teach a 137B parameter language model to follow instructions using “instruction tuning”, which finetunes the model on a collection of tasks described via instructions.
This improves the zero-shot abilities of the model on unseen tasks.
I especially encourage people without a traditional CS background to apply. (I majored in political science and math in undergrad. No first-author paper.) AI & NLP will be fatally misguided if we don’t take domain knowledge from adjacent fields seriously.
I just want to debunk the very specific claim that multiple first-author papers are *necessary* for admissions to top PhD programs in NLP/AI, because I think this claim can be harmful (7/N)
@_jasonwei Also, as mentioned in my email to you and @tallinzen, I’m not sure whether we should have arithmetic as a necessary condition of general intelligence. Plenty of humans (esp. w/o formal education) cannot do arithmetic, but it seems wrong to say they’re not generally intelligent.
@AndrewLampinen Thanks so much Andrew! Needless to say, much of this paper’s discussion is heavily influenced by your group’s recent work. I also wanted to say more about your recursive grammar paper in Appx. A, although I haven’t figured out how to polish the writing to fit it into the main text.
In an old draft of our lab logo, I deliberately broke the “do not resize elements” and “do not reconstitute text” rules because the official guideline has some questionable choices on spacing and alignment.
PS: When T5 first came out, I was super impressed by some of its task prefixes being in natural language (e.g., "translate English to German: ", "summarize: "). That has profound implications for the meaning of words like "summarize", orthogonal to whether the model is grounded.
Increasingly, the pressing research questions are no longer whether LMs are able to represent linguistic feature X, but exactly when and how they actually use X. My labmate Charlie’s new paper sheds some light on this new direction. Take a look!
Check out our new @iclr_conf paper w/ @tallinzen: we found that models use “good” features (e.g., S-V agreement) over “bad” features (e.g., lexical overlap) when they’re easier to extract from the model’s pre-trained representations.
Paper:
(1/3)
A likely misinterpretation is that smaller models (our paper) can’t understand instructions while larger models (they claim ⩾ 68B) can. But note that our measurement of understanding relies not on zero-shot performance but on *few-shot learning speed*. 2/4
Concretely, models trained with intentionally irrelevant prompts and pathologically misleading prompts can learn just as fast as those trained with instructively “good” prompts. 3/4
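To illustrate the measurement (with entirely made-up accuracy numbers, not our results): compare how many shots each prompt condition needs to reach a target accuracy, rather than comparing zero-shot accuracy.

```python
# Compare prompt conditions by few-shot learning speed: the number of shots
# needed to reach a target accuracy. Curves below are placeholder data.
shots = [0, 4, 8, 16, 32, 64, 128, 256]
curves = {
    "instructive": [0.51, 0.60, 0.68, 0.75, 0.81, 0.85, 0.87, 0.88],
    "irrelevant":  [0.50, 0.58, 0.67, 0.74, 0.80, 0.84, 0.87, 0.88],
    "misleading":  [0.50, 0.59, 0.67, 0.73, 0.80, 0.84, 0.86, 0.88],
}

def shots_to_reach(curve, shots, target=0.80):
    """First number of shots at which accuracy reaches `target`."""
    for n, acc in zip(shots, curve):
        if acc >= target:
            return n
    return None  # never reached within the budget

for condition, curve in curves.items():
    print(f"{condition:12s} reaches 80% at {shots_to_reach(curve, shots)} shots")
```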
I know we’re supposed to tweet about ICLR right now, but sorry, I need to take a moment and brag about why my department is awesome. Also, what is even going on with @jtompkin and TWD here.
An example of why I love working at @BrownCSDept -- this is our second year of a virtual chorus/musical sendoff to graduates. While it isn't how we'd like to celebrate graduation, it does show part of what makes us a fun community.
@sleepinyourhat I have told/discussed this paper with every labmate/NLP friend of mine over the past year. Together with @a_stadt’s “RoBERTa eventually prefers linguistic generalizations” paper, I think these are among the most fascinating questions in NLP right now, and thank you all for writing them!