tldr; you can go a long way in pre-training by (1) curating amazing data, (2) using a lot of FLOPs, and (3) otherwise not screwing up. All three are harder than they sound, so read the paper... That said, I'm amazed by our progress since Llama 3 - expect big things from Llama 4!
So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! Also check out the paper here, with lots of details on how this was made:
New paper in Science today on playing the classic negotiation game "Diplomacy" at a human level, by connecting language models with strategic reasoning! Our agent engages in intense and lengthy dialogues to persuade other players to follow its plans. This was really hard! 1/5
Excited to share a preview of Llama 3, including the release of 8B and 70B models (82 MMLU - should be the best open-weights model!), and preliminary results for a 405B model (still training, but already competitive with GPT-4). Lots more still to come...
Yes, both the 8B and 70B are trained way beyond Chinchilla-optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was still improving even at 15T tokens.
In anonymous blitz Diplomacy games against humans, our agent finished in the top 10%, doubling the average human score. This project was a big collaboration between experts in NLP, game theory and Diplomacy (too many to tag!). Paper here: 5/5
Happy to share MARGE, our new work on rethinking pre-training: given a document, we first retrieve related documents, and then paraphrase these to reconstruct the original. MARGE works well for generation and classification in many languages, sometimes without supervision. (1/6)
Excited to share our work on BART, a method for pre-training seq2seq models by de-noising text. BART outperforms previous work on a bunch of generation tasks (summarization/dialogue/QA), while getting similar performance to RoBERTa on SQuAD/GLUE
@petermilley
It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind...
In Diplomacy, 7 players hold private conversations and then make simultaneous moves. The dialogue is used to establish trust and coordinate actions with other players. Here, our agent (green) de-escalates with another player by reassuring them it will not attack them. 2/5
Each game, it sends and receives hundreds of messages, which must be precisely grounded in the game state, dialogue history, and its plans. We developed methods for filtering erroneous messages, letting the agent pass as human in 40 games. Guess which player is AI here... 4/5
Bored at home? Need a new friend?
Hang out with BART, the newest model available in transformers (thx
@sam_shleifer
), with the hefty 2.6 release (notes: ). Now you can get state-of-the-art summarization with a few lines of code: 👇👇👇
We automatically labelled training set messages with predicted moves, and used these as control tokens for the LM. During games, the LM is controlled by actions from a planning system. It tries to be helpful - here it (in blue) suggests mutually beneficial moves a human missed 3/5
New paper on scaling language models to sequences of a million bytes! MegaByte splits long byte sequences into fixed-size patches (analogous to tokens), then runs a large model between the patches, and a small model to predict each patch byte-by-byte. 1/
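As a rough illustration of the patch split (a toy sketch, not the released implementation - the patch size and padding byte here are made up):

```python
import numpy as np

def patchify(byte_seq, patch_size=8, pad_byte=0):
    """Split a byte sequence into fixed-size patches (MegaByte-style).
    The large global model runs once per patch; a small local model then
    predicts the bytes inside each patch one at a time."""
    b = np.frombuffer(byte_seq, dtype=np.uint8)
    pad = (-len(b)) % patch_size  # right-pad so length divides evenly
    b = np.concatenate([b, np.full(pad, pad_byte, dtype=np.uint8)])
    return b.reshape(-1, patch_size)  # (num_patches, patch_size)

patches = patchify(b"hello megabyte!", patch_size=8)
print(patches.shape)  # -> (2, 8)
```

With patch size P, the global model attends over n/P positions instead of n, which is where the savings on long sequences come from.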
I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
Some people still complain about the lack of peer review on arxiv - but I think it’s *great* for science when you can share a method on Thursday, and have independent groups confirm its effectiveness on Friday.
We know how to train incredible language models, but not how best to generate text with them... Check out
@XiangLisaLi2
's new search objective for open ended text generation, Contrastive Decoding, which outperforms existing approaches by a wide margin in human evaluation! (1/4)
We propose contrastive decoding (CD), a more reliable search objective for text generation by contrasting LMs of different sizes. CD takes a large LM (expert LM e.g. OPT-13b) and a small LM (amateur LM e.g. OPT-125m) and maximizes their logprob difference
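A toy numpy sketch of one greedy step under this objective (the `alpha` plausibility cutoff follows the paper's expert-probability constraint; the numbers are made up):

```python
import numpy as np

def contrastive_decoding_step(expert_logprobs, amateur_logprobs, alpha=0.1):
    """One greedy CD step: maximize expert minus amateur log-probability,
    restricted to tokens the expert itself finds plausible (probability
    within a factor alpha of the expert's best token)."""
    expert_probs = np.exp(expert_logprobs)
    plausible = expert_probs >= alpha * expert_probs.max()
    scores = np.where(plausible, expert_logprobs - amateur_logprobs, -np.inf)
    return int(np.argmax(scores))

# Toy 4-token vocabulary: the expert's top token (2) is also easy for the
# amateur, while token 1 is where the expert most outperforms the amateur.
expert = np.log(np.array([0.1, 0.3, 0.5, 0.1]))
amateur = np.log(np.array([0.1, 0.05, 0.6, 0.25]))
print(contrastive_decoding_step(expert, amateur))  # -> 1
```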
The paper is here. This "attention sink" trick does let your LLM work on unbounded input lengths without any degradation in perplexity - but of course that doesn't mean it's really using its full context.
Heading to ICLR! I’m writing fewer papers now to train more Llamas, but proud of our work here: Instruction Backtranslation (), Attention Sinks, () In Context Pretraining () and RA-DIT ().
New paper showing that Contrastive Decoding (CD) works really well for reasoning tasks, e.g. +6 on GSM8K and +4 on HellaSwag compared to greedy. CD searches for strings that are more likely under a good model than a weak model, emphasizing the improvement from the better model.
Contrastive Decoding Improves Reasoning in Large Language Models
paper page:
demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box
Sparse expert architectures are one of the very few approaches that really convincingly, repeatedly beat baseline transformers on language modelling. Great new survey here from Google!
Our survey on sparse expert models describes the advances over the last decade, discusses some difficulties, and presents our view on promising future areas.
Sparsity has been a fun area to work on the last two years. Excited for the models to come.
Very excited to be giving a talk on pre-training at my favourite workshop, REPL4NLP! I discuss several recent projects (RoBERTa, BART, kNN-LM, RAG, MARGE), and argue for how we can improve representation learning for language beyond brute force scaling.
Mike Lewis's (
@ml_perception
) RepL4NLP 2020 keynote talk "Beyond BERT" is now available! 🎉 Here: -- don't forget to join the live Q&A tomorrow at 18:00 PDT!
Can we stop using parameter counts to measure how large our LLMs are? Pre-training compute budget is a better proxy for scale, but I’d rather we framed it as *good* LMs (measured by e.g. perplexity). If not, I can share the first quadrillion parameter LLM (randomly initialized).
Improve your language model by converting it into a deep nearest neighbour classifier! The amazing
@ukhndlwl
pushes SOTA on Wikitext-103 by nearly 3 points, without any additional training (and gets a few other surprising results too).
Excited to share new work!!! “Generalization through Memorization: Nearest Neighbor Language Models”
We introduce kNN-LMs, which extend LMs with nearest neighbor search in embedding space, achieving a new state-of-the-art perplexity on Wikitext-103, without additional training!
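A minimal sketch of the interpolation step, assuming a toy datastore of (context embedding, next token) pairs - the names and hyperparameters here are illustrative, not the paper's code:

```python
import numpy as np

def knn_lm_probs(lm_probs, query, keys, next_tokens, k=4, lam=0.25):
    """Interpolate the base LM distribution with a distribution built from
    the k nearest datastore entries (context embedding -> next token)."""
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    weights = np.exp(-dists[nn])          # softmax over negative distances
    weights /= weights.sum()
    knn_probs = np.zeros_like(lm_probs)
    for w, idx in zip(weights, nn):
        knn_probs[next_tokens[idx]] += w  # neighbours vote for their token
    return lam * knn_probs + (1 - lam) * lm_probs

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))          # toy datastore of 100 contexts
next_tokens = rng.integers(0, 3, size=100)
lm_probs = np.array([0.2, 0.5, 0.3])
p = knn_lm_probs(lm_probs, keys[0], keys, next_tokens)
print(np.isclose(p.sum(), 1.0))  # -> True
```

Note the "no additional training" part: the datastore is built from a single forward pass over the training data, and only the mixture weight lam is tuned.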
At NeurIPS to present our work on agents that "think in language" by generating and then executing plans in the form of natural language instructions. Code and a large new dataset are online!
Researchers from Facebook AI have developed a method that teaches AI to plan by using natural language, and are releasing the real-time strategy game they used to train and evaluate this approach.
@nlpnoah
Mike
@ml_perception
is the collaborator that I spent the most hours with during my PhD and I couldn't be more grateful for that
The best compliment I ever got was when
@Tim_Dettmers
told me he thinks Mike's thinking has become 'simpler' because of his time working with me
@yoavgo
I understand where this is coming from, but we will release a proper research paper when it's ready. I do think it's useful to share these models as soon as possible though!
Today is a good day for open science.
As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and
Virtually presenting in the next two ICLR sessions! Instead of training models with billions of parameters, can we instead have a smaller model with large explicit memory?
@ukhndlwl
certainly sounds excited :-)
Check out
@OfirPress
’s new approach to representing positions in transformer language models: it’s simple, fast, has good inductive bias, and lets you test on longer sequences than you trained on!
Since Transformer LMs were invented, we’ve wanted them to be able to read longer inputs during inference than they saw during training. Our Attention with Linear Biases enables this, in very few lines of code, without requiring extra params or runtime 🧵⬇
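For intuition, a hedged numpy sketch of the ALiBi bias (head slopes as in the paper's recipe for power-of-two head counts; added to the attention logits before the softmax):

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Causal ALiBi bias: head h adds -m_h * (i - j) to the score of
    query position i attending to key position j (for j <= i)."""
    # Head slopes form a geometric sequence: 2^-1, 2^-2, ... for 8 heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).clip(min=0)            # 0 on and above the diagonal
    return -slopes[:, None, None] * dist  # (num_heads, seq_len, seq_len)

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)     # -> (8, 4, 4)
print(bias[0, 1, 0])  # -> -0.5 (head 1 slope is 2^-1, distance is 1)
```

The bias is simply added to the raw attention scores (alongside the causal mask), replacing learned or sinusoidal position embeddings; since it is a fixed function of distance, it extends to any sequence length at inference.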
If you want a respite from OpenAI drama, how about joining academia?
I'm starting Conceptualization Lab, recruiting PhDs & Postdocs!
We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
New pre-trained models can write great looking summaries, but ROUGE scores don't penalize them for telling lies.
@W4ngatang
has a new metric that correlates much better with factuality.
Our new work "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" does exactly that. We use question generation and question answering models to evaluate whether summaries are factually consistent w/ the source text.
I found the summarization performance surprisingly good - BART does seem to be able to combine information from across a whole document with background knowledge to produce highly abstractive summaries. Some typical examples beneath:
As
@karpathy
says, tokenization causes *so many* problems! Another is weird effects on nucleus sampling, depending on whether a rare word gets split into frequent tokens or not.
I hope MegaByte can help make future LLMs tokenization-free 4/4
Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.
Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with
Weiyan is amazing, and made absolutely critical contributions to the Diplomacy project! She's looking for an academic job, and you should probably hire her...
Super excited to announce our
@ScienceMagazine
paper on Cicero, the first human-level AI agent for natural-language negotiation in the 7-player game Diplomacy.
Link:
I am on the academic job market for faculty jobs! I work on NLP!
I'll be answering some questions at 18:00 PDT (in 1 hour) - if you have thoughts about pre-training, please come along (whether or not you watched the talk). Also, definitely watch
@ev_fedorenko
's keynote, it's been a long time since I learnt so much from a talk!
11:30am-12:30pm CST -
@sarahookr
&
@kchonyc
"Debate on Transfer Learning"
2-2:45pm -
@davlanade
"Cross-lingual Transfer for Named Entity Recognition: A study on African Languages"
2:45-3:30 -
@ml_perception
"Training Language Models to Negotiate in the Game of Diplomacy"
2/3
@chrmanning
@harm_devries
@DBahdanau
Obviously there's a place for datasets that reflect real world usage, but I suspect most natural distributions have a fat head that's now quite easy for us to model. We should also design datasets that target all the weird edge cases and weaknesses of our current models.
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣
We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative participant-driven benchmark and 2 panels!
More info:
This does seem closely related to T5 from Google last week. I haven't read that in detail yet, but it seems like we use a slightly different pre-training objective, and get better results for the same model size. We haven't tried training an 11B parameter model yet, though :-)
Unlike masked language models, this pre-training objective is closely related to several end tasks, such as summarization, retrieval and translation (e.g. BLEU scores of 35 with the raw pre-trained model). Let's build models that can do more tasks with less supervision! (2/6)
Here it is! The first ever Comcast bill negotiated 100% with A.I. and LLMs.
Our
@DoNotPay
ChatGPT bot talks to Comcast Chat to save one of our engineers $120 a year on their Internet bill.
Will be publicly available soon and work on online forms, chat and email.
@ykilcher
The "global" tokens in BigBird are quite different from attention sinks, because global tokens aggregate info across *all* tokens in a non-causal LM. Attention sinks see *no* other tokens (even their own token is irrelevant!), and capture null attention results in causal models.
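To make the contrast concrete, a hypothetical numpy sketch of a causal mask that keeps sink positions plus a sliding window of recent tokens (in the attention-sinks spirit; the window size and sink count are made up):

```python
import numpy as np

def sink_window_mask(seq_len, window=3, num_sinks=1):
    """Causal attention mask keeping the first `num_sinks` positions
    (attention sinks) plus a sliding window of the most recent tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i          # no looking ahead, unlike BigBird's globals
    recent = (i - j) < window
    sink = j < num_sinks
    return causal & (recent | sink)

m = sink_window_mask(6)
# Every query keeps position 0 (the sink) and its last 3 tokens:
print(m[5].astype(int))  # -> [1 0 0 1 1 1]
```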
To my knowledge, this is the first competitive alternative to variants of masked language modelling. I hope this work both encourages exploration of other alternatives to MLMs, and leads to better understanding of what pre-training is really learning. (5/6)
We test MegaByte on character level language modeling (big gains over prior byte-level models, competitive with subwords), byte-level image modelling (SOTA results) and modelling audio from raw files (no pre-processing!). MegaByte should really simplify multi-modal modelling! 3/
There's lots of work on efficient attention (MegaByte is O(n^4/3)), but the real problem with modeling long sequences is the use of big feedforwards per-position. Patchifying allows larger layers shared over multiple bytes, which focus on the hard decisions within each patch. 2/
By retrieving relevant facts during pre-training, the model can focus on learning to paraphrase, rather than memorizing encyclopedic knowledge. This approach also seems to make MARGE somewhat less prone to hallucinating facts when generating. (3/6)
@FelixHill84
@kchonyc
Sorry we didn't cite in the BART paper - I would have if I'd seen it (I wasn't really following sentence embedding work). It's interesting how early most of the key pre-training ideas came, yet it still took a couple more years to put them together in the right way.
It bothers me that we use nucleus sampling for long form output, and greedy decoding when we want a single right answer. Combined with
@XiangLisaLi2
's results last year for open ended generation (), Contrastive Decoding is now shown to work great for both.
Please express all critiques as limericks! However, much of our Diplomacy work involved giving us precise control over a LM, so that the agent's messages corresponded to its intended actions - which seems in keeping with the alignment agenda.
limerick?
it looks like our plans for alignment
might need just a little refinement!
let's hope that this guy
won't give this ai
a paperclip-making assignment!
Contrastive Decoding is easy to implement and cheap to run - give it a try! There are several interpretations of why it works, one is that the stronger model is constructing an informative string to communicate to its mental model of a listener (the weak model).
This auto-encoder framework simplifies the retrieval problem (compared to related recent models like REALM and RAG), allowing us to train the retrieval and reconstruction models jointly from a single objective and a random initialization. (4/6)
@nsaphra
Zero-shot/fine-tuned task performance would also be fine with me for measuring the quality of the model. But really, anything except parameter counts would be progress…
Instead, we propose searching for strings that are much more likely with a good LM than a weaker LM. These strings show interesting behaviours that the stronger LM has learnt but the weaker one hasn't (other interpretations in the paper). (4/4)
Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that *shortening* inputs improves performance while also increasing speed and memory efficiency. ⬇(1/n) (code below)
@kroscoo
Yeah, my point is that there are lots of ways to increase parameter counts without proportionally improving quality (mixture of experts layers is another obvious one). I think it’s pretty rare that parameter counts are even worth reporting.
@sivil_taram
Sorry for the pain here! The problem was caused by an unfortunate config bug in the original BART-large training run, which caused decoder sequences to start with </s> <s> … If you’re seeing issues, try setting these tokens during fine tuning and generation. Hope that helps!
@ale_suglia
@huggingface
Given all the magical things Transformers can learn without supervision, I'm pretty sure they can figure out segments for themselves!
@sleepinyourhat
Well, GPT-3's few shot learning would work a lot better if it had solved language understanding - but I think that kind of set up is a much better test than supervised training on large datasets.
@delliott
MegaByte scales to very long sequences, so I doubt there's a big problem here (BPE might be worse than UTF-8 for unevenly encoding languages). If you’re training a model for a language with this issue, increasing patch size while decreasing byte embed size would make it ~free.
@cbrew
Thanks! We did play a bunch of internal games with a mix of humans and bots, and sometimes it took a surprisingly long time to identify the bots (even knowing what to look for).
@VeredShwartz
What's wrong with just taking the product of all the token probabilities (including the end-of-sentence symbol)? If your language model is good, it should prefer typical length sequences from its training data.
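Concretely, the suggestion is just summing per-token log-probabilities through the end-of-sentence symbol (toy numbers for illustration):

```python
import math

def sequence_logprob(token_logprobs):
    """Score a sequence as the sum of per-token log-probabilities,
    including the end-of-sentence token, so the model's own length
    preferences are part of the score."""
    return sum(token_logprobs)

# Toy example: three tokens plus EOS.
lp = [math.log(p) for p in (0.5, 0.4, 0.9, 0.8)]
print(round(math.exp(sequence_logprob(lp)), 3))  # -> 0.144
```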
@paul_scharre
@HaydnBelfield
@DavidSKrueger
@MichaelD1729
@polynoamial
Cicero's modular design means that deception can’t emerge (in the sense you mean), and also that we can often understand its decisions. Apparent deception happens either when it changes its mind after sending a message, or from failing to accurately describe its plan in language.
@_julianmichael_
@VeredShwartz
Prove me wrong, but I'd be amazed if strong MT models put significant probability mass on the empty string (if it never happens in the training set). Even if LMs aren't good enough yet, let's put our effort into making better ones, instead of hacking around their limitations.
@yoavgo
Off the top of my head: zero/few shot learning, new pre-training methods that work well in that setting, and naturally occurring train/test sets (or maybe adversarial ones).
@peterjliu
At least part of it is because there's a book in validation which is way out of distribution (mentioned here: ). Valid and test also only contain 50/100 books respectively, which might explain why relative orderings of models are different on valid/test.
@ThomasScialom
@OpenAI
I haven't seen that, but it would be interesting! You might also be interested in MARGE, which uses an unsupervised pre-training objective that is closely related to several downstream tasks, and works really well zero-shot in some cases:
@sleepinyourhat
I *think* the lawyers would let us train a model for research purposes and publish the result, but the no-derivatives bit would stop us releasing the model.
@annargrs
Let's say not ACL-worthy. 75-80% of all submissions are rejected, so even if we're completely balanced, a big majority of dataset papers won't get in. You suggest some criteria that reviewers shouldn't use to reject papers, but what do you think they should be using?
@yoavgo
But I don't know of datasets with 100k crowdsourced (non-adversarial) examples where models don't get superhuman accuracy. If there are any, we can just pre-train RoBERTa a bit longer...
@yoavgo
Yeah, I should've added that I meant as a research problem (because it works). Clearly these datasets have been incredibly important, and have driven lots of progress - but if they're not useful to researchers in the future, are they worth teaching? I wouldn't know.
Sampling from LMs avoids this problem, but can introduce errors from unlucky samples, leading to truncated sampling schemes like topk or nucleus. But sampling only aims to produce average quality strings from the LM's distribution, not the best ones. (3/4)
Searching for the most likely strings gives short, repetitive or boring output. Intuitively - for any long/interesting string, there are a huge number of slight variations of similar quality, so no individual string has high likelihood (even if they're collectively likely). (2/4)