tldr; you can go a long way in pre-training by (1) curating amazing data, (2) using a lot of FLOPs, and (3) otherwise not screwing up. All three are harder than they sound, so read the paper... That said, I'm amazed by our progress since Llama 3 - expect big things from Llama 4!
So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! Also check out the paper here, with lots of details on how this was made:
New paper in Science today on playing the classic negotiation game "Diplomacy" at a human level, by connecting language models with strategic reasoning! Our agent engages in intense and lengthy dialogues to persuade other players to follow its plans. This was really hard! 1/5
Excited to share a preview of Llama 3, including the release of 8B and 70B models (82 MMLU - should be the best open-weights model!), and preliminary results for a 405B model (still training, but already competitive with GPT-4). Lots more still to come...
Yes, both the 8B and 70B are trained way beyond Chinchilla-optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was still improving even at 15T tokens.
In anonymous blitz Diplomacy games against humans, our agent finished in the top 10%, doubling the average human score. This project was a big collaboration between experts in NLP, game theory and Diplomacy (too many to tag!). Paper here: 5/5
Happy to share MARGE, our new work on rethinking pre-training: given a document, we first retrieve related documents, and then paraphrase these to reconstruct the original. MARGE works well for generation and classification in many languages, sometimes without supervision. (1/6)
Excited to share our work on BART, a method for pre-training seq2seq models by de-noising text. BART outperforms previous work on a bunch of generation tasks (summarization/dialogue/QA), while getting similar performance to RoBERTa on SQuAD/GLUE
@petermilley
It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind...
In Diplomacy, 7 players hold private conversations and then make simultaneous moves. The dialogue is used to establish trust and coordinate actions with other players. Here, our agent (green) de-escalates with another player by reassuring them it will not attack them. 2/5
Each game, it sends and receives hundreds of messages, which must be precisely grounded in the game state, dialogue history, and its plans. We developed methods for filtering erroneous messages, letting the agent pass as human in 40 games. Guess which player is AI here... 4/5
Bored at home? Need a new friend?
Hang out with BART, the newest model available in transformers (thx
@sam_shleifer
), with the hefty 2.6 release (notes: ). Now you can get state-of-the-art summarization with a few lines of code: 👇👇👇
We automatically labelled training set messages with predicted moves, and used these as control tokens for the LM. During games, the LM is controlled by actions from a planning system. It tries to be helpful - here it (in blue) suggests mutually beneficial moves a human missed 3/5
New paper on scaling language models to sequences of a million bytes! MegaByte splits long byte sequences into fixed-size patches (analogous to tokens), then runs a large model between the patches, and a small model to predict each patch byte-by-byte. 1/
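As a rough illustration of the patch split (a toy sketch, not the released implementation - the patch size and padding byte here are made up):

```python
import numpy as np

def patchify(byte_seq, patch_size=8, pad_byte=0):
    """Split a byte sequence into fixed-size patches (MegaByte-style).
    The large global model runs once per patch; a small local model then
    predicts the bytes inside each patch one at a time."""
    b = np.frombuffer(byte_seq, dtype=np.uint8)
    pad = (-len(b)) % patch_size  # right-pad so length divides evenly
    b = np.concatenate([b, np.full(pad, pad_byte, dtype=np.uint8)])
    return b.reshape(-1, patch_size)  # (num_patches, patch_size)

patches = patchify(b"hello megabyte!", patch_size=8)
print(patches.shape)  # -> (2, 8)
```

With patch size P, the global model attends over n/P positions instead of n, which is where the savings on long sequences come from.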
I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
Some people still complain about the lack of peer review on arxiv - but I think it’s *great* for science when you can share a method on Thursday, and have independent groups confirm its effectiveness on Friday.
We know how to train incredible language models, but not how best to generate text with them... Check out
@XiangLisaLi2
's new search objective for open ended text generation, Contrastive Decoding, which outperforms existing approaches by a wide margin in human evaluation! (1/4)
We propose contrastive decoding (CD), a more reliable search objective for text generation by contrasting LMs of different sizes. CD takes a large LM (expert LM e.g. OPT-13b) and a small LM (amateur LM e.g. OPT-125m) and maximizes their logprob difference
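A toy numpy sketch of one greedy step under this objective (the `alpha` plausibility cutoff follows the paper's expert-probability constraint; the numbers are made up):

```python
import numpy as np

def contrastive_decoding_step(expert_logprobs, amateur_logprobs, alpha=0.1):
    """One greedy CD step: maximize expert minus amateur log-probability,
    restricted to tokens the expert itself finds plausible (probability
    within a factor alpha of the expert's best token)."""
    expert_probs = np.exp(expert_logprobs)
    plausible = expert_probs >= alpha * expert_probs.max()
    scores = np.where(plausible, expert_logprobs - amateur_logprobs, -np.inf)
    return int(np.argmax(scores))

# Toy 4-token vocabulary: the expert's top token (2) is also easy for the
# amateur, while token 1 is where the expert most outperforms the amateur.
expert = np.log(np.array([0.1, 0.3, 0.5, 0.1]))
amateur = np.log(np.array([0.1, 0.05, 0.6, 0.25]))
print(contrastive_decoding_step(expert, amateur))  # -> 1
```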
The paper is here. This "attention sink" trick does let your LLM work on unbounded input lengths without any degradation in perplexity - but of course that doesn't mean it's really using its full context.
Heading to ICLR! I’m writing fewer papers now to train more Llamas, but proud of our work here: Instruction Backtranslation (), Attention Sinks, () In Context Pretraining () and RA-DIT ().
New paper showing that Contrastive Decoding (CD) works really well for reasoning tasks, e.g. +6 on GSM8K and +4 on HellaSwag compared to greedy. CD searches for strings that are more likely under a good model than a weak model, emphasizing the improvement from the better model.
Contrastive Decoding Improves Reasoning in Large Language Models
paper page:
demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box
Sparse expert architectures are one of the very few approaches that really convincingly, repeatedly beat baseline transformers on language modelling. Great new survey here from Google!
Our survey on sparse expert models describes the advances over the last decade, discusses some difficulties, and presents our view on promising future areas.
Sparsity has been a fun area to work on the last two years. Excited for the models to come.
Very excited to be giving a talk on pre-training at my favourite workshop, REPL4NLP! I discuss several recent projects (RoBERTa, BART, kNN-LM, RAG, MARGE), and argue for how we can improve representation learning for language beyond brute force scaling.
Mike Lewis's (
@ml_perception
) RepL4NLP 2020 keynote talk "Beyond BERT" is now available! 🎉 Here: -- don't forget to join the live Q&A tomorrow at 18:00 PDT!
Can we stop using parameter counts to measure how large our LLMs are? Pre-training compute budget is a better proxy for scale, but I’d rather we framed it as *good* LMs (measured by e.g. perplexity). If not, I can share the first quadrillion parameter LLM (randomly initialized).
Improve your language model by converting it into a deep nearest neighbour classifier! The amazing
@ukhndlwl
pushes SOTA on Wikitext-103 by nearly 3 points, without any additional training (and gets a few other surprising results too).
Excited to share new work!!! “Generalization through Memorization: Nearest Neighbor Language Models”
We introduce kNN-LMs, which extend LMs with nearest neighbor search in embedding space, achieving a new state-of-the-art perplexity on Wikitext-103, without additional training!
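A minimal sketch of the interpolation step, assuming a toy datastore of (context embedding, next token) pairs - the names and hyperparameters here are illustrative, not the paper's code:

```python
import numpy as np

def knn_lm_probs(lm_probs, query, keys, next_tokens, k=4, lam=0.25):
    """Interpolate the base LM distribution with a distribution built from
    the k nearest datastore entries (context embedding -> next token)."""
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]
    weights = np.exp(-dists[nn])          # softmax over negative distances
    weights /= weights.sum()
    knn_probs = np.zeros_like(lm_probs)
    for w, idx in zip(weights, nn):
        knn_probs[next_tokens[idx]] += w  # neighbours vote for their token
    return lam * knn_probs + (1 - lam) * lm_probs

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 16))          # toy datastore of 100 contexts
next_tokens = rng.integers(0, 3, size=100)
lm_probs = np.array([0.2, 0.5, 0.3])
p = knn_lm_probs(lm_probs, keys[0], keys, next_tokens)
print(np.isclose(p.sum(), 1.0))  # -> True
```

Note the "no additional training" part: the datastore is built from a single forward pass over the training data, and only the mixture weight lam is tuned.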
At NeurIPS to present our work on agents that "think in language" by generating and then executing plans in the form of natural language instructions. Code and a large new dataset are online!
Researchers from Facebook AI have developed a method that teaches AI to plan by using natural language, and are releasing the real-time strategy game they used to train and evaluate this approach.
@nlpnoah
Mike
@ml_perception
is the collaborator that I spent the most hours with during my PhD and I couldn't be more grateful for that
The best compliment I ever got was when
@Tim_Dettmers
told me he thinks Mike's thinking has become 'simpler' because of his time working with me
@yoavgo
I understand where this is coming from, but we will release a proper research paper when it's ready. I do think it's useful to share these models as soon as possible though!
Today is a good day for open science.
As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and
Virtually presenting in the next two ICLR sessions! Instead of training models with billions of parameters, can we instead have a smaller model with large explicit memory?
@ukhndlwl
certainly sounds excited :-)
Check out
@OfirPress
’s new approach to representing positions in transformer language models: it’s simple, fast, has good inductive bias, and lets you test on longer sequences than you trained on!
Since Transformer LMs were invented, we’ve wanted them to be able to read longer inputs during inference than they saw during training. Our Attention with Linear Biases enables this, in very few lines of code, without requiring extra params or runtime 🧵⬇
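For intuition, a hedged numpy sketch of the ALiBi bias (head slopes as in the paper's recipe for power-of-two head counts; added to the attention logits before the softmax):

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Causal ALiBi bias: head h adds -m_h * (i - j) to the score of
    query position i attending to key position j (for j <= i)."""
    # Head slopes form a geometric sequence: 2^-1, 2^-2, ... for 8 heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).clip(min=0)            # 0 on and above the diagonal
    return -slopes[:, None, None] * dist  # (num_heads, seq_len, seq_len)

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)     # -> (8, 4, 4)
print(bias[0, 1, 0])  # -> -0.5 (head 1 slope is 2^-1, distance is 1)
```

The bias is simply added to the raw attention scores (alongside the causal mask), replacing learned or sinusoidal position embeddings; since it is a fixed function of distance, it extends to any sequence length at inference.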
If you want a respite from OpenAI drama, how about joining academia?
I'm starting Conceptualization Lab, recruiting PhDs & Postdocs!
We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
New pre-trained models can write great looking summaries, but ROUGE scores don't penalize them for telling lies.
@W4ngatang
has a new metric that correlates much better with factuality.
Our new work "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" does exactly that. We use question generation and question answering models to evaluate whether summaries are factually consistent w/ the source text.
I found the summarization performance surprisingly good - BART does seem to be able to combine information from across a whole document with background knowledge to produce highly abstractive summaries. Some typical examples beneath:
As
@karpathy
says, tokenization causes *so many* problems! Another is weird effects on nucleus sampling, depending on whether a rare word gets split into frequent tokens or not.
I hope MegaByte can help make future LLMs tokenization-free 4/4
Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.
Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with
Weiyan is amazing, and made absolutely critical contributions to the Diplomacy project! She's looking for an academic job, and you should probably hire her...
Super excited to announce our
@ScienceMagazine
paper on Cicero, the first human-level AI agent for natural-language negotiation in the 7-player game Diplomacy.
Link:
I am on the academic job market for faculty jobs! I work on NLP!
I'll be answering some questions at 18:00 PDT (in 1 hour) - if you have thoughts about pre-training, please come along (whether or not you watched the talk). Also, definitely watch
@ev_fedorenko
's keynote, it's been a long time since I learnt so much from a talk!
11:30am-12:30pm CST -
@sarahookr
&
@kchonyc
"Debate on Transfer Learning"
2-2:45pm -
@davlanade
"Cross-lingual Transfer for Named Entity Recognition: A study on African Languages"
2:45-3:30 -
@ml_perception
"Training Language Models to Negotiate in the Game of Diplomacy"
2/3
@chrmanning
@harm_devries
@DBahdanau
Obviously there's a place for datasets that reflect real world usage, but I suspect most natural distributions have a fat head that's now quite easy for us to model. We should also design datasets that target all the weird edge cases and weaknesses of our current models.
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣
We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative participant-driven benchmark and 2 panels!
More info:
This does seem closely related to T5 from Google last week. I haven't read that in detail yet, but it seems like we use a slightly different pre-training objective, and get better results for the same model size. We haven't tried training an 11B parameter model yet, though :-)
Unlike masked language models, this pre-training objective is closely related to several end tasks, such as summarization, retrieval and translation (e.g. BLEU scores of 35 with the raw pre-trained model). Let's build models that can do more tasks with less supervision! (2/6)
Here it is! The first ever Comcast bill negotiated 100% with A.I. and LLMs.
Our
@DoNotPay
ChatGPT bot talks to Comcast Chat to save one of our engineers $120 a year on their Internet bill.
Will be publicly available soon and work on online forms, chat and email.
@ykilcher
The "global" tokens in BigBird are quite different from attention sinks, because global tokens aggregate info across *all* tokens in a non-causal LM. Attention sinks see *no* other tokens (even their own token is irrelevant!), and capture null attention results in causal models.
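To make the contrast concrete, a hypothetical numpy sketch of a causal mask that keeps sink positions plus a sliding window of recent tokens (in the attention-sinks spirit; the window size and sink count are made up):

```python
import numpy as np

def sink_window_mask(seq_len, window=3, num_sinks=1):
    """Causal attention mask keeping the first `num_sinks` positions
    (attention sinks) plus a sliding window of the most recent tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i          # no looking ahead, unlike BigBird's globals
    recent = (i - j) < window
    sink = j < num_sinks
    return causal & (recent | sink)

m = sink_window_mask(6)
# Every query keeps position 0 (the sink) and its last 3 tokens:
print(m[5].astype(int))  # -> [1 0 0 1 1 1]
```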
To my knowledge, this is the first competitive alternative to variants of masked language modelling. I hope this work both encourages exploration of other alternatives to MLMs, and leads to better understanding of what pre-training is really learning. (5/6)
We test MegaByte on character level language modeling (big gains over prior byte-level models, competitive with subwords), byte-level image modelling (SOTA results) and modelling audio from raw files (no pre-processing!). MegaByte should really simplify multi-modal modelling! 3/
There's lots of work on efficient attention (MegaByte is O(n^4/3)), but the real problem with modeling long sequences is the use of big feedforwards per-position. Patchifying allows larger layers shared over multiple bytes, which focus on the hard decisions within each patch. 2/
By retrieving relevant facts during pre-training, the model can focus on learning to paraphrase, rather than memorizing encyclopedic knowledge. This approach also seems to make MARGE somewhat less prone to hallucinating facts when generating. (3/6)
@FelixHill84
@kchonyc
Sorry we didn't cite in the BART paper - I would have if I'd seen it (I wasn't really following sentence embedding work). It's interesting how early most of the key pre-training ideas came, yet it still took a couple more years to put them together in the right way.
It bothers me that we use nucleus sampling for long form output, and greedy decoding when we want a single right answer. Combined with
@XiangLisaLi2
's results last year for open ended generation (), Contrastive Decoding is now shown to work great for both.
Please express all critiques as limericks! However, much of our Diplomacy work involved giving us precise control over a LM, so that the agent's messages corresponded to its intended actions - which seems in keeping with the alignment agenda.
limerick?
it looks like our plans for alignment
might need just a little refinement!
let's hope that this guy
won't give this ai
a paperclip-making assignment!
Contrastive Decoding is easy to implement and cheap to run - give it a try! There are several interpretations of why it works, one is that the stronger model is constructing an informative string to communicate to its mental model of a listener (the weak model).
This auto-encoder framework simplifies the retrieval problem (compared to related recent models like REALM and RAG), allowing us to train the retrieval and reconstruction models jointly from a single objective and a random initialization. (4/6)
@nsaphra
Zero-shot/fine-tuned task performance would also be fine with me for measuring the quality of the model. But really, anything except parameter counts would be progress…
Instead, we propose searching for strings that are much more likely with a good LM than a weaker LM. These strings show interesting behaviours that the stronger LM has learnt but the weaker one hasn't (other interpretations in the paper). (4/4)
Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that *shortening* inputs improves performance while also increasing speed and memory efficiency. ⬇(1/n) (code below)
@kroscoo
Yeah, my point is that there are lots of ways to increase parameter counts without proportionally improving quality (mixture of experts layers is another obvious one). I think it’s pretty rare that parameter counts are even worth reporting.
@sivil_taram
Sorry for the pain here! The problem was caused by an unfortunate config bug in the original BART-large training run, which caused decoder sequences to start with </s> <s> … If you’re seeing issues, try setting these tokens during fine tuning and generation. Hope that helps!
@ale_suglia
@huggingface
Given all the magical things Transformers can learn without supervision, I'm pretty sure they can figure out segments for themselves!
@sleepinyourhat
Well, GPT-3's few shot learning would work a lot better if it had solved language understanding - but I think that kind of set up is a much better test than supervised training on large datasets.
@delliott
MegaByte scales to very long sequences, so I doubt there's a big problem here (BPE might be worse than UTF-8 for unevenly encoding languages). If you’re training a model for a language with this issue, increasing patch size while decreasing byte embed size would make it ~free.
@cbrew
Thanks! We did play a bunch of internal games with a mix of humans and bots, and sometimes it took a surprisingly long time to identify the bots (even knowing what to look for).
@VeredShwartz
What's wrong with just taking the product of all the token probabilities (including the end-of-sentence symbol)? If your language model is good, it should prefer typical length sequences from its training data.
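Concretely, the suggestion is just summing per-token log-probabilities through the end-of-sentence symbol (toy numbers for illustration):

```python
import math

def sequence_logprob(token_logprobs):
    """Score a sequence as the sum of per-token log-probabilities,
    including the end-of-sentence token, so the model's own length
    preferences are part of the score."""
    return sum(token_logprobs)

# Toy example: three tokens plus EOS.
lp = [math.log(p) for p in (0.5, 0.4, 0.9, 0.8)]
print(round(math.exp(sequence_logprob(lp)), 3))  # -> 0.144
```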
@paul_scharre
@HaydnBelfield
@DavidSKrueger
@MichaelD1729
@polynoamial
Cicero's modular design means that deception can’t emerge (in the sense you mean), and also that we can often understand its decisions. Apparent deception happens either when it changes its mind after sending a message, or from failing to accurately describe its plan in language.
@_julianmichael_
@VeredShwartz
Prove me wrong, but I'd be amazed if strong MT models put significant probability mass on the empty string (if it never happens in the training set). Even if LMs aren't good enough yet, let's put our effort into making better ones, instead of hacking around their limitations.
@yoavgo
Off the top of my head: zero/few shot learning, new pre-training methods that work well in that setting, and naturally occurring train/test sets (or maybe adversarial ones).
@peterjliu
At least part of it is because there's a book in validation which is way out of distribution (mentioned here: ). Valid and test also only contain 50/100 books respectively, which might explain why relative orderings of models are different on valid/test.
@ThomasScialom
@OpenAI
I haven't seen that, but it would be interesting! You might also be interested in MARGE, which uses an unsupervised pre-training objective that is closely related to several downstream tasks, and works really well zero-shot in some cases:
@sleepinyourhat
I *think* the lawyers would let us train a model for research purposes and publish the result, but the no-derivatives bit would stop us releasing the model.
@annargrs
Let's say not ACL-worthy. 75-80% of all submissions are rejected, so even if we're completely balanced, a big majority of dataset papers won't get in. You suggest some criteria that reviewers shouldn't use to reject papers, but what do you think they should be using?
@yoavgo
But I don't know of datasets with 100k crowdsourced (non-adversarial) examples where models don't get superhuman accuracy. If there are any, we can just pre-train RoBERTa a bit longer...
@yoavgo
Yeah, I should've added that I meant as a research problem (because it works). Clearly these datasets have been incredibly important, and have driven lots of progress - but if they're not useful to researchers in the future, are they worth teaching? I wouldn't know.
Sampling from LMs avoids this problem, but can introduce errors from unlucky samples, leading to truncated sampling schemes like topk or nucleus. But sampling only aims to produce average quality strings from the LM's distribution, not the best ones. (3/4)
Searching for the most likely strings gives short, repetitive or boring output. Intuitively - for any long/interesting string, there are a huge number of slight variations of similar quality, so no individual string has high likelihood (even if they're collectively likely). (2/4)