Mike Lewis Profile
Mike Lewis

@ml_perception

7,091
Followers
232
Following
11
Media
268
Statuses

Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, kNN-LM, top-k sampling & Deal Or No Deal.

Seattle
Joined September 2019
Pinned Tweet
@ml_perception
Mike Lewis
3 months
tldr; you can go a long way in pre-training by (1) curating amazing data, (2) using a lot of FLOPs, and (3) otherwise not screwing up. All three are harder than they sound, so read the paper... That said, I'm amazed by our progress since Llama 3 - expect big things from Llama 4!
@ml_perception
Mike Lewis
3 months
So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! Also check out the paper here, with lots of details on how this was made:
3
20
178
4
15
164
@ml_perception
Mike Lewis
2 years
New paper in Science today on playing the classic negotiation game "Diplomacy" at a human level, by connecting language models with strategic reasoning! Our agent engages in intense and lengthy dialogues to persuade other players to follow its plans. This was really hard! 1/5
101
806
4K
@ml_perception
Mike Lewis
6 months
Excited to share a preview of Llama3, including the release of an 8B and 70B (82 MMLU, should be the best open weights model!), and preliminary results for a 405B model (still training, but already competitive with GPT4). Lots more still to come...
18
97
507
@ml_perception
Mike Lewis
6 months
Yes, both the 8B and 70B are trained way beyond Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
@felix_red_panda
Felix
6 months
Llama3 8B is trained on almost 100 times the Chinchilla optimal number of tokens
7
7
181
14
37
500
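The "almost 100 times" figure in the quoted tweet is easy to sanity-check with the common Chinchilla rule of thumb of roughly 20 training tokens per parameter (a back-of-envelope approximation, not the exact fitted scaling law):

```python
# Back-of-envelope check of the "~100x Chinchilla" claim, assuming the
# rough ~20 tokens-per-parameter rule of thumb (not the exact law fit).
params = 8e9             # Llama 3 8B
tokens_trained = 15e12   # ~15T training tokens

chinchilla_tokens = 20 * params              # ~160B tokens
ratio = tokens_trained / chinchilla_tokens
print(f"~{ratio:.0f}x the Chinchilla-optimal token count")  # ~94x
```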
@ml_perception
Mike Lewis
2 years
In anonymous blitz Diplomacy games against humans, our agent finished in the top 10%, doubling the average human score. This project was a big collaboration between experts in NLP, game theory and Diplomacy (too many to tag!). Paper here: 5/5
12
59
480
@ml_perception
Mike Lewis
4 years
Happy to share MARGE, our new work on rethinking pre-training: given a document, we first retrieve related documents, and then paraphrase these to reconstruct the original. MARGE works well for generation and classification in many languages, sometimes without supervision. (1/6)
8
92
411
@ml_perception
Mike Lewis
5 years
Excited to share our work on BART, a method for pre-training seq2seq models by de-noising text. BART outperforms previous work on a bunch of generation tasks (summarization/dialogue/QA), while getting similar performance to RoBERTa on SQuAD/GLUE
5
90
379
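The denoising setup behind BART can be sketched in a few lines. This is a toy illustration of the text-infilling idea (corrupt spans, train a seq2seq model to reconstruct the original), not the released noising code, which samples span lengths from Poisson(3) and also permutes sentences:

```python
import random

def text_infill(tokens, mask="<mask>", mask_ratio=0.3, mean_span=3):
    """Toy sketch of BART-style text infilling: delete contiguous spans
    of tokens and replace each span with a single mask token. A seq2seq
    model is then trained to reconstruct the original sequence."""
    n_to_mask = int(len(tokens) * mask_ratio)
    out, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < n_to_mask and random.random() < mask_ratio:
            # mask a span of 1..2*mean_span tokens, capped at the budget
            span = max(1, min(n_to_mask - masked, random.randint(1, 2 * mean_span)))
            out.append(mask)
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

random.seed(0)
noised = text_infill("the quick brown fox jumps over the lazy dog".split())
```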
@ml_perception
Mike Lewis
2 years
@petermilley It's designed to never intentionally backstab - all its messages correspond to actions it currently plans to take. However, sometimes it changes its mind...
16
4
229
@ml_perception
Mike Lewis
2 years
In Diplomacy, 7 players hold private conversations and then make simultaneous moves. The dialogue is used to establish trust and coordinate actions with other players. Here, our agent (green) de-escalates with another player by reassuring them it will not attack them. 2/5
1
15
218
@ml_perception
Mike Lewis
2 years
Each game, it sends and receives hundreds of messages, which must be precisely grounded in the game state, dialogue history, and its plans. We developed methods for filtering erroneous messages, letting the agent pass for human in 40 games. Guess which player is AI here... 4/5
9
14
207
@ml_perception
Mike Lewis
5 years
BART is now ridiculously easy to use. Great work as always by @huggingface !
@huggingface
Hugging Face
5 years
Bored at home? Need a new friend? Hang out with BART, the newest model available in transformers (thx @sam_shleifer ) , with the hefty 2.6 release (notes: ). Now you can get state-of-the-art summarization with a few lines of code: 👇👇👇
14
208
753
0
38
192
@ml_perception
Mike Lewis
2 years
We automatically labelled training set messages with predicted moves, and used these as control tokens for the LM. During games, the LM is controlled by actions from a planning system. It tries to be helpful - here it (in blue) suggests mutually beneficial moves a human missed. 3/5
3
9
179
@ml_perception
Mike Lewis
1 year
New paper on scaling language models to sequences of a million bytes! MegaByte splits long byte sequences into fixed-size patches (analogous to tokens), then runs a large model between the patches, and a small model to predict each patch byte-by-byte. 1/
@_akhaliq
AK
1 year
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers abs: paper page:
15
222
1K
4
31
173
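The patching step described in this thread is simple to sketch (illustrative patch size; the actual models operate on learned embeddings per patch, this only shows the split between global and local sequences):

```python
def patchify(byte_seq: bytes, patch_size: int = 8, pad: int = 0):
    """Sketch of MegaByte-style patching: pad the byte sequence to a
    multiple of patch_size, then split it into fixed-size patches.
    The large global model attends across patches; a small local model
    predicts the bytes within each patch."""
    padding = (-len(byte_seq)) % patch_size
    padded = byte_seq + bytes([pad]) * padding
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

patches = patchify(b"hello, megabyte!", patch_size=8)
# two 8-byte patches for this 16-byte input
```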
@ml_perception
Mike Lewis
6 months
I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
@ml_perception
Mike Lewis
6 months
Yes, both the 8B and 70B are trained way beyond Chinchilla optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens.
14
37
500
6
14
173
@ml_perception
Mike Lewis
3 years
Some people still complain about the lack of peer review on arxiv - but I think it’s *great* for science when you can share a method on Thursday, and have independent groups confirm its effectiveness on Friday.
4
14
133
@ml_perception
Mike Lewis
2 years
We know how to train incredible language models, but not how best to generate text with them... Check out @XiangLisaLi2 's new search objective for open ended text generation, Contrastive Decoding, which outperforms existing approaches by a wide margin in human evaluation! (1/4)
@XiangLisaLi2
Xiang Lisa Li
2 years
We propose contrastive decoding (CD), a more reliable search objective for text generation by contrasting LMs of different sizes. CD takes a large LM (expert LM e.g. OPT-13b) and a small LM (amateur LM e.g. OPT-125m) and maximizes their logprob difference
8
121
714
2
20
125
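The objective in the quoted tweet can be sketched as a single decoding step. This is a simplified illustration (function name and the alpha plausibility threshold are chosen for the sketch; the paper's adaptive plausibility constraint has the same form):

```python
import math

def contrastive_decode_step(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Sketch of one Contrastive Decoding step: restrict to tokens that
    are reasonably likely under the expert LM (p >= alpha * p_max), then
    pick the token maximizing expert logprob minus amateur logprob."""
    cutoff = max(expert_logprobs.values()) + math.log(alpha)
    plausible = {t: lp for t, lp in expert_logprobs.items() if lp >= cutoff}
    return max(plausible, key=lambda t: plausible[t] - amateur_logprobs[t])

# "paris" is likely under the expert but not the amateur, so CD picks it
expert = {"paris": math.log(0.6), "the": math.log(0.3), "zzz": math.log(0.001)}
amateur = {"paris": math.log(0.2), "the": math.log(0.5), "zzz": math.log(0.001)}
print(contrastive_decode_step(expert, amateur))  # paris
```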
@ml_perception
Mike Lewis
1 year
The paper is here. This "attention sink" trick does let your LLM work on an unbounded length of input without any degradation in perplexity - but of course that doesn't mean it's really using its full context.
@ctjlewis
Lewis
1 year
if it wasn’t MIT i’d scream bullshit. but?
32
51
793
1
19
110
@ml_perception
Mike Lewis
5 months
Heading to ICLR! I’m writing fewer papers now to train more Llamas, but proud of our work here: Instruction Backtranslation (), Attention Sinks (), In Context Pretraining (), and RA-DIT ().
5
10
123
@ml_perception
Mike Lewis
1 year
New paper showing that Contrastive Decoding (CD) works really well for reasoning tasks, e.g. +6 on GSM8K and +4 on HellaSwag compared to greedy. CD searches for strings that are more likely under a good model than a weak model, emphasizing the improvement from the better model.
@_akhaliq
AK
1 year
Contrastive Decoding Improves Reasoning in Large Language Models paper page: demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box
2
108
538
2
18
115
@ml_perception
Mike Lewis
2 years
Sparse expert architectures are one of the very few approaches that really convincingly, repeatedly beat baseline transformers on language modelling. Great new survey here from Google!
@LiamFedus
William Fedus
2 years
Our survey on sparse expert models describes the advances over the last decade, discusses some difficulties, and presents our view on promising future areas. Sparsity has been a fun area to work on the last two years. Excited for the models to come.
7
72
350
0
14
103
@ml_perception
Mike Lewis
4 years
Very excited to be giving a talk on pre-training at my favourite workshop, REPL4NLP! I discuss several recent projects (RoBERTa, BART, kNN-LM, RAG, MARGE), and argue for how we can improve representation learning for language beyond brute force scaling.
@sigrep_acl
SIGREP
4 years
Mike Lewis's ( @ml_perception ) RepL4NLP 2020 keynote talk "Beyond BERT" is now available! 🎉 Here: -- don't forget to join the live Q&A tomorrow at 18:00 PDT!
0
7
37
0
10
89
@ml_perception
Mike Lewis
3 years
Can we stop using parameter counts to measure how large our LLMs are? Pre-training compute budget is a better proxy for scale, but I’d rather we framed it as *good* LMs (measured by e.g. perplexity). If not, I can share the first quadrillion parameter LLM (randomly initialized).
@yoavgo
(((ل()(ل() 'yoav))))👾
3 years
NLPers, when you use "LLM" you mean:
11
4
17
8
6
83
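The compute-as-proxy point can be made concrete with the standard C ≈ 6·N·D approximation for dense transformer training (a rule of thumb, not an exact FLOP count; the example numbers are illustrative):

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate pre-training compute via the common C ~ 6*N*D rule of
    thumb (forward + backward over D tokens, N-parameter dense model)."""
    return 6 * params * tokens

# A randomly initialized quadrillion-parameter model has used ~0 training
# FLOPs, while e.g. an 8B model trained on 15T tokens has used:
print(f"{train_flops(8e9, 15e12):.1e}")  # 7.2e+23
```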
@ml_perception
Mike Lewis
5 years
Improve your language model by converting it into a deep nearest neighbour classifier! The amazing @ukhndlwl pushes SOTA on Wikitext-103 by nearly 3 points, without any additional training (and gets a few other surprising results too).
@ukhndlwl
Urvashi Khandelwal
5 years
Excited to share new work!!! “Generalization through Memorization: Nearest Neighbor Language Models” We introduce kNN-LMs, which extend LMs with nearest neighbor search in embedding space, achieving a new state-of-the-art perplexity on Wikitext-103, without additional training!
7
161
636
1
15
73
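The kNN-LM interpolation described in the quoted tweet can be sketched as follows (a toy version: in the real model, neighbors are retrieved by embedding distance from a datastore of training contexts; names and the weighting scheme here are illustrative):

```python
import math
from collections import Counter

def knn_lm_prob(lm_probs, neighbors, lam=0.25):
    """Sketch of kNN-LM: mix the base LM distribution with a distribution
    built from retrieved nearest neighbors. `neighbors` is a list of
    (target_token, distance) pairs; closer neighbors get more weight via
    a softmax over negative distances."""
    weights = [math.exp(-d) for _, d in neighbors]
    z = sum(weights)
    knn_probs = Counter()
    for (tok, _), w in zip(neighbors, weights):
        knn_probs[tok] += w / z
    vocab = set(lm_probs) | set(knn_probs)
    return {t: lam * knn_probs[t] + (1 - lam) * lm_probs.get(t, 0.0)
            for t in vocab}

lm = {"a": 0.7, "b": 0.3}
neighbors = [("b", 0.0), ("b", 1.0), ("a", 2.0)]  # (retrieved target, distance)
mixed = knn_lm_prob(lm, neighbors)  # retrieval boosts "b" above its LM prob
```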
@ml_perception
Mike Lewis
2 years
I did my first PhD Offence today - congratulations Dr Alex Wang!!
@sleepinyourhat
Sam Bowman
2 years
Congrats to @W4ngatang for a successful dissertation defense today!
13
1
131
1
1
68
@ml_perception
Mike Lewis
5 years
At NeurIPS to present our work on agents that "think in language" by generating and then executing plans in the form of natural language instructions. Code and a large new dataset are online!
@AIatMeta
AI at Meta
5 years
Researchers from Facebook AI have developed a method that teaches AI to plan by using natural language, and are releasing the real-time strategy game they used to train and evaluate this approach.
1
86
261
0
10
63
@ml_perception
Mike Lewis
1 year
The best thing about "advising" people is how much smarter they make you. Thanks @OfirPress and all the rest of you!
@OfirPress
Ofir Press
1 year
@nlpnoah Mike @ml_perception is the collaborator that I spent the most hours with during my PhD and I couldn't be more grateful for that The best compliment I ever got was when @Tim_Dettmers told me he thinks Mike's thinking has become 'simpler' because of his time working with me
1
0
20
1
0
46
@ml_perception
Mike Lewis
6 months
@yoavgo I understand where this is coming from, but we will release a proper research paper when it's ready. I do think it's useful to share these models as soon as possible though!
4
0
61
@ml_perception
Mike Lewis
4 months
Excited to see the open source release of FAIR's early fusion multimodal LLMs!
@AIatMeta
AI at Meta
4 months
Today is a good day for open science. As part of our continued commitment to the growth and development of an open ecosystem, today at Meta FAIR we’re announcing four new publicly available AI models and additional research artifacts to inspire innovation in the community and
98
527
2K
0
4
47
@ml_perception
Mike Lewis
4 years
Virtually presenting in the next two ICLR sessions! Instead of training models with billions of parameters, can we instead have a smaller model with large explicit memory? @ukhndlwl certainly sounds excited :-)
0
4
44
@ml_perception
Mike Lewis
2 years
Jason Weston and I are looking to co-host a postdoc at FAIR! The topic is flexible within NLP as long as everyone's excited. Apply at:
@jaseweston
Jason Weston
2 years
NLP Postdoc opportunity at FAIR! Mike Lewis ( @ml_perception ) and I are seeking to co-mentor. Apply here:
2
19
104
0
5
43
@ml_perception
Mike Lewis
3 years
Check out @OfirPress ’s new approach to representing positions in transformer language models: it’s simple, fast, has good inductive bias, and lets you test on longer sequences than you trained on!
@OfirPress
Ofir Press
3 years
Since Transformer LMs were invented, we’ve wanted them to be able to read longer inputs during inference than they saw during training. Our Attention with Linear Biases enables this, in very few lines of code, without requiring extra params or runtime 🧵⬇
8
158
666
0
0
35
@ml_perception
Mike Lewis
11 months
Ari is an incredibly creative researcher and a great mentor - I can't wait to see what comes out of his lab!
@universeinanegg
Ari Holtzman really enjoyed COLM
11 months
If you want a respite from OpenAI drama, how about joining academia? I'm starting Conceptualization Lab, recruiting PhDs & Postdocs! We need new abstractions to understand LLMs. Conceptualization is the act of building abstractions to see something new.
14
63
277
1
1
15
@ml_perception
Mike Lewis
5 years
New pre-trained models can write great looking summaries, but ROUGE scores don't penalize them for telling lies. @W4ngatang has a new metric that correlates much better with factuality.
@W4ngatang
Alex Wang
5 years
Our new work "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" does exactly that. We use question generation and question answering models to evaluate whether summaries are factually consistent w/ the source text.
4
39
168
1
2
26
@ml_perception
Mike Lewis
5 years
I found the summarization performance surprisingly good - BART does seem to be able to combine information from across a whole document with background knowledge to produce highly abstractive summaries. Some typical examples beneath:
2
4
25
@ml_perception
Mike Lewis
3 years
Great work from DeepMind showing just how well expert-parallel transformers scale up
@_akhaliq
AK
3 years
Unified Scaling Laws for Routed Language Models abs:
1
13
67
0
4
23
@ml_perception
Mike Lewis
1 year
As @karpathy says, tokenization causes *so many* problems! Another is weird effects on nucleus sampling, depending on whether a rare word gets split into frequent tokens or not. I hope MegaByte can help make future LLMs tokenization-free 4/4
@karpathy
Andrej Karpathy
1 year
Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details. Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with
90
599
4K
1
3
21
@ml_perception
Mike Lewis
2 years
Weiyan is amazing, and made absolutely critical contributions to the Diplomacy project! She's looking for an academic job, and you should probably hire her...
@shi_weiyan
Weiyan Shi
2 years
Super excited to announce our @ScienceMagazine paper on Cicero, the first human-level negotiation AI agent with natural language in the 7-player Diplomacy game. Link: I am on the academic job market for faculty jobs! I work on NLP!
4
9
66
1
2
22
@ml_perception
Mike Lewis
4 years
I'll be answering some questions at 18:00 PDT (in 1 hour) - if you have thoughts about pre-training, please come along (whether or not you watched the talk). Also, definitely watch @ev_fedorenko 's keynote, it's been a long time since I learnt so much from a talk!
@sigrep_acl
SIGREP
4 years
Mike Lewis's ( @ml_perception ) RepL4NLP 2020 keynote talk "Beyond BERT" is now available! 🎉 Here: -- don't forget to join the live Q&A tomorrow at 18:00 PDT!
0
7
37
3
6
19
@ml_perception
Mike Lewis
2 years
Excited to be giving a talk about the language models behind our Cicero Diplomacy agent at the Transfer Learning workshop (2.45, Theater C)!
@AlbalakAlon
Alon Albalak
2 years
11:30am-12:30pm CST - @sarahookr & @kchonyc "Debate on Transfer Learning" 2-2:45pm - @davlanade "Cross-lingual Transfer for Named Entity Recognition: A study on African Languages" 2:45-3:30 - @ml_perception "Training Language Models to Negotiate in the Game of Diplomacy" 2/3
1
0
6
1
2
19
@ml_perception
Mike Lewis
4 years
@chrmanning @harm_devries @DBahdanau Obviously there's a place for datasets that reflect real world usage, but I suspect most natural distributions have a fat head that's now quite easy for us to model. We should also design datasets that target all the weird edge cases and weaknesses of our current models.
1
4
18
@ml_perception
Mike Lewis
4 years
@_rockt Wow, imagine how trivial it must be to make RL work in a language-only environment - that's only 1D!
1
0
18
@ml_perception
Mike Lewis
4 years
Really excited about this workshop!
@colinraffel
Colin Raffel
4 years
📣 Announcing the ICLR 2021 Workshop on Enormous Language Models 📣 We have an incredible speaker lineup that covers building, evaluating, critiquing, and improving large LMs, as well as a collaborative parcipant-driven benchmark and 2 panels! More info:
6
47
249
0
0
18
@ml_perception
Mike Lewis
5 years
This does seem closely related to T5 from Google last week. I haven't read that in detail yet, but it seems like we use a slightly different pre-training objective, and better results for the same model size. We haven't tried training an 11B parameter model yet, though :-)
1
0
16
@ml_perception
Mike Lewis
2 years
If you'd like to get a sense for the strength of our Diplomacy agent, check out this detailed video from an expert human playing 6 bots!
0
0
16
@ml_perception
Mike Lewis
4 years
Unlike masked language models, this pre-training objective is closely related to several end tasks, such as summarization, retrieval and translation (e.g. BLEU scores of 35 with the raw pre-trained model). Let's build models that can do more tasks with less supervision! (2/6)
1
0
15
@ml_perception
Mike Lewis
2 years
Real world application for negotiation AI!
@jbrowder1
Joshua Browder
2 years
Here it is! The first ever Comcast bill negotiated 100% with A.I and LLMs. Our @DoNotPay ChatGPT bot talks to Comcast Chat to save one of our engineers $120 a year on their Internet bill. Will be publicly available soon and work on online forms, chat and email.
191
1K
11K
0
1
12
@ml_perception
Mike Lewis
1 year
@ykilcher The "global" tokens in BigBird are quite different from attention sinks, because global tokens aggregate info across *all* tokens in a non-causal LM. Attention sinks see *no* other tokens (even their own token is irrelevant!), and capture null attention results in causal models.
0
0
13
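The cache policy implied by attention sinks can be sketched as follows (a simplified StreamingLLM-style eviction rule; function name and sizes are illustrative, not the paper's exact configuration):

```python
def streaming_cache_positions(seq_len, n_sink=4, window=1020):
    """Sketch of an attention-sink KV cache policy: always keep the first
    n_sink "sink" tokens plus a sliding window of the most recent tokens,
    evicting everything in between. This bounds memory while preserving
    the sink positions that soak up null attention in causal models."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

positions = streaming_cache_positions(5000)  # 4 sinks + last 1020 tokens
```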
@ml_perception
Mike Lewis
4 years
To my knowledge, this is the first competitive alternative to variants of masked language modelling. I hope this work both encourages exploration of other alternatives to MLMs, and leads to better understanding of what pre-training is really learning. (5/6)
1
0
13
@ml_perception
Mike Lewis
1 year
We test MegaByte on character level language modeling (big gains over prior byte-level models, competitive with subwords), byte-level image modelling (SOTA results) and modelling audio from raw files (no pre-processing!). MegaByte should really simplify multi-modal modelling! 3/
1
0
13
@ml_perception
Mike Lewis
1 year
There's lots of work on efficient attention (MegaByte is O(n^4/3)), but the real problem with modeling long sequences is the use of big feedforwards per-position. Patchifying allows larger layers shared over multiple bytes, which focus on the hard decisions within each patch. 2/
1
1
11
@ml_perception
Mike Lewis
4 years
By retrieving relevant facts during pre-training, the model can focus on learning to paraphrase, rather than memorizing encyclopedic knowledge. This approach also seems to make MARGE somewhat less prone to hallucinating facts when generating. (3/6)
1
1
11
@ml_perception
Mike Lewis
2 years
@FelixHill84 @kchonyc Sorry we didn't cite in the BART paper - I would have if I'd seen it (I wasn't really following sentence embedding work). It's interesting how early most of the key pre-training ideas came, yet it still took a couple more years to put them together in the right way.
1
0
9
@ml_perception
Mike Lewis
1 year
It bothers me that we use nucleus sampling for long form output, and greedy decoding when we want a single right answer. Combined with @XiangLisaLi2 's results last year for open ended generation (), Contrastive Decoding is now shown to work great for both.
1
0
8
@ml_perception
Mike Lewis
2 years
Please express all critiques as limericks! However, much of our Diplomacy work involved giving us precise control over a LM, so that the agent's messages corresponded to its intended actions - which seems in keeping with the alignment agenda.
@BellaRudd1
Bella Rudd
2 years
limerick? it looks like our plans for alignment might need just a little refinement! let's hope that this guy won't give this ai a paperclip-making assignment!
0
3
39
0
0
8
@ml_perception
Mike Lewis
5 years
Joint work with @YinhanL Naman Goyal @gh_marjan Abdelrahman Mohamed, @omerlevy_ @vesko_st @LukeZettlemoyer
1
1
7
@ml_perception
Mike Lewis
1 year
Contrastive Decoding is easy to implement and cheap to run, give it a try! There are several interpretations of why it works, one is that the stronger model is constructing an informative string to communicate to its mental model of a listener (the weak model).
1
0
7
@ml_perception
Mike Lewis
6 months
@soldni @natolambert @rosstaylor90 The blog is correct (sigh...)
2
0
7
@ml_perception
Mike Lewis
4 years
This auto-encoder framework simplifies the retrieval problem (compared to related recent models like REALM and RAG), allowing us to train the retrieval and reconstruction models jointly from a single objective and a random initialization. (4/6)
2
0
7
@ml_perception
Mike Lewis
4 years
@b_niranjan BART (in Fairseq and Huggingface Transformers) is designed to be good at this kind of task.
1
0
6
@ml_perception
Mike Lewis
5 years
@timohear Thanks! Code and pre-trained models will be available very soon.
0
0
6
@ml_perception
Mike Lewis
3 years
@nsaphra Zero-shot/fine-tuned task performance would also be fine with me for measuring the quality of the model. But really, anything except parameter counts would be progress…
0
0
6
@ml_perception
Mike Lewis
2 years
Instead, we propose searching for strings that are much more likely with a good LM than a weaker LM. These strings show interesting behaviours that the stronger LM has learnt but the weaker one hasn't (other interpretations in the paper). (4/4)
0
0
5
@ml_perception
Mike Lewis
4 years
@OfirPress has a gift for finding 2-line code changes that make a big difference.
@OfirPress
Ofir Press
4 years
Everyone thinks that you have to increase the input length of language models to improve their performance. Our new Shortformer model shows that by *shortening* inputs performance improves while speed and memory efficiency go up. ⬇(1/n) (code below)
8
89
545
0
0
5
@ml_perception
Mike Lewis
1 year
This work was done by the amazing Sean O'Brien during his residency at FAIR. Look out for more great things from him in the future!
0
0
4
@ml_perception
Mike Lewis
3 years
@gneubig @complingy @tallinzen @ReviewAcl Very anecdotally, whatever NeurIPS did this year seemed to give much better automatic suggestions for reviewers.
0
0
4
@ml_perception
Mike Lewis
4 years
@hllo_wrld @huggingface @SanhEstPasMoi How small do you want? I have a 140M parameter version I can share?
1
0
4
@ml_perception
Mike Lewis
3 years
@kroscoo Yeah, my point is that there are lots of ways to increase parameter counts without proportionally improving quality (mixture of experts layers is another obvious one). I think it’s pretty rare that parameter counts are even worth reporting.
0
0
4
@ml_perception
Mike Lewis
2 years
@sivil_taram Sorry for the pain here! The problem was caused by an unfortunate config bug in the original BART-large training run, which caused decoder sequences to start with </s> <s> … If you’re seeing issues, try setting these tokens during fine tuning and generation. Hope that helps!
1
0
4
@ml_perception
Mike Lewis
5 months
@OfirPress I should probably apply for an internship with you!
1
0
4
@ml_perception
Mike Lewis
5 years
@srush_nlp I also remember spending a couple of days trying and failing to get a non-embarrassing number on WikiText 2...
0
0
3
@ml_perception
Mike Lewis
3 years
@gneubig I know of a few other attempts at directions like this, but it's really hard to avoid hurting perplexity - great work!
0
0
3
@ml_perception
Mike Lewis
4 years
@ale_suglia @huggingface Given all the magical things Transformers can learn without supervision, I'm pretty sure they can figure out segments for themselves!
0
0
3
@ml_perception
Mike Lewis
4 years
@sleepinyourhat Well, GPT-3's few shot learning would work a lot better if it had solved language understanding - but I think that kind of set up is a much better test than supervised training on large datasets.
0
0
3
@ml_perception
Mike Lewis
5 years
@nsaphra BERT suggests team/group/trio. We've finally found a task it sucks at...
0
0
3
@ml_perception
Mike Lewis
9 months
@ssgrn @Meta Congrats!! So excited to work with you again!
0
0
2
@ml_perception
Mike Lewis
1 year
@delliott MegaByte scales to very long sequences, so I doubt there's a big problem here (BPE might be worse than UTF-8 for unevenly encoding languages). If you’re training a model for a language with this issue, increasing patch size while decreasing byte embed size would make it ~free.
2
0
3
@ml_perception
Mike Lewis
2 years
@cbrew Thanks! We did play a bunch of internal games with a mix of humans and bots, and sometimes it took a surprisingly long time to identify the bots (even knowing what to look for).
1
1
2
@ml_perception
Mike Lewis
4 years
@VeredShwartz What's wrong with just taking the product of all the token probabilities (including the end-of-sentence symbol)? If your language model is good, it should prefer typical length sequences from its training data.
1
0
2
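The suggestion in this reply can be written down directly. A sketch (token probabilities would come from the model; summing logs rather than multiplying probabilities avoids numerical underflow on long sequences):

```python
import math

def sequence_logprob(token_probs):
    """Score a candidate sequence as the sum of log token probabilities,
    including the end-of-sequence symbol. Equivalent to the log of the
    product of all token probabilities."""
    return sum(math.log(p) for p in token_probs)

short = sequence_logprob([0.5, 0.9])   # short candidate incl. EOS prob
long_ = sequence_logprob([0.9] * 10)   # longer candidate incl. EOS prob
```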
@ml_perception
Mike Lewis
3 years
@deviparikh I love how this tweet is illustrated with a calendar that accounts for almost every hour of every day :-)
1
0
2
@ml_perception
Mike Lewis
2 years
@paul_scharre @HaydnBelfield @DavidSKrueger @MichaelD1729 @polynoamial Cicero's modular design means that deception can’t emerge (in the sense you mean), and also that we can often understand its decisions. Apparent deception happens either when it changes its mind after sending a message, or from failing to accurately describe its plan in language.
1
0
2
@ml_perception
Mike Lewis
4 years
@_julianmichael_ @VeredShwartz Prove me wrong, but I'd be amazed if strong MT models put significant probability mass on the empty string (if it never happens in the training set). Even if LMs aren't good enough yet, let's put our effort into making better ones, instead of hacking around their limitations.
2
0
2
@ml_perception
Mike Lewis
4 years
@yoavgo Off the top of my head: zero/few shot learning, new pre-training methods that work well in that setting, and naturally occurring train/test sets (or maybe adversarial ones).
2
0
2
@ml_perception
Mike Lewis
5 years
@yurakuratov Code/models will be available soon!
0
0
2
@ml_perception
Mike Lewis
1 year
@peterjliu At least part of it is because there's a book in validation which is way out of distribution (mentioned here: ). Valid and test also only contain 50/100 books respectively, which might explain why relative orderings of models are different on valid/test.
1
0
2
@ml_perception
Mike Lewis
4 years
@ThomasScialom @OpenAI I haven't seen that, but it would be interesting! You might also be interested in MARGE, which uses an unsupervised pre-training objective that is closely related to several downstream tasks, and works really well zero-shot in some cases:
0
0
2
@ml_perception
Mike Lewis
4 years
@sleepinyourhat I *think* the lawyers would let us train a model for research purposes and publish the result, but the no-derivatives bit would stop us releasing the model.
0
0
2
@ml_perception
Mike Lewis
5 years
@annargrs Let's say not ACL-worthy. 75-80% of all submissions are rejected, so even if we're completely balanced then a big majority of dataset papers won't get in. You suggest some criteria that reviewers shouldn't use to reject papers, but what do you think should they be using?
1
0
1
@ml_perception
Mike Lewis
2 years
@ryandcotterell It's actually named after the Pecsaetan tribe, who used to inhabit the area.
0
0
1
@ml_perception
Mike Lewis
4 years
@amitness The former.
0
0
1
@ml_perception
Mike Lewis
4 years
@yoavgo But I don't know of datasets with a 100k crowdsourced (non adversarial) examples where models don't get superhuman accuracy. If there are any, we can just pre-train Roberta a bit longer...
1
0
1
@ml_perception
Mike Lewis
4 years
@yoavgo Yeah, I should've added that I meant as a research problem (because it works). Clearly these datasets have been incredibly important, and have driven lots of progress - but if they're not useful to researchers in the future, are they worth teaching? I wouldn't know.
1
0
1
@ml_perception
Mike Lewis
2 years
Sampling from LMs avoids this problem, but can introduce errors from unlucky samples, leading to truncated sampling schemes like top-k or nucleus. But sampling only aims to produce average quality strings from the LM's distribution, not the best ones. (3/4)
1
0
1
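The truncated sampling schemes mentioned here are short to sketch; this is a minimal nucleus (top-p) sampler over an explicit next-token distribution (names are illustrative; real implementations operate on logits over the full vocabulary):

```python
import random

def nucleus_sample(probs, p=0.9):
    """Sketch of nucleus (top-p) sampling: keep the smallest set of
    most-probable tokens whose cumulative probability reaches p,
    renormalize, and sample from that truncated distribution."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    z = sum(pr for _, pr in kept)
    toks = [t for t, _ in kept]
    return random.choices(toks, weights=[pr / z for _, pr in kept])[0]

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
token = nucleus_sample(probs, p=0.9)  # "d" falls outside the nucleus
```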
@ml_perception
Mike Lewis
2 years
Searching for the most likely strings gives short, repetitive or boring output. Intuitively - for any long/interesting string, there are a huge number of slight variations of similar quality, so no individual string has high likelihood (even if they're collectively likely). (2/4)
1
0
1
@ml_perception
Mike Lewis
4 years
@ukhndlwl @MSFTResearch Well deserved, congrats!!
0
0
1