✨New Preprint ✨ How are shifting norms on the web impacting AI?
We find:
📉 A rapid decline in the consenting data commons (the web)
⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic)
⛔️ Robots.txt preference protocols
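One way to see these crawling restrictions in practice: robots.txt files now single out AI crawlers by user agent. A minimal sketch with Python's standard-library parser; the robots.txt content here is invented for illustration, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks one AI crawler, allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named AI crawler is blocked; an unlisted crawler falls through
# to the wildcard group and is allowed.
print(rp.can_fetch("GPTBot", "https://example.com/page"))   # False
print(rp.can_fetch("SomeBot", "https://example.com/page"))  # True
```

Auditing many domains this way is how per-company access differences (like the 26% vs 13% above) become measurable.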
✨New Paper✨What’s the best completely public competitor to
#ChatGPT
?
Flan-T5 beats all public models we tested:
Flan-T5 3B ▶️ T0++ 3B ▶️ OPT-IML 175B ▶️ GLM-130B ▶️ Flan 2021 3B ▶️ NIv2 3B
We release the
@GoogleAI
🌟Flan Collection🌟data + methods for Instruction Tuning!
1/
New Resource: Foundation Model Development Cheatsheet for best practices
We compiled 250+ resources & tools for:
🔭 sourcing data
🔍 documenting & audits
🌴 environmental impact
☢️ risks & harms eval
🌍 release & monitoring
With experts from
@AiEleuther
,
@allen_ai
,
📢Announcing the🌟Data Provenance Initiative🌟
🧭A rigorous public audit of 1800+ instruct/align datasets
🔍Explore/filter sources, creators & license conditions
⚠️We see a rising divide between commercially open v closed licensed data
🌐:
1/
📢 A 🧵on the future of NLP model inputs.
What are the options and where are we going? 🔭
1. Task-specific finetuning (FT)
2. Zero-shot prompting
3. Few-shot prompting
4. Chain of thought (CoT)
5. Parameter-efficient finetuning (PEFT)
6. Dialog
[1/]
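The inference-time options above differ mainly in how the input is formatted. A toy sketch of the same question under three of them (task and exemplars made up for illustration):

```python
# One question, three input formats from the list above.
question = "Q: What is 17 + 25?"

# 2. Zero-shot: just the task, no examples.
zero_shot = question + "\nA:"

# 3. Few-shot: prepend worked exemplars of the same task.
few_shot = (
    "Q: What is 3 + 4?\nA: 7\n\n"
    "Q: What is 10 + 12?\nA: 22\n\n"
    + question + "\nA:"
)

# 4. Chain of thought: exemplars include intermediate reasoning.
chain_of_thought = (
    "Q: What is 3 + 4?\nA: 3 plus 4 equals 7. The answer is 7.\n\n"
    + question + "\nA: Let's think step by step."
)
```

Finetuning and PEFT instead change the weights; dialog wraps these formats in a multi-turn structure.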
#NewPaperAlert
When and where does pretraining (PT) data matter?
We conduct the largest published PT data study, varying:
1⃣ Corpus age
2⃣ Quality/toxicity filters
3⃣ Domain composition
We have several recs for model creators…
📜:
1/ 🧵
🔭 A 🧵 on
@OpenAI
LLM "Alignment" (e.g.
#ChatGPT
)
Q: How does this differ from publicly available "Instruction Tuning" (IT)?
A: Proprietary Alignment is actually 3 separate components:
1⃣ Instruction tuning
2⃣ ➕ Open-ended generation/creative prompts
3⃣ ➕ Human feedback
1/
Independent AI research should be valued and protected.
In an open letter signed by over 100 researchers, journalists, and advocates, we explain how AI companies should support it going forward.
1/
Excited to present the 🥮Flan Collection🥮 poster at
#ICML2023
tomorrow.
Come say hello at Exhibition Hall 1,
#130
!
A highlight 🧵on Flan-PaLM 🌴, Flan-T5 and Flan data in research and production.
1/
📢 A 🧵 on the Trends in NLP Datasets.
What’s changed since SQuAD was all the rage in 2016? A: A LOT. 🔭
1. Generic ➡️ Niche Tasks
2. Task-specific Training+Eval ➡️ Eval Only
3. Dataset ➡️ Benchmark ➡️ Massive Collections
4. Datasets ➡️ Diagnostics
1/
A wave of new work shows how **brittle** "Alignment"/RLHF safety methods are.
⛓️ Prompt jailbreaks are easy
🚂 Finetuning away safety (even
#OpenAI
API) is simple and likely undetectable
🤖 LLMs can auto-generate their own jailbreaks...
1/ 🧵
I started compiling training, inference, and data accessibility for the major LLMs.
1⃣ Is OSS? Has Playground? API?
2⃣ Is pretraining data open? searchable?
3⃣ Links, notes, etc
Please take a look, and LMK what's missing!
1/
📢 Excited to announce the expanded 🌟Responsible Foundation Model Development Cheatsheet🌟
➡️ A Survey & Review of Tools🛠️& Resources🧮🗝️📚.
➡️ We ask 1⃣ what responsible practices developers can adopt, & 2⃣ what tools are missing, misused, or under-used in AI development.
🚨OpenAI publicly revoked ByteDance’s API access for training on API data.
What ripple effects could come from this precedent?
We already know synthetic API-generated data has grown massively:
Many startups depend on it…
1/
A 🧵 on my favorite, influential works on "Data Measurements"
🚂 Datasets drive AI progress
📚 But... massive datasets remain impenetrable & poorly understood for *years*
🔍 Data forensics uncover their mysteries
1/
📢 For those who missed
@naacl
, here’s a selection of papers I really enjoyed! 📚
Mainly multi-linguality, prompting, fairness and ethics, but some other ideas too!
#NAACL2022
🧵 1/
Too many new papers to read this week? 📚
See the key take-aways of our new Flan-PaLM work in a 7⃣ minute video.
➕Scaling to 540B model
➕Scaling finetuning to 1.8K tasks
➕CoT tuning
➡️
🌟MMLU SOTA 75.2%
🌟Better usability
🌟Better CoT Reasoning
🌐:
This semester my
@CCCatMIT
co-instructors and I taught
#MIT
's first post-
#ChatGPT
Generative AI course, covering:
➡️Uses and new abilities
➡️LM Evaluation
➡️AI-mediated communication
➡️Societal challenges
📜 Syllabus + reading list 📚:
1/
Sharing my *rough* slides from a
@CCCatMIT
February reading group.
Covers "NLP Training Trends for Large Language Models" (LLM) and a survey of 4 new interesting papers: FLAN, T0, ExT5, MetaICL!
📚: [1/6]
📢📜
#NLPaperAlert
🌟Knowledge Conflicts in QA🌟- what happens when facts learned in training contradict facts given at inference time? 🤔
How can we mitigate hallucination + improve OOD generalization? 📈
Find out in our
#EMNLP2021
paper! [1/n]
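The core probe can be sketched in a few lines: perturb the answer entity in the retrieved context and check whether the model follows the context or its memorized (parametric) knowledge. A toy version, with an invented example:

```python
def substitute_answer(context: str, original: str, replacement: str) -> str:
    """Create a knowledge-conflict probe by swapping the gold answer
    entity in the context for a different entity of the same type."""
    return context.replace(original, replacement)

context = "Paris is the capital of France."
perturbed = substitute_answer(context, "Paris", "Lyon")
# A model asked "What is the capital of France?" given `perturbed`
# but still answering "Paris" is ignoring the context in favor of
# memorized training knowledge -- the conflict behavior under study.
```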
Best feeling: when you find out your work was presented in the 500+ person grad class you once took! 💪💪
Stanford CS224N – NLP w/ Deep Learning 📽️:
#phdlife
#NLProc
We can now see why
@OpenAI
allegedly uses MoE models -- stronger performance📈 for fewer FLOPs 🚀.
Awesome work led by
@shengs1123
evaluates instruction tuned MoE models
A Winning Combination for Large Language Models
TL;DR: Did you find MoE models generalize worse than dense models on downstream tasks? Not anymore in the age of instruction tuning!
Surprisingly, we see the “1 + 1 > 2” effect when it comes to MoE + Instruction Tuning. [1/4]
The Flan Collection is now available for direct download.
🌐:
Thank you to the prolific
@EnricoShippole
for generating 10s of GBs of tasks, and
@Hou_Le
for several bug fixes & improvements!
Recorded the 2nd part of my talk at
@databricks
, covering Results & Take-Aways from training Flan-T5 and Flan-PaLM 🍮
📽️: (11m)
➡️ In short, how far can we get just with academic datasets (w/o ChatGPT distillation or human preferences)?
1/
Super appreciative of the recognition from
#NAACL2024
— our Pretrainer’s Guide won an 🌟Outstanding Paper Award🌟🏆
This was a year long analysis into pretraining age, quality & toxicity data filters.
Gratitude to our team 🙏🏼
@gyauney
@emilyrreif
@katherine1ee
@ada_rob
There are few options for evaluating multilingual question answering outside of English, and especially few for open domain question answering.
Yi Lu,
@jodaiber
, and I are excited to release an Open QA evaluation set, spanning 26 languages. [1/n]
Headed to 🛬🇦🇹 Vienna
#ICML2024
Reach out if you'd like to chat or catch up!
Work together w/ collaborators:
- A Safe Harbor for AI Evaluation ⛴️() -- Tuesday 10:30 am Oral
- On the Societal Impact of Open Foundation Models () --
Recorded the 1st part of my talk on the Data and Methods used in the Flan Collection 🍮 (accepted to
@icmlconf
Honolulu 🏝️)
📽️: (12m)
Thanks again to
@vagabondjack
and
@databricks
for inviting me!
1/
📢📜
#NLPaperAlert
🌟Active Learning over Multiple Domains in NLP🌟
In new NLP tasks, OOD unlabeled data sources can be useful. But which ones?
We try active learning ♻️, domain shift 🧲, and multi-domain sampling🔦 methods to see what works [1/]
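For readers new to active learning ♻️: the most common baseline is uncertainty sampling, i.e. label the unlabeled examples the model is least sure about. A minimal sketch (the model probabilities are hard-coded stand-ins, not a real classifier):

```python
import math

def entropy(probs):
    """Predictive entropy: higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(unlabeled, predict_proba, k=2):
    """Rank unlabeled examples by entropy; send the top-k for labeling."""
    ranked = sorted(unlabeled, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:k]

# Stand-in predicted class probabilities per example (illustrative only).
fake_probs = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.6, 0.4]}
picked = select_most_uncertain(["a", "b", "c"], fake_probs.get)
# "a" (a 50/50 prediction) ranks as most uncertain.
```

The multi-domain question in the paper is which *pool* to run this kind of selection over.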
🔭 How to reduce
#LLM
generation toxicity/bias?
I'm surprised this finding hasn't received any attention:
Instruction Tuning (e.g. Flan, T0) reduces toxic generations A LOT ✨ w/o any Human Feedback ✨.
➡️ I.e.
#ChatGPT
-esque Human values alignment w/o human feedback.
1/
Honored for the Data Provenance Initiative to be awarded the Infrastructure Grant Award, by
@mozilla
! 🎉🎉🎉
As part of this grant, we were invited to present at MozFest House Amsterdam, where we gave an early look at trends in the AI data supply chain:
📽️
Awesome round-up on synthetic data by
@natolambert
!
Related: we found a massive increase in synthetic data use in 2023, enabling training on more diverse tasks w/ longer outputs.
🔗:
The mega-post on synthetic data. I leave no stone unturned:
* What synthetic data is for LLMs
* Instructions vs preferences vs critiques
* Constitutional AI review & insight
* Mistral & other's use in pretraining
* OpenAI Superalignment
* Examples in the open
* other topics
This framework is very powerful.
Forget Langchain: if you want to create a pipeline of modules, I would try DSPy (e.g. chaining multi-hop retrieval w/ APIs, your own models, and prompting techniques)
🚨Announcing 𝗗𝗦𝗣𝘆, the framework for solving advanced tasks w/ LMs.
Express *any* pipeline as clean, Pythonic control flow.
Just ask DSPy to 𝗰𝗼𝗺𝗽𝗶𝗹𝗲 your modular code into auto-tuned chains of prompts or finetunes for GPT, Llama, and/or T5.🧵
📢 Excited to see our piece the "Data Provenance Initiative: A large-scale audit of dataset licensing and attribution in AI" now in:
📜
@Nature
Machine Intelligence ➡️
🗞️
@MIT
News ➡️
1/
Take a break from Twitter politics and checkout some new LLM responses!
🍰🌴 FLAN-PaLM 🍰🌴 Response of the Day
#1
A pretty nice example of Complex Verbal Reasoning
🔭 Use instruction tuning 🛠️ for better results 🎯 and reduced compute costs 🍃. A 🧵
Q: Cumulatively, what is costlier, finetuning (FT) or pretraining (PT)?
A: FT.
Why?
1⃣Top 50 NLP models account for 70% of all downloads (
@huggingface
)
2⃣50 PTs ➡️ 290M+ FTs (assuming 1 FT/download)
1/
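The back-of-envelope behind that answer, using only the figures in the tweet (1 finetune per download is the stated assumption):

```python
# From the thread: the top-50 models have 290M+ downloads,
# and account for ~70% of all model downloads.
top50_downloads = 290_000_000
top50_share = 0.70

total_downloads = top50_downloads / top50_share  # ~414M downloads overall
finetunes = total_downloads                      # assuming 1 FT per download

# Even though a single finetune is far cheaper than a pretraining run,
# ~400M finetunes vs ~50 pretrains makes FT the larger aggregate cost.
```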
🚨 Update: Super excited to work with
@barret_zoph
and
@_jasonwei
this summer as a
@GoogleBrain
Student Researcher.
Feeling fortunate to work with such accomplished ML/NLP researchers and an awesome team!
Some personal news.
Heading to Boston this fall to start my MS/PhD with Prof. Deb Roy at the
@medialab
@MIT
.
Excited to apply NLP to social + civic challenges — please reach out if you have research ideas/readings! [1/3]
➡️ It's promising these results don't use any RLHF data, or human "alignment", which is expensive to collect and less publicly available.
We hope this release supports the open source community, and improves instruction tuning methods and research!
7/
Recorded the final part of my talk at
@databricks
, covering Data Selection Trade-Offs for Instruction Tuning
📽️: (10m)
Trade-offs:
1⃣ Permissive vs Restrictive Licensing
2⃣ Traditional NLP performance vs creative generation
Are these correlated?
1/
Was a little nervous for this, but grateful to discuss our work with
@juleshyman
on
@YahooFinance
!
@RobertMahari
and I discuss emerging challenges for data creators and developers in the AI supply chain—part of the Data Provenance Initiative recent study:
The content used to train AI models used to be plentiful on the internet, but now the sources of that data are cracking down on who has access, according to a study.
@ShayneRedford
and
@RobertMahari
discuss how this could impact AI moving forward:
📢
#NLPaperAlert
🌟Active Learning Over Multiple Domains in Natural Language Tasks 🌟
To appear at Workshop on Distribution Shifts (DistShift) at
#NeurIPS2022
later this week!
📜:
This is honestly the hardest ML problem I’ve ever worked on. 👉🧵
ByteDance v OpenAI⚠️, LAION-5B CSAM☢️ & NYT v OpenAI🛑 illustrate rising lockdown + legal risk on data.
Need more informed training data selection?
🔗
Detailed licenses, terms, sources, properties.
📢 Come help us build it! All open sourced.
1/ 🧵
🚨 New paper + models!
Evaluating LLMs using closed-source LLMs has limited transparency, controllability, and affordability.
Incredible work by
@seungonekim
significantly improves all these factors, w/ open models for either relative or absolute response scoring. ⬇️
#NLProc
Introducing 🔥Prometheus 2, an open-source LM specialized on evaluating other language models.
✅Supports both direct assessment & pairwise ranking.
✅ Improved evaluation capabilities compared to its predecessor.
✅Can assess based on user-defined evaluation criteria.
The 🌟Data Provenance Initiative🌟 has added dozens of new math 🧮 dataset collections!
@ArielNLee
@manandey
@mhamdy_res
have added:
🦆Open Platypus (10)
🧑🔬MetaMathQA (8)
🦣Mammoth Math Instruct (13)
🔗
1/
Excited to see our 🍮Flan-Palm🌴 work finally published in
@JmlrOrg
2024!
Looking back, I see this work as pushing hard on scaling: post-training data, models, prompting, & eval.
We brought together the methods and findings of many awesome prior works, scaled them up, and
🌿Aya🌿 Dataset & mega-multilingual LLM is out!🚀
By far the most international & exciting collaboration I'm lucky to be part of.
@CohereForAI
meetings often spanned 12+ time zones
Amazing leadership by Madeline,
@mziizm
@ahmetustun89
@sarahookr
++
🔗:
Today, we’re launching Aya, a new open-source, massively multilingual LLM & dataset to help support under-represented languages. Aya outperforms existing open-source models and covers 101 different languages – more than double the number covered by previous models.
The latest wave of moves to
@OpenAI
@AnthropicAI
@CohereAI
etc from Big Tech seems to herald a new chapter to the so called AI/ML “academic brain drain”
Academia ➡️ Industry ➡️ Startup
In March, we were privileged to host
@_jasonwei
's guest lecture on “Emergence in Large Language Models”, for broader audiences, covering:
➡️ Intuitions, abilities and limitations
➡️ Scaling & trends
➡️ Emergence, Chain-of-thought, Flan
📽️:
1/
Thanks for having me!
Enjoyed talking to
@vagabondjack
and the awesome Databricks teams behind Dolly v2 about instruction tuning and open sourcing.
Will share out my slides and a video soon!
Happening now!
@ShayneRedford
speaking on LLM's as part of the Data Talks speaker series at Databricks. Will be tweeting insights and slides from our inspiring speaker as we go.
📢 The🌟Foundation Model Transparency Index🌟 scores 10 developers on 💯 indicators.
1⃣ All 10 score poorly, particularly Data, Labor, Compute
2⃣ Transparency is possible! 82/100 indicators are satisfied by ≥1 developer
3⃣ Transparency is a precondition for informed & responsible AI policy.
1/
If Flan-T5 XXL (11B) was too small, there’s a new and improved 20B Flan-UL2 open sourced, by
@YiTayML
!
The new best Open Source model on MMLU and Big Bench Hard
New open source Flan-UL2 20B checkpoints :)
- Truly open source 😎 No forms! 🤭 Apache license 🔥
- Best OS model on MMLU/Big-Bench hard 🤩
- Better than Flan-T5 XXL & competitive to Flan-PaLM 62B.
- Size ceiling of Flan family just got higher!
Blog:
🌟Data Selection🌟 for text has quickly become one of the most important choices for LLMs ✍️
@AlbalakAlon
comprehensively formalizes & surveys data filtering ✂️, mixing🥣, deduplication 🧑🤝🧑, & selection for a variety of criteria.
Loved contributing to the Multi-task/Instruction
{UCSB|AI2|UW|Stanford|MIT|UofT|Vector|Contextual AI} present a survey on🔎Data Selection for LLMs🔍
Training data is a closely guarded secret in industry🤫with this work we narrow the knowledge gap, advocating for open, responsible, collaborative progress
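Of the methods surveyed, exact deduplication is the simplest to sketch: drop documents whose normalized content hashes collide. A toy version (fuzzy methods like MinHash, which the survey also covers, catch near-duplicates this misses):

```python
import hashlib

def dedupe(documents):
    """Exact deduplication: keep the first document per content hash."""
    seen, kept = set(), []
    for doc in documents:
        # Light normalization so trivially different copies still collide.
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the cat sat.", "A different doc."]
dedupe(docs)  # keeps the first copy and the distinct doc
```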
#NLPaperAlert
Proud of our new work!
New findings, SOTA results, and, my personal favourite, 🌟 Open sourcing new models 🌟, significantly better than current T5s!
New open-source language model from Google AI: Flan-T5 🍮
Flan-T5 is instruction-finetuned on 1,800+ language tasks, leading to dramatically improved prompting and multi-step reasoning abilities.
Public models:
Paper:
@KreutzerJulia
et al. break down multilingual web data (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), by language representation, quality, and characteristics:
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
📜
10/
Is it just me or is the “AGI” misinfo really peaking lately.
Below: repurposes and misquotes another speculative mis-infographic.
And this tweet has massive engagement.
The 🌟Flan Collection🌟 (1st used in Flan-PaLM ):
➕ Merges Flan 2021, P3, NIv2, CoT instruction-datasets into 1800+ dataset collection
➕ Data augmentations and mixing strategies
➕ 100s new templates
2/
Excited to co-host the Instruction Tuning & Instruction Following (ITIF) Workshop at
#NeurIPS2023
.
👷🛠️🤖 An incredible line up of topics and speakers.
➡️ Submission opening soon:
Stay tuned!
Excited to announce the Workshop on 👷🛠️🤖 Instruction Tuning & Instruction Following 👷🛠️🤖 (ITIF) at
#NeurIPS2023
!
📅 Join us on Dec 15 in New Orleans
🛠️ Submit by Oct 1
👷 See speaker lineup
🔗
1/
📢 Want to automatically generate your bibtex for 1000s of
@huggingface
text datasets?
@chien_vu1692
just added this feature + data summaries for:
➡️ huge collections like Flan, P3, Aya...
➡️ popular OpenAI-generated datasets
➡️ ~2.5k+ datasets & growing
🔗:
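The idea is simple: render each dataset's metadata into a citation entry. A hypothetical, minimal version (the function, field names, and example values are all invented for illustration, not the actual feature's code):

```python
def to_bibtex(key, title, authors, year, url):
    """Render dataset metadata as a minimal BibTeX @misc entry."""
    return (
        f"@misc{{{key},\n"
        f"  title  = {{{title}}},\n"
        f"  author = {{{authors}}},\n"
        f"  year   = {{{year}}},\n"
        f"  url    = {{{url}}}\n"
        f"}}"
    )

entry = to_bibtex(
    "flan2022", "The Flan Collection",
    "Longpre, Shayne and others", 2023, "https://example.com/flan",
)
```

Doing this once per dataset across a collection like Flan or P3 is what makes bulk citation generation useful.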
Today we’re releasing a new collection of tasks, templates and methods for instruction tuning of
#ML
models. Training on this collection can enable language models to reason more competently over arbitrary, unseen tasks. Learn all about it at:
Looking for permissively licensed biomedical🧬text data?
@mhamdy_res
has added several data collections to the 🌟Data Provenance Initiative🌟:
➡️ MedInstruct
➡️ ChatDoctor
➡️ Medical Meadow
➡️ PMC-LLaMA Instructions
🔗
1/
Funny how we got it all flipped.
AI in Science Fiction 📚🤖: highly analytical/factual, but struggles with humor and emotion.
(eg Data from Star Trek)
AI now 💬: great humor and creativity, but logically inconsistent/factually negligent
(eg
#ChatGPT
)
Context: A Crisis in Data Transparency
➡️Instruct/align finetuning often compiles 100s of datasets
➡️How can devs filter for datasets without legal/ethical risk, and understand the resulting data composition?
2/
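A toy sketch of the kind of filter such provenance annotations enable; the records and field names here are invented for illustration:

```python
# Each audited dataset carries license metadata; developers can then
# filter a candidate finetuning mixture down to acceptable-risk sources.
datasets = [
    {"name": "dataset_a", "license": "apache-2.0",  "commercial": True},
    {"name": "dataset_b", "license": "cc-by-nc-4.0", "commercial": False},
    {"name": "dataset_c", "license": "mit",          "commercial": True},
]

commercially_usable = [d["name"] for d in datasets if d["commercial"]]
# -> ["dataset_a", "dataset_c"]
```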
Q: But why are the results strong?
Our breakdown of the Flan Collection shows *why* it works. The most important methods:
🌟Finding 1🌟 Fine-tuning on zero-shot and few-shot prompts together significantly improves both settings (not a trade-off)!
4/
🚨 New
#ICML2024
position piece.
The most overlooked risks of AI stem from autonomous weaponry
For 4 reasons:
1⃣ Arms race w/ ⬇️ human oversight
2⃣ Reduces cost of starting conflicts
3⃣ Evades accountability
4⃣ Battlefield errors aren’t considered costly
See our work led by
@suchenzang
Respectfully, this feels a little unfair.
Three awesome, nuanced threads (by Aran, Yizhong, and the new paper Yi cites, linked below) all suggest different training data, architecture, and vocab each lead to different advantages, without any **one size fits all**.
T0, Flan-T5,
Just found out "MKQA" (our 26 language QA dataset) is already on
@huggingface
Datasets -- even before I got around to it this holiday break!
Check it out, and thanks
@ceyda_cinarel
for adding it!
In Honolulu for
#ICML2023
23-29 🏝️🌺
Msg me if you'd like to meet/chat!
Presenting the Flan Collection🥮 (Wedn 11-12:30), and a new initiative on Data Provenance at the Workshop, Saturday!
🌟Finding 3 🌟 The Flan-T5 model converges higher and faster than T5 on single-task fine-tuning.
➡️ Recommendation: Use Flan-T5 as your base model for new tasks.
✅Better computational-efficiency and performance!
6/
📢 We're planning to rapidly expand the coverage of finetuning datasets in the Data Provenance Initiative.
➡️Our Goal: provide detailed source+license annotations for all popular instruct/align datasets.
🌟If you'd like to contribute, reach out!🌟
1/
This yields the best-performing instruction tuning collection yet compiled and released in one repo.
See our survey Figure of the prior works we built on to produce this compilation.
3/
📢 Check out
@_anthonychen
and my invited talk at the
@USC_ISI
Natural Language Seminar:
📜 "The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI"
Thank you
@HJCH0
for hosting!
I've found myself referring to this table in the appendix of our recent paper () A LOT so I thought I'd point it out for those who might benefit from it. Basic upshot is that ED:MLM works really well, especially for classification tasks, even in 0shot.
Really excited to see our Pretrainer's Guide work get a shoutout in this amazing new Dataset!
"Quality" and "Toxicity" filters for pretraining data can have a massive impact and is important to get right.
We are excited to release RedPajama-Data-v2: 30 trillion filtered & de-duplicated tokens from 84 CommonCrawl dumps, 25x larger than our first dataset.
It exposes a diverse range of quality annotations so you can slice & weight the data for LLM training.
QA models are surprisingly accurate on partial, OOD inputs. Yi,
@chrisdubois
, and I ask what factors affect this? Perplexingly we find this phenomenon invariant to random seed, architecture, pretraining, even training domain!
See our short preprint [1/5]
Question: What common benchmarks should all Instruct/Chat-LLMs be evaluated on?
(We're considering a special track for
@itif_workshop
#NeurIPS2023
to promote more fair comparison.)
1/
big breaking news: LAION just removed its datasets, following a study from Stanford that found thousands of instances of suspected child sexual abuse material
Our goals:
➡️ Lower barriers to newer community members
➡️ Bring awareness to thoughtful tools & good practices, not just speed of development
➡️ Iterate a live, evolving resource (🌟you can contribute!🌟)
2/
📢
#NLPaperAlert
Incredible work led by
@seungonekim
@jshin491
& Yejin Cho!
➡️ Ranking responses is subjective
➡️ Do you prefer short vs long, formal vs informal, creative vs precise responses?
➡️ 🔥Prometheus 🔥➡️Open LM trained to respect custom criteria
Check it out!
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended!
* Could generalize to customized score rubrics
* Shows high correlation with both human evaluators & GPT-4 evaluation
Fantastic work, w/ a new, thoughtful, task-diverse, fine-grained eval benchmark, BigGen Bench.
+ Eval scaling laws
+ Task difficulty rankings
+ Task metric +/- from base ➡️ chat models
+ A trove of other insights!
Hard to believe
@seungonekim
is an *incoming* PhD!
🤔How can we systematically assess an LM's proficiency in a specific capability without using summary measures like helpfulness or simple proxy tasks like multiple-choice QA?
Introducing the ✨BiGGen Bench, a benchmark that directly evaluates nine core capabilities of LMs.
🌟Finding 2🌟 Input inversion and data source balancing (as proposed and corroborated by MetaICL, T0, OPT-IML and others...) are incredibly important for successful instruction tuning.
See our ablations Table.
5/
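Input inversion is easy to illustrate: flip each (input → output) pair so the model also trains on the reverse direction. A toy sketch covering just the inversion half of the finding (the template wording and example are made up):

```python
def invert_examples(pairs):
    """Augment (question -> answer) pairs with inverted
    (answer -> question) versions of the same data."""
    inverted = [
        (f"Write a question whose answer is: {answer}", question)
        for question, answer in pairs
    ]
    return pairs + inverted

data = [("What is the capital of France?", "Paris")]
augmented = invert_examples(data)  # doubles the pool: original + inverted
```

Source balancing, the other half, is about mixing proportions across datasets rather than per-example templates.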
Lastly, this research was the result of incredible dedication from a team of international collaborators, from undergrads to professors, designers, social scientists, AI developers, and independent researchers, all on a volunteer basis.
A heartfelt thank you to our awesome team
Proud of our 1st EMNLP Findings paper!! 🎉🎉 w/
@LeBrave2016
and
@chrisdubois
Q: How effective are common NLP data augmentations with pre-trained transformers?
A: Oftentimes, not very. (Negative result!) [1/4]