Hailey Schoelkopf

@haileysch__

3,821 Followers
973 Following
24 Media
695 Statuses

she/her | research scientist @aiEleuther | lm training + evals | LM Evaluation Harness maintainer

Boston
Joined June 2022
@haileysch__
Hailey Schoelkopf
10 months
work in ML, they said. it’ll be fun, they said. Now I’m reading about the Based architecture and its HellaSwag score
31
47
1K
@haileysch__
Hailey Schoelkopf
1 year
Excited to announce our paper Pythia: A Suite for Analyzing LLMs across Training and Scaling has been accepted as an Oral paper at #ICML2023 !
Tweet media one
12
65
452
@haileysch__
Hailey Schoelkopf
4 months
My favorite bit in this paper: @bbrabbasi and I wrote an appendix formalizing what is done when evaluating models with loglikelihood multiple-choice and perplexity evals. afaik, none of this has been written up in one place before; in most papers it's just been tacitly assumed!
Tweet media one
@AiEleuther
EleutherAI
4 months
Excited to share our new paper, Lessons From The Trenches on Reproducible Evaluation of Language Models! In it, we discuss common challenges we’ve faced evaluating LMs, and how our library the Evaluation Harness is designed to mitigate them 🧵
Tweet media one
4
69
241
8
35
246
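A minimal sketch of loglikelihood-based multiple-choice scoring in the spirit of what the appendix formalizes; this is an illustration, not the paper's or lm-eval's code, and the model, prompt, and choices are arbitrary examples.

```python
# Illustrative loglikelihood multiple-choice scoring: score each answer choice by the
# sum of its token log-probabilities given the context, then take the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_loglikelihood(context: str, continuation: str) -> float:
    """Sum of log p(continuation token | context + preceding continuation tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # [1, seq, vocab]
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # the token at position i is predicted by the logits at position i - 1
    # (lm-eval handles tokenization boundaries more carefully than this sketch)
    for i in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

question = "Q: What color is the sky on a clear day?\nA:"
choices = [" blue", " green", " a sandwich"]
scores = [continuation_loglikelihood(question, c) for c in choices]
# "acc" takes the argmax of the raw loglikelihood; lm-eval's "acc_norm" additionally
# normalizes by continuation length before the argmax.
pred = max(range(len(choices)), key=lambda i: scores[i])
print(dict(zip(choices, scores)), "->", choices[pred])
```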
@haileysch__
Hailey Schoelkopf
1 year
Beyond excited for our work on Llemma to finally be public!! We trained very strong general math LMs (7B > Minerva 8B, 34B ~= Minerva 62B) and released them + training, eval, analysis code! Can’t wait to see the math+AI field pick these up for future developments in the open.
@zhangir_azerbay
Zhangir Azerbayev
1 year
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: Paper: More ⬇️
Tweet media one
11
126
549
3
28
204
@haileysch__
Hailey Schoelkopf
11 months
This proposal suggests a global moratorium on training runs > 10^24 FLOPs—approximately the compute used to train Llama 2-70B, and 100x lower than the EO reporting threshold. Just astoundingly absurd
@_andreamiotti
Andrea Miotti
11 months
The AI Summit consensus is clear: it's time for international measures. Here is a concrete proposal. In our recent paper, @jasonhausenloy , Claire Dennis and I propose an international institution to address extinction risk from AI: MAGIC, a Multinational AGI Consortium.
Tweet media one
103
21
112
13
13
178
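The "~10^24 FLOPs ≈ Llama 2-70B" figure can be sanity-checked with the standard C ≈ 6·N·D approximation, using the publicly reported parameter and token counts; the numbers below are that rough estimate, not an official accounting.

```python
# Rough check of the "~1e24 FLOPs ~= Llama 2-70B" figure using C ~= 6 * N * D
# (N = parameters, D = training tokens).
N = 70e9      # Llama 2-70B parameters
D = 2.0e12    # ~2T training tokens (as reported)
C = 6 * N * D
print(f"{C:.2e} training FLOPs")   # ~8.4e23, i.e. a bit under 1e24
# The executive order's reporting threshold is 1e26 FLOPs, ~100x above 1e24.
```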
@haileysch__
Hailey Schoelkopf
7 months
so, it turns out i am the top solo user for downloads on HF (thanks solely to my lm-eval MMLU mirror! 😅) now is probably a good time to express how grateful i am for the users, contributors, and community around the LM Evaluation Harness! :’)
Tweet media one
6
18
174
@haileysch__
Hailey Schoelkopf
10 months
so that I can use it in Oogabooga
3
0
145
@haileysch__
Hailey Schoelkopf
1 year
We’ve released the paper for Pythia, a set of LLMs designed to facilitate scientific study on LLMs and their training data! So excited to have this out finally! Read more here:
@BlancheMinerva
Stella Biderman
1 year
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening and so designed a model suite that centers enabling scientific research as its primary goal
11
183
888
4
17
124
@haileysch__
Hailey Schoelkopf
5 months
what’s the current SOTA for KV cache compression? what are some must-read papers on this topic?
12
9
127
@haileysch__
Hailey Schoelkopf
7 months
👀it's always incredible to me just how ubiquitous and clear the induction head bump is
Tweet media one
5
6
118
@haileysch__
Hailey Schoelkopf
1 year
Presenting Pythia at board # 609 at 10:30 am today! Come by to talk LLM training, enabling interp + novel data effect studies, open science, and more!
Tweet media one
4
14
106
@haileysch__
Hailey Schoelkopf
20 days
where were you when GPQA was killed
7
6
105
@haileysch__
Hailey Schoelkopf
7 months
the parallel generation trick in Quiet-STaR is really neat and I wish I’d thought of it
Tweet media one
3
4
99
@haileysch__
Hailey Schoelkopf
1 year
Hey that’s me! :) Had a great time talking with @MichaelTrazzi about our team’s work on the Pythia LLMs while at #ICML2023 .
2
6
100
@haileysch__
Hailey Schoelkopf
1 year
I'm glad Deepspeed maintainers are as confused as I am by the library internals
Tweet media one
2
5
92
@haileysch__
Hailey Schoelkopf
2 years
Come check out what we’ve been working on recently: a set of LLM checkpoints over time (16 models x 143 checkpoints each) with consistent data ordering across model scales! available now
@AiEleuther
EleutherAI
2 years
What do LLMs learn over the course of training? How do these patterns change as you scale? To help answer these questions, we are releasing Pythia, a suite of LLMs + checkpoints specifically designed for research on interpretability and training dynamics!
4
87
473
4
14
93
@haileysch__
Hailey Schoelkopf
5 months
MLA TL;DR(?):
- low-rank projs for Q, K, V; k+v share down-proj(!)
- no pos. enc.
- for RoPE compat.: stack this with MQA that uses a smaller head dim, RoPE on the MQA
intuition for why this works: like only applying RoPE to 25% of head dim in MHA?
@deepseek_ai
DeepSeek
5 months
DeepSeek-V2 is a strong, economical, and efficient MoE language model, enhanced with exceptional architectural designs in attention mechanisms and sparse layers: 🌟 MLA (Multi-head Latent Attention): a better and faster attention that ensures efficient inference via reducing KV
Tweet media one
Tweet media two
Tweet media three
5
19
163
3
9
94
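A minimal PyTorch sketch of the ideas in the TL;DR above: low-rank Q and KV projections with a shared KV down-projection (no position encoding on that path), plus a small MQA-style RoPE branch concatenated onto each head. All shapes, ranks, and names (`MLASketch`, `rope`) are illustrative, not DeepSeek-V2's actual configuration; the causal mask and KV caching are omitted for brevity.

```python
import torch
import torch.nn as nn

def rope(x):
    # x: [batch, heads, seq, dim], dim even; standard rotary embedding
    b, h, s, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = torch.arange(s)[:, None] * freqs[None, :]     # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLASketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, head_dim=32, kv_rank=64, q_rank=96, rope_dim=16):
        super().__init__()
        self.h, self.hd, self.rd = n_heads, head_dim, rope_dim
        self.q_down = nn.Linear(d_model, q_rank, bias=False)
        self.q_up = nn.Linear(q_rank, n_heads * head_dim, bias=False)
        self.q_rope = nn.Linear(q_rank, n_heads * rope_dim, bias=False)
        self.kv_down = nn.Linear(d_model, kv_rank, bias=False)   # K and V share this down-proj
        self.k_up = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.k_rope = nn.Linear(d_model, rope_dim, bias=False)   # MQA-style: one shared RoPE key head
        self.out = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        ql = self.q_down(x)
        q = self.q_up(ql).view(b, s, self.h, self.hd).transpose(1, 2)          # no pos. enc.
        qr = rope(self.q_rope(ql).view(b, s, self.h, self.rd).transpose(1, 2))
        kv = self.kv_down(x)                  # at inference, this small latent is what gets cached
        k = self.k_up(kv).view(b, s, self.h, self.hd).transpose(1, 2)
        v = self.v_up(kv).view(b, s, self.h, self.hd).transpose(1, 2)
        kr = rope(self.k_rope(x).view(b, s, 1, self.rd).transpose(1, 2)).expand(b, self.h, s, self.rd)
        q_full = torch.cat([q, qr], dim=-1)   # small RoPE'd slice stacked onto each NoPE head
        k_full = torch.cat([k, kr], dim=-1)
        attn = torch.softmax(q_full @ k_full.transpose(-1, -2) / (self.hd + self.rd) ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, s, -1))

x = torch.randn(2, 10, 256)
print(MLASketch()(x).shape)   # torch.Size([2, 10, 256])
```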
@haileysch__
Hailey Schoelkopf
1 year
I used to really be gunning for prefixLM / UL2 but I'm no longer sold on it. Maybe it wins out by a couple points on MMLU, but it's a lot more hparams to tweak and more expensive for generation / multi-turn settings. Causal is cheap, good, easy. This looks like cool work though!
@_akhaliq
AK
1 year
CausalLM is not optimal for in-context learning paper page: Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each
Tweet media one
7
86
415
2
8
94
@haileysch__
Hailey Schoelkopf
9 months
Support for Mamba has landed in lm-evaluation-harness! Use it with `--model mamba_ssm` : Was really happy to see @_albertgu @tri_dao provide support for our new release natively alongside their architecture code, to benchmark against Pythia reproducibly!
Tweet media one
3
5
81
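The tweet confirms the CLI flag `--model mamba_ssm`; below is a hedged sketch of the rough equivalent through lm-eval's Python API (lm-evaluation-harness ≥ 0.4). The checkpoint and task names are just examples, and the `mamba_ssm` backend requires the `mamba_ssm` package to be installed.

```python
# Illustrative only: running a Mamba checkpoint through lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="mamba_ssm",                               # same model type as the CLI flag
    model_args="pretrained=state-spaces/mamba-1.4b", # example checkpoint
    tasks=["lambada_openai"],                        # example task
)
print(results["results"])
```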
@haileysch__
Hailey Schoelkopf
1 year
If your LM training company finds that evals are broken (surprise!) and decides to develop its own higher-quality test sets—consider open-sourcing them! i will help you.
8
6
80
@haileysch__
Hailey Schoelkopf
5 months
@tamaybes a super-fun arcane historical detail: Gopher (and by extension Chinchilla) use Transformer-XL style position encodings. This means they spend 20B params (Gopher) and 5B params (Chinchilla) on just rel. position encoding!
1
6
80
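Rough arithmetic consistent with the 20B / 5B figures above, under the assumption that Transformer-XL-style relative attention adds roughly one d_model × d_model projection per layer for relative-position keys; the exact accounting in Gopher/Chinchilla may differ.

```python
# Back-of-the-envelope: relative-position-encoding parameters under the stated assumption.
for name, n_layers, d_model in [("Gopher 280B", 80, 16384), ("Chinchilla 70B", 80, 8192)]:
    rel_pos_params = n_layers * d_model * d_model
    print(f"{name}: ~{rel_pos_params / 1e9:.1f}B params on relative position encoding")
# Gopher: ~21.5B, Chinchilla: ~5.4B -- in line with the ~20B / ~5B figures
```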
@haileysch__
Hailey Schoelkopf
10 months
Giving a talk tomorrow 12pm EST at @gordic_aleksa 's server! Come to hear me chat about Pythia, Llemma, LM-Eval-Harness, and producing infrastructure for future OSS research exploration!
@gordic_aleksa
Aleksa Gordić 🍿🤖
10 months
Hailey Schoelkopf ( @haileysch__ ) is giving a talk tomorrow in my server! :) Hailey is/was part of many impactful projects such as BLOOM, Pythia, she's also core dev on @AiEleuther 's lm-evaluation-harness, etc talk title: pythia - research infra for LLMs discord:
Tweet media one
0
2
27
1
7
71
@haileysch__
Hailey Schoelkopf
2 months
reread the whole gopher paper and there are some really great "precursor" results (including negative results), predating some of the more recent training trends, in their appendices that I'd somehow forgotten?
2
6
71
@haileysch__
Hailey Schoelkopf
1 year
banning LMs trained on > 1T tokens is a surefire way to make sure byte-level never makes a comeback
@norabelrose
Nora Belrose
1 year
I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.
Tweet media one
56
37
406
7
2
70
@haileysch__
Hailey Schoelkopf
1 year
There’s a large literature on memorization in LLMs—but how can we move this theory into practice? We show how LLM trainers can treat memorization as a first-class concern in training, and put forward the first attempts at predicting precise behavior before training a model!
@BlancheMinerva
Stella Biderman
1 year
Not all memorization in LLMs is created equal. Some strings are bad to memorize, but most are quite benign. In our new paper we ask the questions "can we tell ahead of time which text will be memorized by a LLM?" and "can we change our mindset to make the answer yes?" A 🧵
Tweet media one
11
101
568
2
9
67
@haileysch__
Hailey Schoelkopf
7 months
the existence of Based and Mamba implies in a few months someone will put out an arch we have to call “Based Mamba” until the end of time
12
0
68
@haileysch__
Hailey Schoelkopf
1 year
I'll be at #ICML2023 next week! I'd love to chat about LLM training + infra, interp, evaluation, and tracing model behavior back to the training data! DMs are open :)
2
4
68
@haileysch__
Hailey Schoelkopf
10 months
Worth noting that GPT-4 was trained on a majority of the GSM8k training set intentionally. We don’t have a full understanding of synthetic data and how it passes down contamination, but good to remember when using GPT-4 or GPT-3.5 data for finetuning.
@arankomatsuzaki
Aran Komatsuzaki
10 months
TinyGSM: achieving >80% on GSM8k with small language models A duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger
Tweet media one
1
42
183
3
4
66
@haileysch__
Hailey Schoelkopf
3 months
native int8 training confirmed! really excited to see Character starting to share more technical details
@NoamShazeer
Noam Shazeer
3 months
Character AI is serving 20,000 QPS. Here are the technologies we use to serve hyper-efficiently. [ ]
31
199
1K
2
5
64
@haileysch__
Hailey Schoelkopf
4 months
So excited for this to be released! We’ve written a paper describing what we’ve learned from building the Eval Harness, including general advice on evaluation best practices for LMs.
@AiEleuther
EleutherAI
4 months
Excited to share our new paper, Lessons From The Trenches on Reproducible Evaluation of Language Models! In it, we discuss common challenges we’ve faced evaluating LMs, and how our library the Evaluation Harness is designed to mitigate them 🧵
Tweet media one
4
69
241
2
4
62
@haileysch__
Hailey Schoelkopf
2 years
this is a point really worth hammering home—chinchilla laws don’t matter where people think they do. if you try to train a 1B param LM on only 20B tokens you *are* going to have a bad time!
@BlancheMinerva
Stella Biderman
2 years
@MetaAI @GuillaumeLample Another thing I anticipate being a massive source of confusion: Their smaller models are massively overtrained. The fact that their 13B model meets or exceeds GPT-3 in performance is NOT contradictory to anything in the Chinchilla paper, because it's not compute-optimally trained
5
9
94
1
1
60
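The "1B params on only 20B tokens" number is exactly the Chinchilla rule of thumb D ≈ 20·N: compute-optimal for that budget, but small in absolute terms. A tiny worked calculation under that rule of thumb:

```python
# Chinchilla rule of thumb: D ~= 20 * N tokens for a compute-optimal run.
N = 1e9            # params
D = 20 * N         # ~2e10 tokens = 20B
C = 6 * N * D      # ~1.2e20 training FLOPs
print(f"D = {D:.0e} tokens, C = {C:.1e} FLOPs")
# For comparison, LLaMA-13B was trained on ~1T tokens, roughly 4x its Chinchilla-optimal
# budget -- "overtrained" by this metric, but much better at fixed inference cost.
```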
@haileysch__
Hailey Schoelkopf
9 months
Anthropic: “Users of a large language model may not know about hidden backdoors in the model if they lack access to a model’s parameters or a full understanding of its training process and dataset”
4
4
61
@haileysch__
Hailey Schoelkopf
6 months
MoD(E) looks like great work and I’m really glad it was published! but saving 50% of flops for (small batch) inference may not be a straight 2x speedup. Early-exit a la CALM doesn’t give reliable speedups for bs>1: if any tokens don’t exit early, you have to wait on them
@arankomatsuzaki
Aran Komatsuzaki
6 months
Google presents Mixture-of-Depths Dynamically allocating compute in transformer-based language models Same performance w/ a fraction of the FLOPs per forward pass
Tweet media one
6
90
611
9
1
59
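A toy illustration of the point above: with per-token early exit, a batched decoding step is only as fast as its deepest token, so average-layer savings don't translate directly into wall-clock speedup. The numbers here are made up.

```python
import random

random.seed(0)
n_layers, batch_size, steps = 32, 16, 1000

avg_batch_depth = 0.0
for _ in range(steps):
    exits = [random.randint(1, n_layers) for _ in range(batch_size)]  # exit layer per token
    avg_batch_depth += max(exits) / steps   # the whole batch waits for the deepest token

print(f"mean exit depth per token: ~{n_layers / 2:.0f}/{n_layers} layers (a '2x' FLOP saving)")
print(f"mean batch step depth:     ~{avg_batch_depth:.1f}/{n_layers} layers (barely any speedup)")
```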
@haileysch__
Hailey Schoelkopf
6 months
breaking my "agents" silence: this interface design requires:
- being realistic about current LLMs' limitations (> 100 LoC confuses them!)
- making tasks easier for LLMs by making them closer to what they've seen often in pretraining (see: Embers of Autoregression)
fascinating
@jyangballin
John Yang
6 months
Simply connecting an LM to a vanilla bash terminal does not work well. Our key insight is that LMs require carefully designed agent-computer interfaces (similar to how humans like good UI design) E.g. When the LM messes up indentation, our editor prevents it and gives feedback
Tweet media one
2
33
260
1
2
59
@haileysch__
Hailey Schoelkopf
3 months
I will die on the “Mistral-7B did not secretly use prefixLM” hill until or unless it’s explicitly confirmed otherwise
5
1
57
@haileysch__
Hailey Schoelkopf
5 months
@tamaybes but current models don’t allocate parameters to rotary embs! this means the Chinchilla D=20*N is skewed already for the actual param counts of most models, even if it held across datasets! If we disregarded the pos. encoding params the coefficients would change
2
2
56
@haileysch__
Hailey Schoelkopf
1 year
a machine will never understand commutativity
@OwainEvans_UK
Owain Evans
1 year
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
175
707
4K
3
6
55
@haileysch__
Hailey Schoelkopf
7 months
even if you’re not using subquadratic architectures for your frontier training run, you should totally be using them for speculative decoding… pic unrelated
Tweet media one
1
6
56
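For context on how a cheap draft model (subquadratic or otherwise) slots into decoding: a toy sketch of greedy speculative decoding, where the draft proposes a block of tokens and the target only has to verify them. Both "models" below are stand-in functions over integer token sequences; the loop structure is the point, not the models.

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int], gamma: int = 4, max_new: int = 20) -> List[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # 1) the cheap draft model proposes `gamma` tokens greedily
        proposal = []
        for _ in range(gamma):
            proposal.append(draft(seq + proposal))
        # 2) the target verifies; in a real model this is one batched forward pass over
        #    seq + proposal, here we just query the toy function prefix by prefix
        accepted = []
        for i, tok in enumerate(proposal):
            target_tok = target(seq + proposal[:i])
            if tok == target_tok:
                accepted.append(tok)           # draft agreed with target: accepted for free
            else:
                accepted.append(target_tok)    # first disagreement: take target's token, stop
                break
        else:
            accepted.append(target(seq + proposal))   # all accepted: bonus target token
        seq += accepted
    return seq[:len(prompt) + max_new]

# toy models: the target counts 0..9 cyclically; the draft matches it ~90% of the time
import random
random.seed(0)
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: (s[-1] + 1) % 10 if random.random() < 0.9 else 7
print(speculative_decode(target, draft, prompt=[0]))   # identical output to greedy target decoding
```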
@haileysch__
Hailey Schoelkopf
1 year
@ilex_ulmus @BlancheMinerva @OntolMQ 2 trillion tokens is not a lot, and the Falcon team (models released by the UAE) was easily able to put 5 trillion tokens together with few engineers. There are no insurmountable barriers in current LM training for even moderately well-funded entities, let alone state actors.
2
2
52
@haileysch__
Hailey Schoelkopf
4 months
Excited for this new collab with @RylanSchaeffer to be out! Why are downstream evals harder to predict with scale than pretraining loss? for loglikelihood-based MCQA, we find an explanation!
@RylanSchaeffer
Rylan Schaeffer
4 months
❤️‍🔥❤️‍🔥Excited to share our new paper ❤️‍🔥❤️‍🔥 **Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?** w/ @haileysch__ @BrandoHablando @gabemukobi @varunrmadan @herbiebradley @ai_phd @BlancheMinerva @sanmikoyejo 1/N
Tweet media one
8
54
262
1
5
52
@haileysch__
Hailey Schoelkopf
8 months
Some really cool inventive “wacky” techniques from the OpenBMB team (annealing data contents in training + instruct data in cooldown! atypical infinite LR schedules! scaling embedding by… 12!) it’s clear we haven’t even begun to exhaust the possibilities of LLM training.
@OpenBMB
OpenBMB
8 months
MiniCPM Blog : MiniCPM: Unveiling the Potential of End-side Large Language Models. #MiniCPM -2B: An end-side LLM outperforms Llama2-13B @huggingface @_akhaliq @Xianbao_QIAN
4
17
69
4
8
51
@haileysch__
Hailey Schoelkopf
3 months
I'll be at @icmlconf all week next week! Especially interested in chatting about:
- evaluations: infra, missing evals, what's next
- systems optimizations for distributed training+inference
- predictable scaling (scaling laws, eval forecasting...)
DM if you'd like to chat! :)
0
3
50
@haileysch__
Hailey Schoelkopf
1 year
it’s really incredible to have more openly available datasets, congrats to @soldni @allen_ai for the heroic effort on this release! this also marks (to my knowledge) only the third LLM pretraining corpus to provide a datasheet :) very glad that list is longer than last year’s!
@soldni
Luca Soldaini 🎀
1 year
Announcing Dolma, the dataset for @allen_ai 's LLM, OLMo. It's 3+ trillion tokens (web/papers/code/books/wiki). We hope it will facilitate study of LLMs & their behavior! Released on @huggingface w ImpACT license Overview/datasheet
Tweet media one
21
144
558
2
7
50
@haileysch__
Hailey Schoelkopf
1 year
Belated announcement: honored to have presented at @MITFutureTech workshop on AI Scaling! I talked about how we scale LLMs today, and how it might look in a few years.
@MITFutureTech
MIT FutureTech
1 year
Hailey Schoelkopf @haileysch__ from @aiEleuther discussing the future of LLMs. #MITAIScaling @MITFutureTech
Tweet media one
0
0
8
1
6
49
@haileysch__
Hailey Schoelkopf
16 days
one must imagine paris hilton getting 80% MFU
Tweet media one
@mrdrozdov
Andrew Drozdov
18 days
stop 🤯
Tweet media one
0
2
23
4
0
48
@haileysch__
Hailey Schoelkopf
4 months
the real move would be to A/B test Golden Gate Claude versus a sysprompt
@AnthropicAI
Anthropic
4 months
This week, we showed how altering internal "features" in our AI, Claude, could change its behavior. We found a feature that can make Claude focus intensely on the Golden Gate Bridge. Now, for a limited time, you can chat with Golden Gate Claude:
Tweet media one
110
264
2K
3
3
48
@haileysch__
Hailey Schoelkopf
1 year
We wrote a blog post on the napkin math that goes into training LLMs at scale! Check it out here:
@AiEleuther
EleutherAI
1 year
The most common question we get about our models is "will X fit on Y GPU?" This, and many more questions about training and inferring with LLMs, can be answered with some relatively easy math. By @QuentinAnthon15 , @BlancheMinerva , and @haileysch__
12
102
509
1
2
46
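The blog's "will X fit on Y GPU?" question largely comes down to per-parameter byte accounting. A hedged sketch using the usual rules of thumb (2 bytes/param for bf16 weights at inference; roughly 16 bytes/param for weights + gradients + Adam optimizer states in mixed-precision training, excluding activations and framework overhead):

```python
def inference_gib(n_params, bytes_per_param=2):    # bf16/fp16 weights only
    return n_params * bytes_per_param / 2**30

def training_gib(n_params, bytes_per_param=16):    # mixed-precision Adam, no activations
    return n_params * bytes_per_param / 2**30

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params: ~{inference_gib(n):.0f} GiB to serve, "
          f"~{training_gib(n):.0f} GiB of training state before activations")
# 7B: ~13 GiB serve / ~104 GiB train state; 70B: ~130 GiB serve / ~1043 GiB train state
```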
@haileysch__
Hailey Schoelkopf
2 months
really cool in-depth exploration of prompt sensitivity for llm evals! wish I’d known about this paper earlier
@m_ryabinin
Max Ryabinin
2 months
In our new #ACL2024 paper, we show that LLMs remain sensitive to prompt formats even with improved few-shot techniques. Our findings suggest that careful evaluation needs to take this lack of robustness into account 📜: 🖥️:
Tweet media one
2
12
64
2
2
47
@haileysch__
Hailey Schoelkopf
3 months
New survey on "meta-generation" algorithms for generating text from LLMs with greater inference compute. Check out section 7 for a discussion of how methods to speed up generation interact with these complex algorithms, from Best-of-N to beam search to MCTS and beyond!
@wellecks
Sean Welleck
3 months
What do nucleus sampling, tree-of-thought, and PagedAttention have in common? They're all part of our new survey: "From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models"
Tweet media one
10
115
546
2
5
45
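Of the meta-generation algorithms the survey covers, best-of-N is the simplest: sample N candidates, score each with some reward or verifier, keep the best. A toy sketch with placeholder generator and scorer (both made up for illustration):

```python
import random
from typing import Callable, List

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int = 8) -> str:
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)

random.seed(0)
generate = lambda: " ".join(random.choices(["the", "cat", "sat", "on", "mat"], k=5))
score = lambda text: text.count("cat") + text.count("mat")   # toy "reward model"
print(best_of_n(generate, score, n=16))
```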
@haileysch__
Hailey Schoelkopf
7 months
I have developed a truly marvelous scaffolded LLM agent, a proof of which this margin is too narrow to contain…
0
3
45
@haileysch__
Hailey Schoelkopf
11 months
If there's anything less glamorous yet high-impact in ML than looking at the data, it's doing due diligence on licensing. This is incredible work!
@ShayneRedford
Shayne Longpre
11 months
📢Announcing the🌟Data Provenance Initiative🌟 🧭A rigorous public audit of 1800+ instruct/align datasets 🔍Explore/filter sources, creators & license conditions ⚠️We see a rising divide between commercially open v closed licensed data 🌐: 1/
10
148
463
0
10
45
@haileysch__
Hailey Schoelkopf
1 year
full of joy post-ICML meeting old and new friends :) If we missed each other this week, definitely reach out!
Tweet media one
1
1
44
@haileysch__
Hailey Schoelkopf
2 months
Our "Challenges in LM Evaluation" ICML24 tutorial slides are now public! Thanks for the feedback from everyone who attended :)
@AiEleuther
EleutherAI
2 months
We were very happy with the reception to our researchers @lintangsutawika and @haileysch__ 's ICML tutorial, "Challenges in LM Evaluation", this past week! For all those who requested it, the slides are now available at . Enjoy!
1
12
36
0
2
41
@haileysch__
Hailey Schoelkopf
1 month
@WenhuChen this should be an artifact of the induction head bump which shows up in most transformer loss curves!
@haileysch__
Hailey Schoelkopf
7 months
👀it's always incredible to me just how ubiquitous and clear the induction head bump is
Tweet media one
5
6
118
2
0
42
@haileysch__
Hailey Schoelkopf
2 years
Drago was such a kind, passionate mentor who poured so much time and effort into helping students at all levels. His devotion to the field really showed by his actions. Stunned by his loss, I would not be working in NLP now if it were not for him. May he rest in peace.
@hmkyale
Harlan Krumholz
2 years
The #AI community, the #computerscience community, the @YaleSEAS community, and humanity have suddenly lost a remarkable person, @dragomir_radev - kind and brilliant, devoted to his family and friends... gone too soon. A sad day @Yale @YINSedge @YaleCompsci #NLP2023
Tweet media one
Tweet media two
41
87
388
1
3
39
@haileysch__
Hailey Schoelkopf
6 months
@andersonbcdefg you're moving fast on this huh
@andersonbcdefg
Ben (e/treats)
6 months
so you can't use DBRX to improve other LLMs... but they never said you can't use it to make them Worse
10
1
133
1
0
39
@haileysch__
Hailey Schoelkopf
8 months
New paper out, streamlining the RLAIF setup presented in Constitutional AI to use critiques + revisions directly as natural language feedback! Hope to see more work using this to effectively constrain LLM-based systems’ behavior!
@synth_labs
SynthLabs
8 months
We also present Direct Principle Feedback (DPF) as a way to address this. Rather than relying on reranking, we can use the before/after of a revision as pairwise prefs. 5/N
Tweet media one
1
4
18
1
6
37
@haileysch__
Hailey Schoelkopf
6 months
@Teknium1 the next time someone puts out a benchmark like this, they should secretly hold out a 2nd test set, then evaluate the SOTA methods on it 6-12 months later to see how much overfitting / purposeful hill climbing was done on the public one
2
0
37
@haileysch__
Hailey Schoelkopf
10 months
@JacquesThibs yes they were
0
0
35
@haileysch__
Hailey Schoelkopf
2 months
Proud to have played a very small part in this large-scale audit of 14,000+ domains and their robots.txt consent policies in public pretraining corpora!
@ShayneRedford
Shayne Longpre
2 months
✨New Preprint ✨ How are shifting norms on the web impacting AI? We find: 📉 A rapid decline in the consenting data commons (the web) ⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic) ⛔️ Robots.txt preference protocols
Tweet media one
10
95
255
0
2
34
@haileysch__
Hailey Schoelkopf
1 year
Excited to see how this goes, and I will definitely submit some of my work here! I’m a little concerned this will only increase insularity and ignorance of work prior to LLMs, but glad the scope is large in what is considered “language modeling”!
@srush_nlp
Sasha Rush
1 year
Introducing COLM () the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
Tweet media one
34
434
2K
2
3
34
@haileysch__
Hailey Schoelkopf
10 months
apropos of nothing: there’s no post-Chinchilla (published) scaling law research on MoE models. what’s the optimal tokens::dense params::sparse params ratio for a given FLOP budget? are MoEs more data hungry? how much? if you’re interested in working on this, let’s get in touch!
3
0
34
@haileysch__
Hailey Schoelkopf
6 months
Torchtune is shipping with LM Evaluation Harness integration for evals of finetunes! Excited to see lm-eval adopted by the ecosystem—evals are crucial. we ( @lintangsutawika and I) are looking forward to collaborating with the torchtune team to build out deeper integration!
@kakemeister
Kartikay Khandelwal
6 months
Really excited to officially release torchtune: a PyTorch-native library for easily fine-tuning LLMs! Code: Blog: Tutorials: [1/5]
4
78
337
1
4
33
@haileysch__
Hailey Schoelkopf
1 year
@deliprao yet another reason to distrust “emergent abilities” with LLM scale—we have no way of knowing if generational improvements of models (GPT-3.5->GPT-4) are due to compute expended or just more expert data!
5
2
31
@haileysch__
Hailey Schoelkopf
3 months
Come say hi at ICML and hear what @AiEleuther team and community members have been up to!
@AiEleuther
EleutherAI
3 months
Looking for EleutherAI at @icmlconf ? Come meet our community and check out their fabulous work. Featuring (in order of appearance): @lintangsutawika @haileysch__ @aviskowron @BlancheMinerva @Vermeille_ @Void13950782 @dashstander @qinan_yu @norabelrose @CurtTigges
Tweet media one
Tweet media two
2
13
52
1
1
30
@haileysch__
Hailey Schoelkopf
6 months
there are a lot more extant Triton kernels out there now than there were even 6 months ago. are they listed anywhere already? if not, would collecting links to these be useful for the community?
4
0
31
@haileysch__
Hailey Schoelkopf
9 months
@gneubig at the current stage “aligned” in most finetuning papers doesn’t mean anything different from “instruction tuned”, though it somewhat connotes preference or safety fine-tuning
1
2
30
@haileysch__
Hailey Schoelkopf
8 months
so do i have to learn how NeRFs work now
@billpeeb
Bill Peebles
8 months
welcome to bling zoo! this is a single video generated by sora, shot changes and all.
202
578
4K
6
1
30
@haileysch__
Hailey Schoelkopf
1 year
No more guesswork needed to assess contamination, with public training data! Ruling out contamination with web-scale data is nigh impossible so documentation + replicability is crucial to understand + measure it better.
@zhangir_azerbay
Zhangir Azerbayev
1 year
Finally, we seek to quantify the effect of memorization. Surprisingly, we find that Llemma is no more accurate on problems that appear in its training set. Because our code and data are open, we encourage others to replicate and extend our analysis. 9/n
Tweet media one
1
1
33
2
1
29
@haileysch__
Hailey Schoelkopf
4 months
want to do a deep read of mamba-2 immediately but fighting against time for neurips datasets and benchmarks ;-; there are dozens of us!
4
0
29
@haileysch__
Hailey Schoelkopf
4 months
Thanks @hugobowne @HamelHusain for hosting me! I’ll be giving a talk about the gritty details of evaluation, don’t miss it :)
@hugobowne
Hugo Bowne-Anderson
4 months
💫We're super excited to announce a new speaker: @haileysch__ (from @AiEleuther & LM Eval harness) will be speaking on A Deep Dive on LLM Evaluation -- can't wait for this one!
4
5
50
1
4
29
@haileysch__
Hailey Schoelkopf
1 year
@norabelrose gating things based on MMLU % is *crazy*
3
1
28
@haileysch__
Hailey Schoelkopf
7 months
v0.4.2 of lm-evaluation-harness is now available on PyPI! Very happy about the contributions here and how lm-eval has been used lately by the community :))
@AiEleuther
EleutherAI
7 months
A new minor version release, 0.4.2, of the lm-evaluation-harness is available on PyPI! 1/n
1
6
35
0
1
29
@haileysch__
Hailey Schoelkopf
6 months
very cool work analyzing the properties of small LMs using Pythia!
@nthngdy
Nathan Godey @COLM 🇺🇸
6 months
🤏 Why do small Language Models underperform? We prove empirically and theoretically that the LM head on top of language models can limit performance through the softmax bottleneck phenomenon, especially when the hidden dimension <1000. 📄Paper: (1/10)
Tweet media one
15
125
604
0
0
28
@haileysch__
Hailey Schoelkopf
2 months
Come to @lintangsutawika’s and my tutorial this afternoon at ICML!
@lintangsutawika
Lintang Sutawika
2 months
Catch our tutorial at ICML today at Lehar 1-4 from 3:30-5:30 pm!
Tweet media one
1
5
36
0
3
28
@haileysch__
Hailey Schoelkopf
2 years
We released our first LM @carperai , testing the ability of models trained on FIM to infill code! Much more forthcoming from us, putting these results to use in better code LMs for pair programming very soon ;)
1
2
28
@haileysch__
Hailey Schoelkopf
10 months
goodbye MFU, hello goodput
Tweet media one
4
0
28
@haileysch__
Hailey Schoelkopf
5 months
Rigorously evaluating “agents” takes thought! great work debunking the (cost-normalized) performance of popular coding agents by @random_walker @sayashk @benediktstroebl .
@random_walker
Arvind Narayanan
5 months
On tasks like coding we can keep increasing accuracy by indefinitely increasing inference compute, so leaderboards are meaningless. The HumanEval accuracy-cost Pareto curve is entirely zero-shot models + our dead simple baseline agents. New research w @sayashk @benediktstroebl 🧵
Tweet media one
5
33
196
0
4
27
@haileysch__
Hailey Schoelkopf
3 months
I’m really excited about GoldFinch-style archs for local models. Like YOCO, it only has 1 global KV cache—but adds a 16x K-cache compression and avoids caching values entirely (2x compression)! And perf is still good. KV cache too cheap to meter! Congrats to @smerkyg et al!
@smerkyg
Dan Goldstein
3 months
Our new paper on the GoldFinch 🐦 hybrid Transformer architecture just dropped 🐣 at ! GoldFinch 🐦 combines the best parts of Linear Attention (via RWKV) and traditional Transformers to create something that is better than either one on its own!
Tweet media one
7
38
122
0
2
25
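Rough cache accounting behind the claims above (one global KV cache like YOCO, 16x K compression, no cached V), with purely illustrative shapes rather than GoldFinch's actual configuration:

```python
def kv_cache_gib(seq, n_layers, n_kv_heads, head_dim, bytes_per=2, k_factor=1.0, cache_v=True):
    per_token = n_layers * n_kv_heads * head_dim * bytes_per * (k_factor + (1.0 if cache_v else 0.0))
    return seq * per_token / 2**30

base = kv_cache_gib(128_000, 32, 32, 128)                               # vanilla MHA: K + V at every layer
gold = kv_cache_gib(128_000, 1, 32, 128, k_factor=1/16, cache_v=False)  # one layer's K, 16x smaller, no V
print(f"{base:.1f} GiB -> {gold:.3f} GiB at 128k context (illustrative shapes)")
```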
@haileysch__
Hailey Schoelkopf
1 year
@round I recently learned about hinton diagrams:
Tweet media one
2
1
27
@haileysch__
Hailey Schoelkopf
11 months
Connor Leahy has also called for *deleting GPT-4*. Falcon-180b and Llama-70b are open-weights, where’s the extinction risk this proposal fears from them?
2
0
27
@haileysch__
Hailey Schoelkopf
6 months
“finish lie” is definitely my current mood
@COLM_conf
Conference on Language Modeling
6 months
Pretty excited about today! What is happening?! Are you asking what is happening? Have you been living under a rock? ;) Good luck to all the COLM authors! We are excited to see all your hard work!
Tweet media one
3
7
90
4
0
27
@haileysch__
Hailey Schoelkopf
11 months
if you’re at a big lab, and scaling + productizing LLMs—it is actively in your best interest to push the overton window back, and ensure near-doom views like this don’t get any airtime
0
0
26
@haileysch__
Hailey Schoelkopf
2 years
This is really cool work! Turns out you don't just have to shove all your data at once into LM pretraining and hope for the best...
@tomekkorbak
Tomek Korbak
2 years
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
Tweet media one
7
95
586
1
3
25
@haileysch__
Hailey Schoelkopf
7 months
Context Distillation still seems criminally overlooked by the OSS community imo
4
1
25
@haileysch__
Hailey Schoelkopf
2 months
my talk on evaluation impl. details for the Mastering LLMs course is now public! cool move by @HamelHusain — also check out some of the other talks like @fly_upside_down discussing the awesome Inspect AI eval framework!
@HamelHusain
Hamel Husain
2 months
Link to blog post: In the post, we tell you how to get the most out of the course, what to expect, and how to navigate the materials. We are still adding a few lessons, but 95% of them are on the site. This is a unique course with 30+ legendary
6
59
352
0
2
23
@haileysch__
Hailey Schoelkopf
8 months
EleutherAI will be collaborating on the newly launched US AI Safety Institute Consortium! Excited for us to work with @NIST to further the science of AI evaluation.
@AiEleuther
EleutherAI
8 months
EleutherAI is excited to collaborate with NIST in its newly formed AI Safety Institute Consortium (AISIC) to establish a new measurement science for safe AI systems. See the official announcement here: #AISIC @NIST @CommerceGov
1
5
37
1
1
24
@haileysch__
Hailey Schoelkopf
1 year
looking at the data continues to be one of the most high-impact things you can do in ML. great work done by @ebriakou looking into the effects of pretraining data on PaLM’s “zero-shot” translation abilities!
@ebriakou
Eleftheria Briakou
1 year
LLMs exhibit translation capabilities despite having never seen intentionally-included translation examples, so... where do those capabilities come from? 🚨We show that incidental bilingualism connects to the machine translation capabilities of PaLM. 📜
Tweet media one
13
98
530
0
1
23
@haileysch__
Hailey Schoelkopf
14 days
I will be at PyTorch Conf all tomorrow! get in touch if you wanna say hi, or swing by the panel I’m on @ 4:45 !
@kakemeister
Kartikay Khandelwal
1 month
And finally, I have the pleasure of hosting an exciting panel discussion with @Tim_Dettmers , @haileysch__ , @achowdhery and @alex_conneau on everything from memory efficiency and PEFT to Multimodal LLMs and Agents. We’ll keep things spicy :) What should I ask them?
2
2
13
0
1
23
@haileysch__
Hailey Schoelkopf
2 months
honored to be a part of this @PrincetonPLI event thinking deeply about AI agents! excited to learn a lot from this star-studded speaker list :)
@sayashk
Sayash Kapoor
2 months
Agents are an active research area. But to be useful in the real world, they must be accurate, reliable, and cheap. Join our workshop on August 29 to learn from the creators of LangChain, DSPy, SWE-Bench, lm-eval-harness, Reflexion, SPADE and more. RSVP:
Tweet media one
8
42
211
0
1
23
@haileysch__
Hailey Schoelkopf
1 year
nervously checking W&B at the function
0
1
22
@haileysch__
Hailey Schoelkopf
1 year
@norabelrose - massively understate the surveillance and enforcement needed to ban general purpose computing
3
0
23
@haileysch__
Hailey Schoelkopf
4 months
omg, can’t wait to read this
@andrewgwils
Andrew Gordon Wilson
4 months
A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws! w/ @ShikaiQiu , Andres P, @m_finzi , @micahgoldblum 1/8
Tweet media one
8
79
537
2
0
23
@haileysch__
Hailey Schoelkopf
10 months
Gemini is very cool + impressive, but a note on evals: the uncertainty-routed CoT setting further assumes access to a validation set, in addition to the 5 shots given. This should be taken into account, as tuning on extra examples can boost perf a lot!
Tweet media one
1
1
22
@haileysch__
Hailey Schoelkopf
1 year
you absolutely love to see it. FA2 Is All You Need, but memory costs are quite steep—maybe it’s time for at-scale gisting () or blockwise parallelism ()?
@EnricoShippole
EnricoShippole
1 year
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model, trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther .
Tweet media one
28
173
781
2
3
21
@haileysch__
Hailey Schoelkopf
4 months
Really cool work from @zmkzmkz on uptraining models to share KV cache across layers!
@zmkzmkz
zed
4 months
Finally, I present you my first ever preprint: MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding We show that sharing KV heads between layers allows for KV cache smaller than GQA/MQA allowed for, with reasonable acc/mem tradeoff
Tweet media one
18
75
534
0
1
20
@haileysch__
Hailey Schoelkopf
1 year
@nearcyan it’s currently impossible to comply with the draft EU AI act, because many of the requirements require registry or info submission to an agency that doesn’t exist yet! so assessing current compliance is a bit of a misnomer.
3
1
20