Hailey Schoelkopf

@haileysch__

3,821 Followers
973 Following
24 Media
695 Statuses

she/her | research scientist @aiEleuther | lm training + evals | LM Evaluation Harness maintainer

Boston
Joined June 2022
@haileysch__
Hailey Schoelkopf
10 months
work in ML, they said. it’ll be fun, they said. Now I’m reading about the Based architecture and its HellaSwag score
31
47
1K
@haileysch__
Hailey Schoelkopf
1 year
Excited to announce our paper Pythia: A Suite for Analyzing LLMs across Training and Scaling has been accepted as an Oral paper at #ICML2023 !
Tweet media one
12
65
452
@haileysch__
Hailey Schoelkopf
4 months
My favorite bit in this paper: @bbrabbasi and I wrote an appendix formalizing what is done when evaluating models with loglikelihood multiple-choice and perplexity evals. afaik, none of this has been written up in one place before; in most papers it's just been tacitly assumed!
Tweet media one
@AiEleuther
EleutherAI
4 months
Excited to share our new paper, Lessons From The Trenches on Reproducible Evaluation of Language Models! In it, we discuss common challenges we’ve faced evaluating LMs, and how our library the Evaluation Harness is designed to mitigate them 🧵
Tweet media one
4
69
241
8
35
246
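A minimal sketch of loglikelihood-based multiple-choice scoring in the spirit of what the appendix formalizes; this is an illustration, not the paper's or lm-eval's code, and the model, prompt, and choices are arbitrary examples.

```python
# Illustrative loglikelihood multiple-choice scoring: score each answer choice by the
# sum of its token log-probabilities given the context, then take the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_loglikelihood(context: str, continuation: str) -> float:
    """Sum of log p(continuation token | context + preceding continuation tokens)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # [1, seq, vocab]
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # the token at position i is predicted by the logits at position i - 1
    # (lm-eval handles tokenization boundaries more carefully than this sketch)
    for i in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

question = "Q: What color is the sky on a clear day?\nA:"
choices = [" blue", " green", " a sandwich"]
scores = [continuation_loglikelihood(question, c) for c in choices]
# "acc" takes the argmax of the raw loglikelihood; lm-eval's "acc_norm" additionally
# normalizes by continuation length before the argmax.
pred = max(range(len(choices)), key=lambda i: scores[i])
print(dict(zip(choices, scores)), "->", choices[pred])
```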
@haileysch__
Hailey Schoelkopf
1 year
Beyond excited for our work on Llemma to finally be public!! We trained very strong general math LMs (7B > Minerva 8B, 34B ~= Minerva 62B) and released them + training, eval, analysis code! Can’t wait to see the math+AI field pick these up for future developments in the open.
@zhangir_azerbay
Zhangir Azerbayev
1 year
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: Paper: More ⬇️
Tweet media one
11
126
549
3
28
204
@haileysch__
Hailey Schoelkopf
11 months
This proposal suggests a global moratorium on training runs > 10^24 FLOPs—approximately the compute used to train Llama 2-70B, and 100x lower than the EO reporting threshold. Just astoundingly absurd
@_andreamiotti
Andrea Miotti
11 months
The AI Summit consensus is clear: it's time for international measures. Here is a concrete proposal. In our recent paper, @jasonhausenloy , Claire Dennis and I propose an international institution to address extinction risk from AI: MAGIC, a Multinational AGI Consortium.
Tweet media one
103
21
112
13
13
178
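The "~10^24 FLOPs ≈ Llama 2-70B" figure can be sanity-checked with the standard C ≈ 6·N·D approximation, using the publicly reported parameter and token counts; the numbers below are that rough estimate, not an official accounting.

```python
# Rough check of the "~1e24 FLOPs ~= Llama 2-70B" figure using C ~= 6 * N * D
# (N = parameters, D = training tokens).
N = 70e9      # Llama 2-70B parameters
D = 2.0e12    # ~2T training tokens (as reported)
C = 6 * N * D
print(f"{C:.2e} training FLOPs")   # ~8.4e23, i.e. a bit under 1e24
# The executive order's reporting threshold is 1e26 FLOPs, ~100x above 1e24.
```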
@haileysch__
Hailey Schoelkopf
7 months
so, it turns out i am the top solo user for downloads on HF (thanks solely to my lm-eval MMLU mirror! 😅) now is probably a good time to express how grateful i am for the users, contributors, and community around the LM Evaluation Harness! :’)
Tweet media one
6
18
174
@haileysch__
Hailey Schoelkopf
10 months
so that I can use it in Oogabooga
3
0
145
@haileysch__
Hailey Schoelkopf
1 year
We’ve released the paper for Pythia, a set of LLMs designed to facilitate scientific study on LLMs and their training data! So excited to have this out finally! Read more here:
@BlancheMinerva
Stella Biderman
1 year
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening and so designed a model suite that centers enabling scientific research as its primary goal
11
183
888
4
17
124
@haileysch__
Hailey Schoelkopf
5 months
what’s the current SOTA for KV cache compression? what are some must-read papers on this topic?
12
9
127
@haileysch__
Hailey Schoelkopf
7 months
👀it's always incredible to me just how ubiquitous and clear the induction head bump is
Tweet media one
5
6
118
@haileysch__
Hailey Schoelkopf
1 year
Presenting Pythia at board # 609 at 10:30 am today! Come by to talk LLM training, enabling interp + novel data effect studies, open science, and more!
Tweet media one
4
14
106
@haileysch__
Hailey Schoelkopf
20 days
where were you when GPQA was killed
7
6
105
@haileysch__
Hailey Schoelkopf
7 months
the parallel generation trick in Quiet-STaR is really neat and I wish I’d thought of it
Tweet media one
3
4
99
@haileysch__
Hailey Schoelkopf
1 year
Hey that’s me! :) Had a great time talking with @MichaelTrazzi about our team’s work on the Pythia LLMs while at #ICML2023 .
2
6
100
@haileysch__
Hailey Schoelkopf
1 year
I'm glad Deepspeed maintainers are as confused as I am by the library internals
Tweet media one
2
5
92
@haileysch__
Hailey Schoelkopf
2 years
Come check out what we’ve been working on recently: a set of LLM checkpoints over time (16 models x 143 checkpoints each) with consistent data ordering across model scales! available now
@AiEleuther
EleutherAI
2 years
What do LLMs learn over the course of training? How do these patterns change as you scale? To help answer these questions, we are releasing Pythia, a suite of LLMs + checkpoints specifically designed for research on interpretability and training dynamics!
4
87
473
4
14
93
@haileysch__
Hailey Schoelkopf
5 months
MLA TL;DR(?):
- low-rank projs for Q, K, V; k+v share down-proj(!)
- no pos. enc.
- for RoPE compat.: stack this with MQA that uses a smaller head dim, RoPE on the MQA
intuition for why this works: like only applying RoPE to 25% of head dim in MHA?
@deepseek_ai
DeepSeek
5 months
DeepSeek-V2 is a strong, economical, and efficient MoE language model, enhanced with exceptional architectural designs in attention mechanisms and sparse layers: 🌟 MLA (Multi-head Latent Attention): a better and faster attention that ensures efficient inference via reducing KV
Tweet media one
Tweet media two
Tweet media three
5
19
163
3
9
94
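A minimal PyTorch sketch of the ideas in the TL;DR above: low-rank Q and KV projections with a shared KV down-projection (no position encoding on that path), plus a small MQA-style RoPE branch concatenated onto each head. All shapes, ranks, and names (`MLASketch`, `rope`) are illustrative, not DeepSeek-V2's actual configuration; the causal mask and KV caching are omitted for brevity.

```python
import torch
import torch.nn as nn

def rope(x):
    # x: [batch, heads, seq, dim], dim even; standard rotary embedding
    b, h, s, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))
    angles = torch.arange(s)[:, None] * freqs[None, :]     # [seq, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLASketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, head_dim=32, kv_rank=64, q_rank=96, rope_dim=16):
        super().__init__()
        self.h, self.hd, self.rd = n_heads, head_dim, rope_dim
        self.q_down = nn.Linear(d_model, q_rank, bias=False)
        self.q_up = nn.Linear(q_rank, n_heads * head_dim, bias=False)
        self.q_rope = nn.Linear(q_rank, n_heads * rope_dim, bias=False)
        self.kv_down = nn.Linear(d_model, kv_rank, bias=False)   # K and V share this down-proj
        self.k_up = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_rank, n_heads * head_dim, bias=False)
        self.k_rope = nn.Linear(d_model, rope_dim, bias=False)   # MQA-style: one shared RoPE key head
        self.out = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        ql = self.q_down(x)
        q = self.q_up(ql).view(b, s, self.h, self.hd).transpose(1, 2)          # no pos. enc.
        qr = rope(self.q_rope(ql).view(b, s, self.h, self.rd).transpose(1, 2))
        kv = self.kv_down(x)                  # at inference, this small latent is what gets cached
        k = self.k_up(kv).view(b, s, self.h, self.hd).transpose(1, 2)
        v = self.v_up(kv).view(b, s, self.h, self.hd).transpose(1, 2)
        kr = rope(self.k_rope(x).view(b, s, 1, self.rd).transpose(1, 2)).expand(b, self.h, s, self.rd)
        q_full = torch.cat([q, qr], dim=-1)   # small RoPE'd slice stacked onto each NoPE head
        k_full = torch.cat([k, kr], dim=-1)
        attn = torch.softmax(q_full @ k_full.transpose(-1, -2) / (self.hd + self.rd) ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, s, -1))

x = torch.randn(2, 10, 256)
print(MLASketch()(x).shape)   # torch.Size([2, 10, 256])
```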
@haileysch__
Hailey Schoelkopf
1 year
I used to really be gunning for prefixLM / UL2 but I'm no longer sold on it. Maybe it wins out by a couple points on MMLU, but it's a lot more hparams to tweak and more expensive for generation / multi-turn settings. Causal is cheap, good, easy. This looks like cool work though!
@_akhaliq
AK
1 year
CausalLM is not optimal for in-context learning paper page: Recent empirical evidence indicates that transformer based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each
Tweet media one
7
86
415
2
8
94
@haileysch__
Hailey Schoelkopf
9 months
Support for Mamba has landed in lm-evaluation-harness! Use it with `--model mamba_ssm` : Was really happy to see @_albertgu @tri_dao provide support for our new release natively alongside their architecture code, to benchmark against Pythia reproducibly!
Tweet media one
3
5
81
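The tweet confirms the CLI flag `--model mamba_ssm`; below is a hedged sketch of the rough equivalent through lm-eval's Python API (lm-evaluation-harness ≥ 0.4). The checkpoint and task names are just examples, and the `mamba_ssm` backend requires the `mamba_ssm` package to be installed.

```python
# Illustrative only: running a Mamba checkpoint through lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="mamba_ssm",                               # same model type as the CLI flag
    model_args="pretrained=state-spaces/mamba-1.4b", # example checkpoint
    tasks=["lambada_openai"],                        # example task
)
print(results["results"])
```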
@haileysch__
Hailey Schoelkopf
1 year
If your LM training company finds that evals are broken (surprise!) and decides to develop its own higher-quality test sets—consider open-sourcing them! i will help you.
8
6
80
@haileysch__
Hailey Schoelkopf
5 months
@tamaybes a super-fun arcane historical detail: Gopher (and by extension Chinchilla) use Transformer-XL style position encodings. This means they spend 20B params (Gopher) and 5B params (Chinchilla) on just rel. position encoding!
1
6
80
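Rough arithmetic consistent with the 20B / 5B figures above, under the assumption that Transformer-XL-style relative attention adds roughly one d_model × d_model projection per layer for relative-position keys; the exact accounting in Gopher/Chinchilla may differ.

```python
# Back-of-the-envelope: relative-position-encoding parameters under the stated assumption.
for name, n_layers, d_model in [("Gopher 280B", 80, 16384), ("Chinchilla 70B", 80, 8192)]:
    rel_pos_params = n_layers * d_model * d_model
    print(f"{name}: ~{rel_pos_params / 1e9:.1f}B params on relative position encoding")
# Gopher: ~21.5B, Chinchilla: ~5.4B -- in line with the ~20B / ~5B figures
```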
@haileysch__
Hailey Schoelkopf
10 months
Giving a talk tomorrow 12pm EST at @gordic_aleksa 's server! Come to hear me chat about Pythia, Llemma, LM-Eval-Harness, and producing infrastructure for future OSS research exploration!
@gordic_aleksa
Aleksa Gordić 🍿🤖
10 months
Hailey Schoelkopf ( @haileysch__ ) is giving a talk tomorrow in my server! :) Hailey is/was part of many impactful projects such as BLOOM, Pythia, she's also core dev on @AiEleuther 's lm-evaluation-harness, etc talk title: pythia - research infra for LLMs discord:
Tweet media one
0
2
27
1
7
71
@haileysch__
Hailey Schoelkopf
2 months
reread the whole gopher paper and there are some really great "precursor" results (including negative results), predating some of the more recent training trends, in their appendices that I'd somehow forgotten?
2
6
71
@haileysch__
Hailey Schoelkopf
1 year
banning LMs trained on > 1T tokens is a surefire way to make sure byte-level never makes a comeback
@norabelrose
Nora Belrose
1 year
I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.
Tweet media one
56
37
406
7
2
70
@haileysch__
Hailey Schoelkopf
1 year
There’s a large literature on memorization in LLMs—but how can we move this theory into practice? We show how LLM trainers can treat memorization as a first-class concern in training, and put forward the first attempts at predicting precise behavior before training a model!
@BlancheMinerva
Stella Biderman
1 year
Not all memorization in LLMs is created equal. Some strings are bad to memorize, but most are quite benign. In our new paper we ask the questions "can we tell ahead of time which text will be memorized by a LLM?" and "can we change our mindset to make the answer yes?" A 🧵
Tweet media one
11
101
568
2
9
67
@haileysch__
Hailey Schoelkopf
7 months
the existence of Based and Mamba implies in a few months someone will put out an arch we have to call “Based Mamba” until the end of time
12
0
68
@haileysch__
Hailey Schoelkopf
1 year
I'll be at #ICML2023 next week! I'd love to chat about LLM training + infra, interp, evaluation, and tracing model behavior back to the training data! DMs are open :)
2
4
68
@haileysch__
Hailey Schoelkopf
10 months
Worth noting that GPT-4 was trained on a majority of the GSM8k training set intentionally. We don’t have a full understanding of synthetic data and how it passes down contamination, but good to remember when using GPT-4 or GPT-3.5 data for finetuning.
@arankomatsuzaki
Aran Komatsuzaki
10 months
TinyGSM: achieving >80% on GSM8k with small language models A duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger
Tweet media one
1
42
183
3
4
66
@haileysch__
Hailey Schoelkopf
3 months
native int8 training confirmed! really excited to see Character starting to share more technical details
@NoamShazeer
Noam Shazeer
3 months
Character AI is serving 20,000 QPS. Here are the technologies we use to serve hyper-efficiently. [ ]
31
199
1K
2
5
64
@haileysch__
Hailey Schoelkopf
4 months
So excited for this to be released! We’ve written a paper describing what we’ve learned from building the Eval Harness, including general advice on evaluation best practices for LMs.
@AiEleuther
EleutherAI
4 months
Excited to share our new paper, Lessons From The Trenches on Reproducible Evaluation of Language Models! In it, we discuss common challenges we’ve faced evaluating LMs, and how our library the Evaluation Harness is designed to mitigate them 🧵
Tweet media one
4
69
241
2
4
62
@haileysch__
Hailey Schoelkopf
2 years
this is a point really worth hammering home—chinchilla laws don’t matter where people think they do. if you try to train a 1B param LM on only 20B tokens you *are* going to have a bad time!
@BlancheMinerva
Stella Biderman
2 years
@MetaAI @GuillaumeLample Another thing I anticipate being a massive source of confusion: Their smaller models are massively overtrained. The fact that their 13B model meets or exceeds GPT-3 in performance is NOT contradictory to anything in the Chinchilla paper, because it's not compute-optimally trained
5
9
94
1
1
60
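The "1B params on only 20B tokens" number is exactly the Chinchilla rule of thumb D ≈ 20·N: compute-optimal for that budget, but small in absolute terms. A tiny worked calculation under that rule of thumb:

```python
# Chinchilla rule of thumb: D ~= 20 * N tokens for a compute-optimal run.
N = 1e9            # params
D = 20 * N         # ~2e10 tokens = 20B
C = 6 * N * D      # ~1.2e20 training FLOPs
print(f"D = {D:.0e} tokens, C = {C:.1e} FLOPs")
# For comparison, LLaMA-13B was trained on ~1T tokens, roughly 4x its Chinchilla-optimal
# budget -- "overtrained" by this metric, but much better at fixed inference cost.
```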
@haileysch__
Hailey Schoelkopf
9 months
Anthropic: “Users of a large language model may not know about hidden backdoors in the model if they lack access to a model’s parameters or a full understanding of its training process and dataset”
4
4
61
@haileysch__
Hailey Schoelkopf
6 months
MoD(E) looks like great work and I’m really glad it was published! but saving 50% of flops for (small batch) inference may not be a straight 2x speedup. Early-exit a la CALM doesn’t give reliable speedups for bs>1: if any tokens don’t exit early, you have to wait on them
@arankomatsuzaki
Aran Komatsuzaki
6 months
Google presents Mixture-of-Depths Dynamically allocating compute in transformer-based language models Same performance w/ a fraction of the FLOPs per forward pass
Tweet media one
6
90
611
9
1
59
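A toy illustration of the point above: with per-token early exit, a batched decoding step is only as fast as its deepest token, so average-layer savings don't translate directly into wall-clock speedup. The numbers here are made up.

```python
import random

random.seed(0)
n_layers, batch_size, steps = 32, 16, 1000

avg_batch_depth = 0.0
for _ in range(steps):
    exits = [random.randint(1, n_layers) for _ in range(batch_size)]  # exit layer per token
    avg_batch_depth += max(exits) / steps   # the whole batch waits for the deepest token

print(f"mean exit depth per token: ~{n_layers / 2:.0f}/{n_layers} layers (a '2x' FLOP saving)")
print(f"mean batch step depth:     ~{avg_batch_depth:.1f}/{n_layers} layers (barely any speedup)")
```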
@haileysch__
Hailey Schoelkopf
6 months
breaking my "agents" silence: this interface design requires:
- being realistic about current LLMs' limitations (> 100 LoC confuses them!)
- making tasks easier for LLMs by making them closer to what they've seen often in pretraining (see: Embers of Autoregression)
fascinating
@jyangballin
John Yang
6 months
Simply connecting an LM to a vanilla bash terminal does not work well. Our key insight is that LMs require carefully designed agent-computer interfaces (similar to how humans like good UI design) E.g. When the LM messes up indentation, our editor prevents it and gives feedback
Tweet media one
2
33
260
1
2
59
@haileysch__
Hailey Schoelkopf
3 months
I will die on the “Mistral-7B did not secretly use prefixLM” hill until or unless it’s explicitly confirmed otherwise
5
1
57
@haileysch__
Hailey Schoelkopf
5 months
@tamaybes but current models don’t allocate parameters to rotary embs! this means the Chinchilla D=20*N is skewed already for the actual param counts of most models, even if it held across datasets! If we disregarded the pos. encoding params the coefficients would change
2
2
56
@haileysch__
Hailey Schoelkopf
1 year
a machine will never understand commutativity
@OwainEvans_UK
Owain Evans
1 year
Does a language model trained on “A is B” generalize to “B is A”? E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?” Our new paper shows they cannot!
Tweet media one
175
707
4K
3
6
55
@haileysch__
Hailey Schoelkopf
7 months
even if you’re not using subquadratic architectures for your frontier training run, you should totally be using them for speculative decoding… pic unrelated
Tweet media one
1
6
56
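For context on how a cheap draft model (subquadratic or otherwise) slots into decoding: a toy sketch of greedy speculative decoding, where the draft proposes a block of tokens and the target only has to verify them. Both "models" below are stand-in functions over integer token sequences; the loop structure is the point, not the models.

```python
from typing import Callable, List

def speculative_decode(target: Callable[[List[int]], int],
                       draft: Callable[[List[int]], int],
                       prompt: List[int], gamma: int = 4, max_new: int = 20) -> List[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + max_new:
        # 1) the cheap draft model proposes `gamma` tokens greedily
        proposal = []
        for _ in range(gamma):
            proposal.append(draft(seq + proposal))
        # 2) the target verifies; in a real model this is one batched forward pass over
        #    seq + proposal, here we just query the toy function prefix by prefix
        accepted = []
        for i, tok in enumerate(proposal):
            target_tok = target(seq + proposal[:i])
            if tok == target_tok:
                accepted.append(tok)           # draft agreed with target: accepted for free
            else:
                accepted.append(target_tok)    # first disagreement: take target's token, stop
                break
        else:
            accepted.append(target(seq + proposal))   # all accepted: bonus target token
        seq += accepted
    return seq[:len(prompt) + max_new]

# toy models: the target counts 0..9 cyclically; the draft matches it ~90% of the time
import random
random.seed(0)
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: (s[-1] + 1) % 10 if random.random() < 0.9 else 7
print(speculative_decode(target, draft, prompt=[0]))   # identical output to greedy target decoding
```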
@haileysch__
Hailey Schoelkopf
1 year
@ilex_ulmus @BlancheMinerva @OntolMQ 2 trillion tokens is not a lot, and the Falcon team (models released by the UAE) was easily able to put 5 trillion tokens together with few engineers. There are no insurmountable barriers in current LM training for even moderately well-funded entities, let alone state actors.
2
2
52
@haileysch__
Hailey Schoelkopf
4 months
Excited for this new collab with @RylanSchaeffer to be out! Why are downstream evals harder to predict with scale than pretraining loss? for loglikelihood-based MCQA, we find an explanation!
@RylanSchaeffer
Rylan Schaeffer
4 months
❤️‍🔥❤️‍🔥Excited to share our new paper ❤️‍🔥❤️‍🔥 **Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?** w/ @haileysch__ @BrandoHablando @gabemukobi @varunrmadan @herbiebradley @ai_phd @BlancheMinerva @sanmikoyejo 1/N
Tweet media one
8
54
262
1
5
52
@haileysch__
Hailey Schoelkopf
8 months
Some really cool inventive “wacky” techniques from the OpenBMB team (annealing data contents in training + instruct data in cooldown! atypical infinite LR schedules! scaling embedding by… 12!) it’s clear we haven’t even begun to exhaust the possibilities of LLM training.
@OpenBMB
OpenBMB
8 months
MiniCPM Blog : MiniCPM: Unveiling the Potential of End-side Large Language Models. #MiniCPM -2B: An end-side LLM outperforms Llama2-13B @huggingface @_akhaliq @Xianbao_QIAN
4
17
69
4
8
51
@haileysch__
Hailey Schoelkopf
3 months
I'll be at @icmlconf all week next week! Especially interested in chatting about:
- evaluations: infra, missing evals, what's next
- systems optimizations for distributed training+inference
- predictable scaling (scaling laws, eval forecasting...)
DM if you'd like to chat! :)
0
3
50
@haileysch__
Hailey Schoelkopf
1 year
it’s really incredible to have more openly available datasets, congrats to @soldni @allen_ai for the heroic effort on this release! this also marks (to my knowledge) only the third LLM pretraining corpus to provide a datasheet :) very glad that list is longer than last year’s!
@soldni
Luca Soldaini 🎀
1 year
Announcing Dolma, the dataset for @allen_ai 's LLM, OLMo. It's 3+ trillion tokens (web/papers/code/books/wiki). We hope it will facilitate study of LLMs & their behavior! Released on @huggingface w ImpACT license Overview/datasheet
Tweet media one
21
144
558
2
7
50
@haileysch__
Hailey Schoelkopf
1 year
Belated announcement: honored to have presented at @MITFutureTech workshop on AI Scaling! I talked about how we scale LLMs today, and how it might look in a few years.
@MITFutureTech
MIT FutureTech
1 year
Hailey Schoelkopf @haileysch__ from @aiEleuther discussing the future of LLMs. #MITAIScaling @MITFutureTech
Tweet media one
0
0
8
1
6
49
@haileysch__
Hailey Schoelkopf
16 days
one must imagine paris hilton getting 80% MFU
Tweet media one
@mrdrozdov
Andrew Drozdov
18 days
stop 🤯
Tweet media one
0
2
23
4
0
48
@haileysch__
Hailey Schoelkopf
4 months
the real move would be to A/B test Golden Gate Claude versus a sysprompt
@AnthropicAI
Anthropic
4 months
This week, we showed how altering internal "features" in our AI, Claude, could change its behavior. We found a feature that can make Claude focus intensely on the Golden Gate Bridge. Now, for a limited time, you can chat with Golden Gate Claude:
Tweet media one
110
264
2K
3
3
48
@haileysch__
Hailey Schoelkopf
1 year
We wrote a blog post on the napkin math that goes into training LLMs at scale! Check it out here:
@AiEleuther
EleutherAI
1 year
The most common question we get about our models is "will X fit on Y GPU?" This, and many more questions about training and inferring with LLMs, can be answered with some relatively easy math. By @QuentinAnthon15 , @BlancheMinerva , and @haileysch__
12
102
509
1
2
46
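The blog's "will X fit on Y GPU?" question largely comes down to per-parameter byte accounting. A hedged sketch using the usual rules of thumb (2 bytes/param for bf16 weights at inference; roughly 16 bytes/param for weights + gradients + Adam optimizer states in mixed-precision training, excluding activations and framework overhead):

```python
def inference_gib(n_params, bytes_per_param=2):    # bf16/fp16 weights only
    return n_params * bytes_per_param / 2**30

def training_gib(n_params, bytes_per_param=16):    # mixed-precision Adam, no activations
    return n_params * bytes_per_param / 2**30

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params: ~{inference_gib(n):.0f} GiB to serve, "
          f"~{training_gib(n):.0f} GiB of training state before activations")
# 7B: ~13 GiB serve / ~104 GiB train state; 70B: ~130 GiB serve / ~1043 GiB train state
```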
@haileysch__
Hailey Schoelkopf
2 months
really cool in-depth exploration of prompt sensitivity for llm evals! wish I’d known about this paper earlier
@m_ryabinin
Max Ryabinin
2 months
In our new #ACL2024 paper, we show that LLMs remain sensitive to prompt formats even with improved few-shot techniques. Our findings suggest that careful evaluation needs to take this lack of robustness into account 📜: 🖥️:
Tweet media one
2
12
64
2
2
47
@haileysch__
Hailey Schoelkopf
3 months
New survey on "meta-generation" algorithms for generating text from LLMs with greater inference compute. Check out section 7 for a discussion of how methods to speed up generation interact with these complex algorithms, from Best-of-N to beam search to MCTS and beyond!
@wellecks
Sean Welleck
3 months
What do nucleus sampling, tree-of-thought, and PagedAttention have in common? They're all part of our new survey: "From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models"
Tweet media one
10
115
546
2
5
45
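Of the meta-generation algorithms the survey covers, best-of-N is the simplest: sample N candidates, score each with some reward or verifier, keep the best. A toy sketch with placeholder generator and scorer (both made up for illustration):

```python
import random
from typing import Callable, List

def best_of_n(generate: Callable[[], str], score: Callable[[str], float], n: int = 8) -> str:
    candidates: List[str] = [generate() for _ in range(n)]
    return max(candidates, key=score)

random.seed(0)
generate = lambda: " ".join(random.choices(["the", "cat", "sat", "on", "mat"], k=5))
score = lambda text: text.count("cat") + text.count("mat")   # toy "reward model"
print(best_of_n(generate, score, n=16))
```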
@haileysch__
Hailey Schoelkopf
7 months
I have developed a truly marvelous scaffolded LLM agent, a proof of which this margin is too narrow to contain…
0
3
45
@haileysch__
Hailey Schoelkopf
11 months
If there's anything less glamorous yet high-impact in ML than looking at the data, it's doing due diligence on licensing. This is incredible work!
@ShayneRedford
Shayne Longpre
11 months
📢Announcing the🌟Data Provenance Initiative🌟 🧭A rigorous public audit of 1800+ instruct/align datasets 🔍Explore/filter sources, creators & license conditions ⚠️We see a rising divide between commercially open v closed licensed data 🌐: 1/
10
148
463
0
10
45
@haileysch__
Hailey Schoelkopf
1 year
full of joy post-ICML meeting old and new friends :) If we missed each other this week, definitely reach out!
Tweet media one
1
1
44
@haileysch__
Hailey Schoelkopf
2 months
Our "Challenges in LM Evaluation" ICML24 tutorial slides are now public! Thanks for the feedback from everyone who attended :)
@AiEleuther
EleutherAI
2 months
We were very happy with the reception to our researchers @lintangsutawika and @haileysch__ 's ICML tutorial, "Challenges in LM Evaluation", this past week! For all those who requested it, the slides are now available at . Enjoy!
1
12
36
0
2
41
@haileysch__
Hailey Schoelkopf
1 month
@WenhuChen this should be an artifact of the induction head bump which shows up in most transformer loss curves!
@haileysch__
Hailey Schoelkopf
7 months
👀it's always incredible to me just how ubiquitous and clear the induction head bump is
Tweet media one
5
6
118
2
0
42
@haileysch__
Hailey Schoelkopf
2 years
Drago was such a kind, passionate mentor who poured so much time and effort into helping students at all levels. His devotion to the field really showed by his actions. Stunned by his loss, I would not be working in NLP now if it were not for him. May he rest in peace.
@hmkyale
Harlan Krumholz
2 years
The #AI community, the #computerscience community, the @YaleSEAS community, and humanity have suddenly lost a remarkable person, @dragomir_radev - kind and brilliant, devoted to his family and friends... gone too soon. A sad day @Yale @YINSedge @YaleCompsci #NLP2023
Tweet media one
Tweet media two
41
87
388
1
3
39
@haileysch__
Hailey Schoelkopf
6 months
@andersonbcdefg you're moving fast on this huh
@andersonbcdefg
Ben (e/treats)
6 months
so you can't use DBRX to improve other LLMs... but they never said you can't use it to make them Worse
10
1
133
1
0
39
@haileysch__
Hailey Schoelkopf
8 months
New paper out, streamlining the RLAIF setup presented in Constitutional AI to use critiques + revisions directly as natural language feedback! Hope to see more work using this to effectively constrain LLM-based systems’ behavior!
@synth_labs
SynthLabs
8 months
We also present Direct Principle Feedback (DPF) as a way to address this. Rather than relying on reranking, we can use the before/after of a revision as pairwise prefs. 5/N
Tweet media one
1
4
18
1
6
37
@haileysch__
Hailey Schoelkopf
6 months
@Teknium1 the next time someone puts out a benchmark like this, they should secretly hold out a 2nd test set, then evaluate the SOTA methods on it 6-12 months later to see how much overfitting / purposeful hill climbing was done on the public one
2
0
37
@haileysch__
Hailey Schoelkopf
10 months
@JacquesThibs yes they were
0
0
35
@haileysch__
Hailey Schoelkopf
2 months
Proud to have played a very small part in this large-scale audit of 14,000+ domains and their robots.txt consent policies in public pretraining corpora!
@ShayneRedford
Shayne Longpre
2 months
✨New Preprint ✨ How are shifting norms on the web impacting AI? We find: 📉 A rapid decline in the consenting data commons (the web) ⚖️ Differing access to data by company, due to crawling restrictions (e.g.🔻26% OpenAI, 🔻13% Anthropic) ⛔️ Robots.txt preference protocols
Tweet media one
10
95
255
0
2
34
@haileysch__
Hailey Schoelkopf
1 year
Excited to see how this goes, and I will definitely submit some of my work here! I’m a little concerned this will only increase insularity and ignorance of work prior to LLMs, but glad the scope is large in what is considered “language modeling”!
@srush_nlp
Sasha Rush
1 year
Introducing COLM () the Conference on Language Modeling. A new research venue dedicated to the theory, practice, and applications of language models. Submissions: March 15 (it's pronounced "collum" 🕊️)
Tweet media one
34
434
2K
2
3
34
@haileysch__
Hailey Schoelkopf
10 months
apropos of nothing: there’s no post-Chinchilla (published) scaling law research on MoE models. what’s the optimal tokens::dense params::sparse params ratio for a given FLOP budget? are MoEs more data hungry? how much? if you’re interested in working on this, let’s get in touch!
3
0
34
@haileysch__
Hailey Schoelkopf
6 months
Torchtune is shipping with LM Evaluation Harness integration for evals of finetunes! Excited to see lm-eval adopted by the ecosystem—evals are crucial. we ( @lintangsutawika and I) are looking forward to collaborating with the torchtune team to build out deeper integration!
@kakemeister
Kartikay Khandelwal
6 months
Really excited to officially release torchtune: a PyTorch-native library for easily fine-tuning LLMs! Code: Blog: Tutorials: [1/5]
4
78
337
1
4
33
@haileysch__
Hailey Schoelkopf
1 year
@deliprao yet another reason to distrust “emergent abilities” with LLM scale—we have no way of knowing if generational improvements of models (GPT-3.5->GPT-4) are due to compute expended or just more expert data!
5
2
31
@haileysch__
Hailey Schoelkopf
3 months
Come say hi at ICML and hear what @AiEleuther team and community members have been up to!
@AiEleuther
EleutherAI
3 months
Looking for EleutherAI at @icmlconf ? Come meet our community and check out their fabulous work. Featuring (in order of appearance): @lintangsutawika @haileysch__ @aviskowron @BlancheMinerva @Vermeille_ @Void13950782 @dashstander @qinan_yu @norabelrose @CurtTigges
Tweet media one
Tweet media two
2
13
52
1
1
30
@haileysch__
Hailey Schoelkopf
6 months
there are a lot more extant Triton kernels out there now than there were even 6 months ago. are they listed anywhere already? if not, would collecting links to these be useful for the community?
4
0
31
@haileysch__
Hailey Schoelkopf
9 months
@gneubig at the current stage “aligned” in most finetuning papers doesn’t mean anything different from “instruction tuned”, though it somewhat connotes preference or safety fine-tuning
1
2
30
@haileysch__
Hailey Schoelkopf
8 months
so do i have to learn how NeRFs work now
@billpeeb
Bill Peebles
8 months
welcome to bling zoo! this is a single video generated by sora, shot changes and all.
202
578
4K
6
1
30
@haileysch__
Hailey Schoelkopf
1 year
No more guesswork needed to assess contamination, with public training data! Ruling out contamination with web-scale data is nigh impossible so documentation + replicability is crucial to understand + measure it better.
@zhangir_azerbay
Zhangir Azerbayev
1 year
Finally, we seek to quantify the effect of memorization. Surprisingly, we find that Llemma is no more accurate on problems that appear in its training set. Because our code and data are open, we encourage others to replicate and extend our analysis. 9/n
Tweet media one
1
1
33
2
1
29
@haileysch__
Hailey Schoelkopf
4 months
want to do a deep read of mamba-2 immediately but fighting against time for neurips datasets and benchmarks ;-; there are dozens of us!
4
0
29
@haileysch__
Hailey Schoelkopf
4 months
Thanks @hugobowne @HamelHusain for hosting me! I’ll be giving a talk about the gritty details of evaluation, don’t miss it :)
@hugobowne
Hugo Bowne-Anderson
4 months
💫We're super excited to announce a new speaker: @haileysch__ (from @AiEleuther & LM Eval harness) will be speaking on A Deep Dive on LLM Evaluation -- can't wait for this one!
4
5
50
1
4
29
@haileysch__
Hailey Schoelkopf
1 year
@norabelrose gating things based on MMLU % is *crazy*
3
1
28
@haileysch__
Hailey Schoelkopf
7 months
v0.4.2 of lm-evaluation-harness is now available on PyPI! Very happy about the contributions here and how lm-eval has been used lately by the community :))
@AiEleuther
EleutherAI
7 months
A new minor version release, 0.4.2, of the lm-evaluation-harness is available on PyPI! 1/n
1
6
35
0
1
29
@haileysch__
Hailey Schoelkopf
6 months
very cool work analyzing the properties of small LMs using Pythia!
@nthngdy
Nathan Godey @COLM 🇺🇸
6 months
🤏 Why do small Language Models underperform? We prove empirically and theoretically that the LM head on top of language models can limit performance through the softmax bottleneck phenomenon, especially when the hidden dimension <1000. 📄Paper: (1/10)
Tweet media one
15
125
604
0
0
28
@haileysch__
Hailey Schoelkopf
2 months
Come to @lintangsutawika’s and my tutorial this afternoon at ICML!
@lintangsutawika
Lintang Sutawika
2 months
Catch our tutorial at ICML today at Lehar 1-4 from 3:30-5:30 pm!
Tweet media one
1
5
36
0
3
28
@haileysch__
Hailey Schoelkopf
2 years
We released our first LM @carperai , testing the ability of models trained on FIM to infill code! Much more forthcoming from us, putting these results to use in better code LMs for pair programming very soon ;)
1
2
28
@haileysch__
Hailey Schoelkopf
10 months
goodbye MFU, hello goodput
Tweet media one
4
0
28
@haileysch__
Hailey Schoelkopf
5 months
Rigorously evaluating “agents” takes thought! great work debunking the (cost-normalized) performance of popular coding agents by @random_walker @sayashk @benediktstroebl .
@random_walker
Arvind Narayanan
5 months
On tasks like coding we can keep increasing accuracy by indefinitely increasing inference compute, so leaderboards are meaningless. The HumanEval accuracy-cost Pareto curve is entirely zero-shot models + our dead simple baseline agents. New research w @sayashk @benediktstroebl 🧵
Tweet media one
5
33
196
0
4
27
@haileysch__
Hailey Schoelkopf
3 months
I’m really excited about GoldFinch-style archs for local models. Like YOCO, it only has 1 global KV cache—but adds a 16x K-cache compression and avoids caching values entirely (2x compression)! And perf is still good. KV cache too cheap to meter! Congrats to @smerkyg et al!
@smerkyg
Dan Goldstein
3 months
Our new paper on the GoldFinch 🐦 hybrid Transformer architecture just dropped 🐣 at ! GoldFinch 🐦 combines the best parts of Linear Attention (via RWKV) and traditional Transformers to create something that is better than either one on its own!
Tweet media one
7
38
122
0
2
25
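Rough cache accounting behind the claims above (one global KV cache like YOCO, 16x K compression, no cached V), with purely illustrative shapes rather than GoldFinch's actual configuration:

```python
def kv_cache_gib(seq, n_layers, n_kv_heads, head_dim, bytes_per=2, k_factor=1.0, cache_v=True):
    per_token = n_layers * n_kv_heads * head_dim * bytes_per * (k_factor + (1.0 if cache_v else 0.0))
    return seq * per_token / 2**30

base = kv_cache_gib(128_000, 32, 32, 128)                               # vanilla MHA: K + V at every layer
gold = kv_cache_gib(128_000, 1, 32, 128, k_factor=1/16, cache_v=False)  # one layer's K, 16x smaller, no V
print(f"{base:.1f} GiB -> {gold:.3f} GiB at 128k context (illustrative shapes)")
```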
@haileysch__
Hailey Schoelkopf
1 year
@round I recently learned about hinton diagrams:
Tweet media one
2
1
27
@haileysch__
Hailey Schoelkopf
11 months
Connor Leahy has also called for *deleting GPT-4*. Falcon-180b and Llama-70b are open-weights, where’s the extinction risk this proposal fears from them?
2
0
27
@haileysch__
Hailey Schoelkopf
6 months
“finish lie” is definitely my current mood
@COLM_conf
Conference on Language Modeling
6 months
Pretty excited about today! What is happening?! Are you asking what is happening? Have you been living under a rock? ;) Good luck to all the COLM authors! We are excited to see all your hard work!
Tweet media one
3
7
90
4
0
27
@haileysch__
Hailey Schoelkopf
11 months
if you’re at a big lab, and scaling + productizing LLMs—it is actively in your best interest to push the overton window back, and ensure near-doom views like this don’t get any airtime
0
0
26
@haileysch__
Hailey Schoelkopf
2 years
This is really cool work! Turns out you don't just have to shove all your data at once into LM pretraining and hope for the best...
@tomekkorbak
Tomek Korbak
2 years
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
Tweet media one
7
95
586
1
3
25
@haileysch__
Hailey Schoelkopf
7 months
Context Distillation still seems criminally overlooked by the OSS community imo
4
1
25
@haileysch__
Hailey Schoelkopf
2 months
my talk on evaluation impl. details for the Mastering LLMs course is now public! cool move by @HamelHusain — also check out some of the other talks like @fly_upside_down discussing the awesome Inspect AI eval framework!
@HamelHusain
Hamel Husain
2 months
Link to blog post: In the post, we tell you how to get the most out of the course, what to expect, and how to navigate the materials. We are still adding a few lessons, but 95% of them are on the site. This is a unique course with 30+ legendary
6
59
352
0
2
23
@haileysch__
Hailey Schoelkopf
8 months
EleutherAI will be collaborating on the newly launched US AI Safety Institute Consortium! Excited for us to work with @NIST to further the science of AI evaluation.
@AiEleuther
EleutherAI
8 months
EleutherAI is excited to collaborate with NIST in its newly formed AI Safety Institute Consortium (AISIC) to establish a new measurement science for safe AI systems. See the official announcement here: #AISIC @NIST @CommerceGov
1
5
37
1
1
24
@haileysch__
Hailey Schoelkopf
1 year
looking at the data continues to be one of the most high-impact things you can do in ML. great work done by @ebriakou looking into the effects of pretraining data on PaLM’s “zero-shot” translation abilities!
@ebriakou
Eleftheria Briakou
1 year
LLMs exhibit translation capabilities despite having never seen intentionally-included translation examples, so... where do those capabilities come from? 🚨We show that incidental bilingualism connects to the machine translation capabilities of PaLM. 📜
Tweet media one
13
98
530
0
1
23
@haileysch__
Hailey Schoelkopf
14 days
I will be at PyTorch Conf all tomorrow! get in touch if you wanna say hi, or swing by the panel I’m on @ 4:45 !
@kakemeister
Kartikay Khandelwal
1 month
And finally, I have the pleasure of hosting an exciting panel discussion with @Tim_Dettmers , @haileysch__ , @achowdhery and @alex_conneau on everything from memory efficiency and PEFT to Multimodal LLMs and Agents. We’ll keep things spicy :) What should I ask them?
2
2
13
0
1
23
@haileysch__
Hailey Schoelkopf
2 months
honored to be a part of this @PrincetonPLI event thinking deeply about AI agents! excited to learn a lot from this star-studded speaker list :)
@sayashk
Sayash Kapoor
2 months
Agents are an active research area. But to be useful in the real world, they must be accurate, reliable, and cheap. Join our workshop on August 29 to learn from the creators of LangChain, DSPy, SWE-Bench, lm-eval-harness, Reflexion, SPADE and more. RSVP:
Tweet media one
8
42
211
0
1
23
@haileysch__
Hailey Schoelkopf
1 year
nervously checking W&B at the function
0
1
22
@haileysch__
Hailey Schoelkopf
1 year
@norabelrose - massively understate the surveillance and enforcement needed to ban general purpose computing
3
0
23
@haileysch__
Hailey Schoelkopf
4 months
omg, can’t wait to read this
@andrewgwils
Andrew Gordon Wilson
4 months
A lot of the computation in pre-training transformers is now spent in the dense linear (MLP) layers. In our new ICML paper, we propose matrix structures with better scaling laws! w/ @ShikaiQiu , Andres P, @m_finzi , @micahgoldblum 1/8
Tweet media one
8
79
537
2
0
23
@haileysch__
Hailey Schoelkopf
10 months
Gemini is very cool + impressive, but a note on evals: the uncertainty-routed CoT setting further assumes access to a validation set, in addition to the 5 shots given. This should be taken into account, as tuning on extra examples can boost perf a lot!
Tweet media one
1
1
22
@haileysch__
Hailey Schoelkopf
1 year
you absolutely love to see it. FA2 Is All You Need, but memory costs are quite steep—maybe it’s time for at-scale gisting () or blockwise parallelism ()?
@EnricoShippole
EnricoShippole
1 year
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model, trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther .
Tweet media one
28
173
781
2
3
21
@haileysch__
Hailey Schoelkopf
4 months
Really cool work from @zmkzmkz on uptraining models to share KV cache across layers!
@zmkzmkz
zed
4 months
Finally, I present you my first ever preprint: MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding We show that sharing KV heads between layers allows for KV cache smaller than GQA/MQA allowed for, with reasonable acc/mem tradeoff
Tweet media one
18
75
534
0
1
20
@haileysch__
Hailey Schoelkopf
1 year
@nearcyan it’s currently impossible to comply with the draft EU AI act, because many of the requirements require registry or info submission to an agency that doesn’t exist yet! so assessing current compliance is a bit of a misnomer.
3
1
20