Amanda Bertsch

@abertsch72

1,407
Followers
778
Following
23
Media
299
Statuses

PhD student @LTIatCMU / @SCSatCMU, researching text generation + summarization | she/her | also @abertsch on bsky or by email

Joined August 2014
Pinned Tweet
@abertsch72
Amanda Bertsch
4 months
In-context learning provides an LLM with a few examples to improve accuracy. But with long-context LLMs, we can now use *thousands* of examples in-context. We find that this long-context ICL paradigm is surprisingly effective, and differs in behavior from short-context ICL! 🧵
14
108
570
@abertsch72
Amanda Bertsch
1 year
What if we could run Transformer models without worrying about context length? With our new work Unlimiformer, you can jailbreak your current models to use unlimited length inputs! Preprint: Thread 🧵 (1/6)
19
162
953
@abertsch72
Amanda Bertsch
11 months
What do decoding methods including self-consistency, output ensembling, and range voting have in common? They’re all variants of Minimum Bayes Risk (MBR) decoding! This useful and easy-to-apply method generalizes many modern generation techniques! 👇🧵
1
40
190
@abertsch72
Amanda Bertsch
1 year
Now accepted to NeurIPS'23! Looking forward to talking to folks in New Orleans 🎉
3
12
66
@abertsch72
Amanda Bertsch
1 year
Unlimiformer is a retrieval-augmentation method for encoder-decoder models. It dynamically updates the context window at each decoding step, so that each head in each layer attends to its own top-k input tokens. Unlimiformer can be added to any pretrained encoder-decoder! (2/6)
1
6
59
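The per-head retrieval idea in the tweet above can be sketched in a few lines of numpy (a toy illustration with made-up shapes, not the released Unlimiformer code): instead of attending over every input token, the head retrieves the k keys with the highest attention scores and softmaxes only over those.

```python
import numpy as np

def topk_attention(query, keys, values, k):
    """Toy retrieval-based attention: attend only over the k input
    positions with the highest attention scores for this head."""
    scores = keys @ query                        # (n_tokens,)
    topk = np.argsort(scores)[-k:]               # indices of the k best keys
    weights = np.exp(scores[topk] - scores[topk].max())
    weights /= weights.sum()                     # softmax over the retrieved keys
    return weights @ values[topk]                # (d_value,)

rng = np.random.default_rng(0)
n, d = 10_000, 64
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
query = rng.standard_normal(d)

out = topk_attention(query, keys, values, k=16)
print(out.shape)  # (64,)
```

With k equal to the full input length this reduces to ordinary softmax attention; the retrieval step just prunes the low-scoring positions.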
@abertsch72
Amanda Bertsch
2 years
What if you could get the benefits of extractive summarization in the dialogue domain? Check out our #EMNLP2022 Findings paper "He Said, She Said: Style Transfer for Shifting the Perspectives of Dialogues"! paper link: 🧵👇 (1/9)
1
8
51
@abertsch72
Amanda Bertsch
1 year
At test time, Unlimiformer can summarize *books* of more than 300k tokens without truncation! And inference cost increases sub-linearly with input length, so summarizing 100k token inputs is only 1.5x slower than summarizing 1k token inputs. (5/6)
2
0
49
@abertsch72
Amanda Bertsch
9 months
I'm at NeurIPS this week! Would love to chat with folks interested in long contexts, fair evaluation, decoding, and/or philosophy of science. I'll also be presenting our paper Unlimiformer (below) in poster session 5: 10:45-12:45 Thursday. Check it out at board # 524!
0
6
44
@abertsch72
Amanda Bertsch
9 months
Excited to see new and old friends this week in Singapore! Happy to chat about long context reasoning, summarization, decoding, philosophy of science, or finding good vegetarian food at hawker centers :) My collaborators and I are presenting several papers, starting today! ⬇️
1
8
34
@abertsch72
Amanda Bertsch
1 year
We reformulate attention for efficient retrieval using a single datastore, shared across all cross-attention heads in all decoder layers. This requires storing only a single vector per input token, allowing inputs of more than half a million tokens on a single GPU! (3/6)
2
0
28
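The "single vector per input token" reformulation rests on a small piece of linear algebra: the head-specific key projection can be folded into the query, so the shared datastore stores raw encoder states rather than per-head keys. A numpy check of that identity (a sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, n_tokens = 32, 8, 100
h_e = rng.standard_normal((n_tokens, d_model))  # encoder states: one vector per token
h_d = rng.standard_normal(d_model)              # current decoder state
W_q = rng.standard_normal((d_model, d_head))    # per-head query projection
W_k = rng.standard_normal((d_model, d_head))    # per-head key projection

# Standard attention scores: project per-head keys, then dot with the query.
scores_standard = (h_e @ W_k) @ (h_d @ W_q)

# Reformulated: fold W_k into the query, so the index stores only h_e.
q_folded = (h_d @ W_q) @ W_k.T                  # (d_model,)
scores_folded = h_e @ q_folded

print(np.allclose(scores_standard, scores_folded))  # True
```

Because every head and layer can form its own folded query against the same stored encoder states, one datastore of n_tokens vectors serves all cross-attention heads.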
@abertsch72
Amanda Bertsch
1 year
Unlimiformer improves existing models without any fine-tuning, in cheap fine-tuning regimes, and with specialized training. Not only does it outperform other strong long-range transformers, but it can be applied on top of models such as LongFormer for further improvements! (4/6)
1
0
21
@abertsch72
Amanda Bertsch
10 months
@nsaphra @enfleisig @kchonyc excited to read this, it looks really cool! you might also like our paper, which discusses some similar themes from a different methodological angle :)
@_sireesh
Sireesh Gururaja
11 months
We all know that “recently large language models have”, “large language models are”, and “large language models can.” But why LLMs? How did we get here? (where is “here”?) What forces are shaping NLP, and how recent are they, actually? To appear at EMNLP:
6
66
290
1
2
12
@abertsch72
Amanda Bertsch
1 year
Check out our survey on using human feedback for generation!
@psanfernandes
Patrick Fernandes
1 year
*Human feedback* was the necessary secret sauce in making #chatgpt so human-like. But what exactly is feedback? And how can we leverage it to improve our models? Check out our new survey on the use of (human) feedback in Natural Language Generation! 1/16
8
101
437
0
0
12
@abertsch72
Amanda Bertsch
2 years
#NAACL2022 this week has been incredible! For folks still in Seattle, I’ll be co-presenting our paper “Evaluating Gender Bias Transfer from Film” at 4:15 in room 702-Clearwater, at the @genderbiasnlp workshop. (Not at NAACL or left a bit early? summary below ⬇️🧵)
1
1
11
@abertsch72
Amanda Bertsch
9 months
And finally, I'm so proud of our study of cultural and methodological shifts in NLP. What are the incentives that drive our field? How'd we wind up here? @claranahhh will be giving a talk on this work 12/10 at 11:30 in the West 2 Ballroom!
0
3
10
@abertsch72
Amanda Bertsch
2 years
if driving makes you so frustrated that you need to tailgate and scream obscenities at bicyclists sharing the road with you, maybe you…shouldn’t drive?
1
0
10
@abertsch72
Amanda Bertsch
11 months
However, many modern methods such as self-consistency (by X Wang, @jasonwei , et al), range voting (by @borgeaud_s & G Emerson), and output-ensembling methods (e.g. Post-Ensemble by @hakobayato ) can be written as subcases or variants of MBR. Our taxonomy provides a unified view.
1
0
9
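The reduction for self-consistency is easy to see concretely (a toy sketch, not code from the paper): with an exact-match utility, MBR over sampled final answers is just majority voting.

```python
def mbr_select(candidates, utility):
    """Pick the candidate with the highest total agreement with the others."""
    return max(candidates,
               key=lambda c: sum(utility(c, o) for o in candidates))

def exact_match(a, b):
    return 1.0 if a == b else 0.0

# With an exact-match utility, MBR reduces to majority voting over
# sampled final answers -- i.e., self-consistency.
sampled_answers = ["42", "41", "42", "42", "17"]
print(mbr_select(sampled_answers, exact_match))  # 42
```

Swapping in a softer utility (string overlap, an embedding similarity, a task metric) recovers the other variants in the taxonomy.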
@abertsch72
Amanda Bertsch
1 year
@yoavartzi @kohjingyu 's FROMAGe and GILL might be relevant, and they're both public!
@kohjingyu
Jing Yu Koh
1 year
Excited to share GILL🐟 — a method to fuse LLMs with image encoder and decoder models! GILL is one of the first approaches capable of conditioning on interleaved image-text inputs to generate image-text outputs. Paper: Website:
11
88
426
0
0
9
@abertsch72
Amanda Bertsch
2 years
Excited to present this work as a poster at GEM this week! Stop by from 2-3:30 on Wednesday in Capital Suite 5 :)
0
0
8
@abertsch72
Amanda Bertsch
11 months
Also, understanding terminology drift over time is crucial to understanding phenomena like citational amnesia (by @iamjanvijay , @mukund_rungta1 et al), where NLP papers have decreasing temporal diversity of citations, often only citing very recent work.
2
0
8
@abertsch72
Amanda Bertsch
2 years
"yesterday's airport of tomorrow" is such a perfectly weird slogan for an airport. truly today's airport of today
0
0
7
@abertsch72
Amanda Bertsch
11 months
We hope this work leads to broader awareness of MBR and its connections to other methods. This was joint work with the wonderful @axie66 (co-first), @gneubig , and Matt Gormley!
0
0
7
@abertsch72
Amanda Bertsch
11 months
MBR is conceptually simple; the idea is to sample a set of outputs & choose one that is both high probability and high agreement with other outputs (i.e., low risk). While there’s a small community studying MBR, it's not as widely used in the modern NLP/LLM communities.
1
0
7
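The recipe above fits in a few lines (a toy illustration with a unigram-F1 utility; real MBR implementations typically plug in metrics like BLEU, ROUGE, or BERTScore):

```python
def overlap_utility(a, b):
    """Toy agreement score: unigram F1 between two outputs."""
    ta, tb = set(a.split()), set(b.split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def mbr_decode(samples, utility=overlap_utility):
    """Choose the sample with the highest average agreement
    (i.e., the lowest risk) against the other samples."""
    return max(samples,
               key=lambda s: sum(utility(s, o) for o in samples if o is not s))

samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs are great",
]
print(mbr_decode(samples))  # the cat sat on the mat
```

The outlier sample agrees with nothing and is never selected; the winner is the sample closest to the "consensus" of the set.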
@abertsch72
Amanda Bertsch
9 months
Congrats @julius_gulius !! Exciting times for MBR :)
@WilliamWangNLP
William Wang
9 months
Best short paper is about minimum Bayes risk decoding. #EMNLP2023
0
1
24
0
0
6
@abertsch72
Amanda Bertsch
11 months
We show that MBR-like methods can be sensitive to design choices, and the MBR literature exposes improvements & extensions for MBR-like methods. Convergent evolution of methods is natural, but we can better understand how our systems work when we see the underlying connections!
1
0
6
@abertsch72
Amanda Bertsch
9 months
There's so much work right now about human feedback-- collecting it, training with it, evaluating with it... our survey paper provides an overview and taxonomy to make sense of this wave of work. @psanfernandes will be presenting our poster 12/8 at 11am!
1
0
6
@abertsch72
Amanda Bertsch
1 year
@joyem4j for analysis of very long text, this could be an option! we study only generation (not classification) in this work, but either should work :) unlimiformer uses an underlying (usually pretrained) model-- so perf in your setting may depend on how good the available models are
3
0
6
@abertsch72
Amanda Bertsch
11 months
This provides theoretical justification for these methods' empirical success; for instance, results from the MBR literature explain the self-consistency authors' observations about the relative performance of sampling methods in this figure from their paper:
1
0
6
@abertsch72
Amanda Bertsch
9 months
Many decoding methods can be viewed as variants on a classical method-- @axie66 and I will be talking about minimum Bayes risk (MBR), mode-seeking search, and terminology drift at our Big Picture poster *today* (12/6), 2:30-3:30!
1
1
5
@abertsch72
Amanda Bertsch
1 year
@emilymbender @aclanthology I've wondered about this a lot too! I just checked a very weak estimator, which is the % of papers in the anthology that reference Wikipedia by name:
2000-2009: 2.9%
2010-2019: 18.6%
2020-2022: 24.9%
Of course, that ignores the use of models trained on Wikipedia...
1
0
4
@abertsch72
Amanda Bertsch
1 year
If you've ever wondered about the role of semantic representations in modern models, Sireesh has some careful and thorough and really cool work in this area! 4:30 in the IE session
@_sireesh
Sireesh Gururaja
1 year
How can we build models that work better in specialized domains with less data? We investigate using AMRs and dependency parses to facilitate transfer in relation classification between general (cooking) and domain-specific text (materials science).
1
4
23
0
0
5
@abertsch72
Amanda Bertsch
9 months
Many people who aren't in NLP/ML are excited about using LMs for their tasks, but feel locked in to using model APIs. Can we enable these folks to finetune models instead? @vijaytarian and @Chenan3_Zhao will be presenting a poster/demo on 12/10 at 11am!
@vijaytarian
Vijay V.
1 year
LLM APIs let you build NLP systems in seconds; just write a prompt and use it as you wish. But APIs cost money and have privacy concerns. Our new library Prompt2Model turns a prompt into a small expert model that can match LLM performance but runs locally!
5
48
164
1
0
5
@abertsch72
Amanda Bertsch
2 years
this message brought to you by my abnormally scary afternoon commute!
1
0
5
@abertsch72
Amanda Bertsch
1 year
@gneubig @vijaytarian @urialon1 what's an extra 440k tokens among friends? plus we summarize books up to ~530k tokens in the original paper!
0
0
4
@abertsch72
Amanda Bertsch
2 years
This is joint work with @gneubig and Matt Gormley. Please check out the paper for more details! If you have questions, I'd love to chat about this work on Twitter, over email, or in-person at EMNLP this December! :) (9/9)
0
0
4
@abertsch72
Amanda Bertsch
2 years
Our experiments are using SAMSum, but we release annotation materials to replicate on other datasets, and analysis of performance on two other dialogue types: podcast transcripts and Dungeons & Dragons transcripts! (8/9)
1
0
4
@abertsch72
Amanda Bertsch
1 year
@CustomWetware certainly possible! we use kNN search to perform attention-- we are (approximately) choosing the embeddings that would receive the most attention mass. I'm not immediately sure what the interpretation of SVM retrieval would be, but it would be interesting to try empirically!
1
0
3
@abertsch72
Amanda Bertsch
11 months
@mrdrozdov it's definitely one of the more controversial takes we heard!
1
0
4
@abertsch72
Amanda Bertsch
9 months
@SashaMTL @giffmana (shameless plug but) this was one inspiration behind our recent EMNLP demo paper prompt2model! Very excited about this kind of project as a way of enabling people to move away from the "one giant (API-gated) model for everything" setup downstream :)
0
0
3
@abertsch72
Amanda Bertsch
11 months
@jeffbigham @khoomeik honestly a bit surprised at how bad ChatGPT is at mimicking Wikipedia style here! this would be a terrible intro paragraph-- every other phrase is puffery
0
0
3
@abertsch72
Amanda Bertsch
1 year
@lucy3_li Some of the foster places run a multi-round review process now! I know someone who had to do an interview about their application when they were trying to adopt a rescue puppy (By contrast, my cat was almost suspiciously easy to get)
1
0
2
@abertsch72
Amanda Bertsch
2 years
This requires some complex coreference resolution, emotion attribution, and (especially for text messages) formalization. In our experiments on SAMSum, <1k parallel training examples is enough to achieve high performance on this task! But how is this useful? (3/9)
1
0
3
@abertsch72
Amanda Bertsch
2 years
Extractive summaries of the perspective shift have a substantially lower hallucination rate than a standard abstractive system over the original dialogues (5% vs 22% in our sample!). (6/9)
1
0
3
@abertsch72
Amanda Bertsch
2 years
Perspective shifted dialogues have the same /meaning/ as the original dialogue, but their /form/ is much closer to "standardized" text like news. This new input form is also the appropriate style for a summary, making extractive summarization feasible. (4/9)
1
0
3
@abertsch72
Amanda Bertsch
9 months
@shaily99 @QueerinAI it's a wug! I'll grab you one :)
2
0
3
@abertsch72
Amanda Bertsch
2 years
And, an added benefit over a traditional extract-then-abstract system: because of the design of the task, these summaries attribute each claim to the person who sent that message! (7/9)
1
0
3
@abertsch72
Amanda Bertsch
2 years
This allows us to use extractive models trained on news summarization to effectively summarize dialogues! Adding data from the dialogue domain further improves perf, even when training with model-generated perspective shifts (5/9).
1
0
3
@abertsch72
Amanda Bertsch
2 years
@davidthewid has Daniel considered changing his name? he’s clearly been outvoted :(
0
0
2
@abertsch72
Amanda Bertsch
10 months
@m2saxon @shaily99 yup, this happened to me! They were very nice about it, said it happened all the time with their AZ candidates 🫠
0
0
2
@abertsch72
Amanda Bertsch
1 year
@trashhalo they're quite different! Alibi requires pretraining + assumes closer tokens are more important to each other; original paper shows 2x length at test time. Unlimiformer is used after pretraining and its behavior is the same regardless of length; it scales to 100x longer inputs
1
0
2
@abertsch72
Amanda Bertsch
1 year
@mrdrozdov This was a surprise to me as well! I think it's important to point out when our baselines are unexpectedly strong, especially when they're so simple :)
1
0
2
@abertsch72
Amanda Bertsch
2 years
@dmort27 @SCSatCMU @CMU_Robotics This is more of a tooth than a horn, but this battlebot looks a bit like a unicorn!
0
0
2
@abertsch72
Amanda Bertsch
2 years
We propose a new style transfer task: rewriting a dialogue from 1st to 3rd person. This is like reframing the conversation: instead of hearing each speaker talk, there's a single 3rd-person description of what each person said. (2/9)
1
0
2
@abertsch72
Amanda Bertsch
1 year
@CustomWetware (I say "approximately" because we do an approximate kNN search using FAISS. The other factor here would be relative efficiency of the two methods-- though it's possible that's a performance/efficiency tradeoff worth making for some applications.)
0
0
1
@abertsch72
Amanda Bertsch
1 year
@divideconcept @Tim_Dettmers thanks for your interest! adding unlimiformer (without training) generally slightly boosts performance from the base model (see table 3/5 for some examples). training with unlimiformer generally significantly improves performance! (1/2)
1
0
2
@abertsch72
Amanda Bertsch
11 months
@kohjingyu fwiw, I've occasionally said this to folks at cmu and it worked fine! It's not as much of a norm but people will do it
0
0
1
@abertsch72
Amanda Bertsch
2 years
@_sireesh obligatory SIGBOVIK reference:
1
0
1
@abertsch72
Amanda Bertsch
1 year
@joyem4j Hey! This is a bit outside my research area, so I'm afraid I don't have many concrete recommendations. The AfricaNLP workshops may have some good references; this is the most recent one:
1
0
1
@abertsch72
Amanda Bertsch
1 year
@uilydna @LTIatCMU Congrats and welcome!!
0
0
1
@abertsch72
Amanda Bertsch
1 year
@chan_young_park @anjalie_f @linnyKos @Wikimedia Congrats Chan!! NLP has benefited so much from Wikipedia data, it's great to see work recognized that gives something back too :)
0
0
1
@abertsch72
Amanda Bertsch
1 year
@davidthewid this is actually a feature of the library we use for nearest neighbors search! FAISS does an approximate search (sort things into buckets, pick the top n most similar buckets to the query, do exact search within those buckets)
2
0
1
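The bucketed search described in the reply above can be sketched in plain numpy (a toy IVF-style index, not FAISS itself; bucket counts and sizes here are illustrative):

```python
import numpy as np

def build_ivf(vectors, n_buckets, n_iter=10, seed=0):
    """Toy inverted-file index: k-means the vectors into buckets."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_buckets, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(
            ((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for b in range(n_buckets):
            members = vectors[assign == b]
            if len(members):
                centroids[b] = members.mean(0)
    return centroids, assign

def approx_search(query, vectors, centroids, assign, n_probe=2, k=5):
    """Search only the n_probe buckets whose centroids are closest to
    the query, then do exact nearest-neighbor search inside them."""
    buckets = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, buckets))
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
vectors = rng.standard_normal((2000, 16)).astype(np.float32)
centroids, assign = build_ivf(vectors, n_buckets=20)
query = rng.standard_normal(16).astype(np.float32)
print(approx_search(query, vectors, centroids, assign))
```

Probing more buckets (FAISS's `nprobe`) trades speed for recall; probing all of them recovers exact search.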
@abertsch72
Amanda Bertsch
1 year
@ImageDeeply thanks for pointing this out! license is MIT; it's been added to the repo.
0
0
1
@abertsch72
Amanda Bertsch
1 year
@divideconcept @Tim_Dettmers slight drawback: we cannot backprop through all embeddings at training time (only the first 16k or so, due to memory constraints). we propose a workaround for this (chunking inputs into multiple examples), but there's definitely more to be done in this space! (2/2)
0
0
1
@abertsch72
Amanda Bertsch
2 years
0
0
1
@abertsch72
Amanda Bertsch
1 year
@fire @urialon1 @gneubig It might be challenging to combine with RMT directly-- I think the key decision would be whether memory tokens are always retrieved (which reduces effective retrieval window) or indexed with the rest of input (which risks losing the memory vector in some decoding steps)
1
0
1
@abertsch72
Amanda Bertsch
1 year
@ericjang11 thanks for the ref! I've read this but must have overlooked it when writing the related work-- we'll be sure to add it in the next version :)
0
0
1
@abertsch72
Amanda Bertsch
1 year
@trashhalo I'd use alibi to pretrain a model from scratch! If you have a model already pretrained, Unlimiformer can expand context length. Unlimiformer can also be added to an alibi'd model-- not sure how alibi would impact perf, but our method does not modify the position representation
0
0
1
@abertsch72
Amanda Bertsch
1 year
@xiangrenNLP Congrats!!
0
0
1
@abertsch72
Amanda Bertsch
2 years
We also observe biases that reflect political values in each era— such as relatively low unpleasantness of weapons during World War II, when Hollywood produced war propaganda, and relatively high unpleasantness of weapons during the Vietnam War Era, when Hollywood was anti-war.
1
0
1
@abertsch72
Amanda Bertsch
1 year
@fire @urialon1 @gneubig But it's certainly possible to engineer a solution that merges the two! I don't have a solid guess at what that would do to the performance-- we don't have the capacity to try this right now, but might be interesting to find out
0
0
1
@abertsch72
Amanda Bertsch
1 year
@javad_mohmad thanks for the interest! we don't update anything in the tokenizer, so the BART-base one will work
0
0
1