In-context learning provides an LLM with a few examples to improve accuracy. But with long-context LLMs, we can now use *thousands* of examples in-context.
We find that this long-context ICL paradigm is surprisingly effective, and that it behaves differently from short-context ICL! 🧵
What do decoding methods including self-consistency, output ensembling, and range voting have in common? They’re all variants of Minimum Bayes Risk (MBR) decoding!
This useful and easy-to-apply method generalizes many modern generation techniques!
👇🧵
What if we could run Transformer models without worrying about context length?
With our new work Unlimiformer, you can jailbreak your current models to use unlimited length inputs!
Preprint:
Thread 🧵 (1/6)
Unlimiformer is a retrieval-augmentation method for encoder-decoder models. It dynamically updates the context window at each decoding step, so that each head in each layer attends to its own top-k input tokens. Unlimiformer can be added to any pretrained encoder-decoder! (2/6)
What if you could get the benefits of extractive summarization in the dialogue domain? Check out our
#EMNLP2022
Findings paper "He Said, She Said: Style Transfer for Shifting the Perspectives of Dialogues"!
paper link:
🧵👇 (1/9)
At test time, Unlimiformer can summarize *books* of more than 300k tokens without truncation!
And inference cost increases sub-linearly with input length, so summarizing 100k token inputs is only 1.5x slower than summarizing 1k token inputs. (5/6)
I'm at NeurIPS this week! Would love to chat with folks interested in long contexts, fair evaluation, decoding, and/or philosophy of science.
I'll also be presenting our paper Unlimiformer (below) in poster session 5: 10:45-12:45 Thursday. Check it out at board # 524!
Excited to see new and old friends this week in Singapore! Happy to chat about long context reasoning, summarization, decoding, philosophy of science, or finding good vegetarian food at hawker centers :)
My collaborators and I are presenting several papers, starting today! ⬇️
We reformulate attention for efficient retrieval using a single datastore, shared across all cross-attention heads in all decoder layers. This requires storing only a single vector per input token, allowing inputs of more than half a million tokens on a single GPU! (3/6)
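The reformulation above can be sketched in a few lines of NumPy: since the attention score (h·W_q)(e·W_k)^T equals (h·W_q·W_k^T)·e^T, both projections can be folded into the query, so the datastore only needs the raw encoder states. This is a toy sketch with random stand-in weights; exact top-k here stands in for the approximate kNN search used at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, k = 64, 1000, 8  # hidden size, input length, tokens retrieved per head

# Single datastore: one (unprojected) hidden state per input token,
# shared across every head and layer.
encoder_states = rng.standard_normal((n_tokens, d))

# One head's projections (random stand-ins for trained weights).
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
h_dec = rng.standard_normal(d)  # decoder state at the current step

# (h W_q)(e W_k)^T == (h W_q W_k^T) e^T: fold both projections into the
# query, so the search runs directly over raw encoder states.
query = h_dec @ W_q @ W_k.T
scores = encoder_states @ query

# Exact top-k stands in for the approximate kNN search (FAISS) used at
# scale; this head then attends only to its retrieved tokens.
topk_idx = np.argpartition(-scores, k)[:k]
```

The folding identity is why a single vector per token suffices: no per-head keys ever need to be materialized in the index.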
Unlimiformer improves existing models without any fine-tuning, in cheap fine-tuning regimes, and with specialized training. Not only does it outperform other strong long-range transformers, but it can be applied on top of models such as LongFormer for further improvements! (4/6)
@nsaphra
@enfleisig
@kchonyc
excited to read this, it looks really cool! you might also like our paper, which discusses some similar themes from a different methodological angle :)
We all know that “recently large language models have”, “large language models are”, and “large language models can.” But why LLMs? How did we get here? (where is “here”?) What forces are shaping NLP, and how recent are they, actually?
To appear at EMNLP:
*Human feedback* was the necessary secret sauce in making
#chatgpt
so human-like
But what exactly is feedback? And how can we leverage it to improve our models?
Check out our new survey on the use of (human) feedback in Natural Language Generation!
1/16
#NAACL2022
this week has been incredible! For folks still in Seattle, I’ll be co-presenting our paper “Evaluating Gender Bias Transfer from Film” at 4:15 in room 702-Clearwater, at the
@genderbiasnlp
workshop.
(Not at NAACL or left a bit early? summary below ⬇️🧵)
And finally, I'm so proud of our study of cultural and methodological shifts in NLP. What are the incentives that drive our field? How'd we wind up here?
@claranahhh
will be giving a talk on this work 12/10 at 11:30 in the West 2 Ballroom!
if driving makes you so frustrated that you need to tailgate and scream obscenities at bicyclists sharing the road with you, maybe you…shouldn’t drive?
However, many modern methods such as self-consistency (by X Wang,
@jasonwei
, et al), range voting (by
@borgeaud_s
& G Emerson), and output-ensembling methods (e.g. Post-Ensemble by
@hakobayato
) can be written as subcases or variants of MBR. Our taxonomy provides a unified view.
Excited to share GILL🐟 — a method to fuse LLMs with image encoder and decoder models!
GILL is one of the first approaches capable of conditioning on interleaved image-text inputs to generate image-text outputs.
Paper:
Website:
Also, understanding terminology drift over time is crucial to understanding phenomena like citational amnesia (by
@iamjanvijay
,
@mukund_rungta1
et al), where NLP papers have decreasing temporal diversity of citations, often only citing very recent work.
We hope this work leads to broader awareness of MBR and its connections to other methods.
This was joint work with the wonderful
@axie66
(co-first),
@gneubig
, and Matt Gormley!
MBR is conceptually simple; the idea is to sample a set of outputs & choose one that is both high probability and high agreement with other outputs (i.e., low risk). While there’s a small community studying MBR, it's not as widely used in the modern NLP/LLM communities.
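Sketching the idea above in a few lines (a toy illustration, not any particular paper's implementation; `similarity` is a hypothetical stand-in for the utility/gain function):

```python
# Minimum Bayes Risk decoding sketch: sample candidate outputs, then
# pick the one with the highest average agreement (i.e., lowest risk)
# with the other candidates under some similarity function.
def mbr_select(candidates, similarity):
    scores = []
    for i, c in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(similarity(c, o) for o in others) / len(others))
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# With exact-match similarity, MBR reduces to majority voting --
# the self-consistency special case.
exact_match = lambda a, b: float(a == b)
print(mbr_select(["42", "42", "17"], exact_match))  # prints 42
```

Swapping in other similarity functions (e.g., a text-overlap or embedding metric) recovers other members of the family.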
We show that MBR-like methods can be sensitive to design choices, and the MBR literature exposes improvements & extensions for MBR-like methods.
Convergent evolution of methods is natural, but we can better understand how our systems work when we see the underlying connections!
There's so much work right now about human feedback-- collecting it, training with it, evaluating with it... our survey paper provides an overview and taxonomy to make sense of this wave of work.
@psanfernandes
will be presenting our poster 12/8 at 11am!
@joyem4j
for analysis of very long text, this could be an option! we study only generation (not classification) in this work, but either should work :)
unlimiformer uses an underlying (usually pretrained) model-- so perf in your setting may depend on how good the available models are
This provides theoretical justification for these methods' empirical success; for instance, results from the MBR literature explain the self-consistency authors' observations about the relative performance of sampling methods in this figure from their paper:
Many decoding methods can be viewed as variants on a classical method--
@axie66
and I will be talking about minimum Bayes risk (MBR), mode-seeking search, and terminology drift at our Big Picture poster *today* (12/6), 2:30-3:30!
@emilymbender
@aclanthology
I've wondered about this a lot too! I just checked a very weak estimator, which is the % of papers in the anthology that reference Wikipedia by name:
2000-2009: 2.9%
2010-2019: 18.6%
2020-2022: 24.9%
Of course, that ignores the use of models trained on Wikipedia...
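The estimator itself is simple to implement; a hypothetical sketch, assuming each period's paper texts are available as strings:

```python
# Weak estimator: fraction of papers per period whose text mentions
# "Wikipedia" by name (case-insensitive substring match).
def wikipedia_mention_rate(papers_by_period):
    return {
        period: sum("wikipedia" in text.lower() for text in texts) / len(texts)
        for period, texts in papers_by_period.items()
    }

# Toy input (hypothetical snippets, not real anthology data).
rates = wikipedia_mention_rate({
    "2000-2009": ["We use WordNet.", "We mine Wikipedia dumps."],
    "2010-2019": ["Wikipedia-based embeddings.", "Wikipedia as corpus."],
})
print(rates)  # {'2000-2009': 0.5, '2010-2019': 1.0}
```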
If you've ever wondered about the role of semantic representations in modern models, Sireesh has some careful, thorough, and really cool work in this area! 4:30 in the IE session
How can we build models that work better in specialized domains with less data?
We investigate using AMRs and dependency parses to facilitate transfer in relation classification between general (cooking) and domain-specific text (materials science).
Many people who aren't in NLP/ML are excited about using LMs for their tasks, but feel locked in to using model APIs. Can we enable these folks to finetune models instead?
@vijaytarian
and
@Chenan3_Zhao
will be presenting a poster/demo on 12/10 at 11am!
LLM APIs let you build NLP systems in seconds; just write a prompt and use it as you wish. But APIs cost money and have privacy concerns.
Our new library Prompt2Model turns a prompt into a small expert model that can match LLM performance but runs locally!
This is joint work with
@gneubig
and Matt Gormley. Please check out the paper for more details! If you have questions, I'd love to chat about this work on Twitter, over email, or in-person at EMNLP this December! :) (9/9)
Our experiments are using SAMSum, but we release annotation materials to replicate on other datasets, and analysis of performance on two other dialogue types: podcast transcripts and Dungeons & Dragons transcripts! (8/9)
@CustomWetware
certainly possible! we use kNN search to perform attention-- we are (approximately) choosing the embeddings that would receive the most attention mass. I'm not immediately sure what the interpretation of SVM retrieval would be, but it would be interesting to try empirically!
@SashaMTL
@giffmana
(shameless plug but) this was one inspiration behind our recent EMNLP demo paper prompt2model! Very excited about this kind of project as a way of enabling people to move away from the "one giant (API-gated) model for everything" setup downstream :)
@jeffbigham
@khoomeik
honestly a bit surprised at how bad ChatGPT is at mimicking Wikipedia style here! this would be a terrible intro paragraph-- every other phrase is puffery
@lucy3_li
Some of the foster places run a multi-round review process now! I know someone who had to do an interview about their application when they were trying to adopt a rescue puppy
(By contrast, my cat was almost suspiciously easy to get)
This requires some complex coreference resolution, emotion attribution, and (especially for text messages) formalization.
In our experiments on SAMSum, <1k parallel training examples is enough to achieve high performance on this task! But how is this useful? (3/9)
Extractive summaries of the perspective shift have a substantially lower hallucination rate than a standard abstractive system over the original dialogues (5% vs 22% in our sample!). (6/9)
Perspective shifted dialogues have the same /meaning/ as the original dialogue, but their /form/ is much closer to "standardized" text like news. This new input form is also the appropriate style for a summary, making extractive summarization feasible. (4/9)
And, an added benefit over a traditional extract-then-abstract system: because of the design of the task, these summaries attribute each claim to the person who sent that message! (7/9)
This allows us to use extractive models trained on news summarization to effectively summarize dialogues!
Adding data from the dialogue domain further improves perf, even when training with model-generated perspective shifts (5/9).
@trashhalo
they're quite different! ALiBi requires pretraining + assumes closer tokens are more important to each other; the original paper shows 2x length at test time. Unlimiformer is applied after pretraining and behaves the same regardless of length; it scales to 100x longer inputs
@mrdrozdov
This was a surprise to me as well! I think it's important to point out when our baselines are unexpectedly strong, especially when they're so simple :)
We propose a new style transfer task: rewriting a dialogue from 1st to 3rd person. This is like reframing the conversation: instead of hearing each speaker talk, there's a single 3rd-person description of what each person said. (2/9)
@CustomWetware
(I say "approximately" because we do an approximate kNN search using FAISS. The other factor here would be relative efficiency of the two methods-- though it's possible that's a performance/efficiency tradeoff worth making for some applications.)
@divideconcept
@Tim_Dettmers
thanks for your interest! adding unlimiformer (without training) generally slightly boosts performance from the base model (see table 3/5 for some examples). training with unlimiformer generally significantly improves performance! (1/2)
@joyem4j
Hey! This is a bit outside my research area, so I'm afraid I don't have many concrete recommendations. The AfricaNLP workshops may have some good references; this is the most recent one:
@davidthewid
this is actually a feature of the library we use for nearest neighbors search! FAISS does an approximate search (sort things into buckets, pick the top n most similar buckets to the query, do exact search within those buckets)
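A minimal NumPy stand-in for that bucket-then-exact-search idea (an IVF-style index like FAISS's; the "centroids" here are random data points, where FAISS would train them with k-means):

```python
import numpy as np

# Toy inverted-file (IVF) approximate search: assign vectors to buckets,
# then probe only the closest buckets at query time.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 16))
n_buckets, n_probe = 10, 3

# Random "centroids" stand in for k-means-trained ones.
centroids = data[rng.choice(len(data), n_buckets, replace=False)]
assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)

def approx_nearest(query):
    # Pick the n_probe buckets whose centroids are closest to the query,
    # then do an exact search within just those buckets.
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.where(np.isin(assign, probe))[0]
    return cand[np.argmin(((data[cand] - query) ** 2).sum(-1))]
```

The search is approximate because the true nearest neighbor can live in a bucket that isn't probed; raising the number of probed buckets trades speed for recall.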
@divideconcept
@Tim_Dettmers
slight drawback: we cannot backprop through all embeddings at training time (only the first 16k or so, due to memory constraints).
we propose a workaround for this (chunking inputs into multiple examples), but there's definitely more to be done in this space! (2/2)
@fire
@urialon1
@gneubig
It might be challenging to combine with RMT directly-- I think the key decision would be whether memory tokens are always retrieved (which reduces effective retrieval window) or indexed with the rest of input (which risks losing the memory vector in some decoding steps)
@ericjang11
thanks for the ref! I've read this but must have overlooked it when writing the related work-- we'll be sure to add it in the next version :)
@trashhalo
I'd use ALiBi to pretrain a model from scratch! If you have a model already pretrained, Unlimiformer can expand its context length. Unlimiformer can also be added to an ALiBi'd model-- not sure how ALiBi would impact perf, but our method does not modify the position representation
We also observe biases that reflect political values in each era, such as relatively low unpleasantness of weapons during World War II, when Hollywood produced war propaganda, and relatively high unpleasantness of weapons during the Vietnam War era, when Hollywood was anti-war.
@fire
@urialon1
@gneubig
But it's certainly possible to engineer a solution that merges the two! I don't have a solid guess at what that would do to the performance-- we don't have the capacity to try this right now, but might be interesting to find out