In-context learning provides an LLM with a few examples to improve accuracy. But with long-context LLMs, we can now use *thousands* of examples in-context.
We find that this long-context ICL paradigm is surprisingly effective, and that it behaves differently from short-context ICL! 🧵
What do decoding methods including self-consistency, output ensembling, and range voting have in common? They’re all variants of Minimum Bayes Risk (MBR) decoding!
This useful and easy-to-apply method generalizes many modern generation techniques!
👇🧵
What if we could run Transformer models without worrying about context length?
With our new work Unlimiformer, you can jailbreak your current models to use unlimited length inputs!
Preprint:
Thread 🧵 (1/6)
Unlimiformer is a retrieval-augmentation method for encoder-decoder models. It dynamically updates the context window at each decoding step, so that each head in each layer attends to its own top-k input tokens. Unlimiformer can be added to any pretrained encoder-decoder! (2/6)
What if you could get the benefits of extractive summarization in the dialogue domain? Check out our
#EMNLP2022
Findings paper "He Said, She Said: Style Transfer for Shifting the Perspectives of Dialogues"!
paper link:
🧵👇 (1/9)
At test time, Unlimiformer can summarize *books* of more than 300k tokens without truncation!
And inference cost increases sub-linearly with input length, so summarizing 100k token inputs is only 1.5x slower than summarizing 1k token inputs. (5/6)
I'm at NeurIPS this week! Would love to chat with folks interested in long contexts, fair evaluation, decoding, and/or philosophy of science.
I'll also be presenting our paper Unlimiformer (below) in poster session 5: 10:45-12:45 Thursday. Check it out at board # 524!
Excited to see new and old friends this week in Singapore! Happy to chat about long context reasoning, summarization, decoding, philosophy of science, or finding good vegetarian food at hawker centers :)
My collaborators and I are presenting several papers, starting today! ⬇️
We reformulate attention for efficient retrieval using a single datastore, shared across all cross-attention heads in all decoder layers. This requires storing only a single vector per input token, allowing inputs of more than half a million tokens on a single GPU! (3/6)
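The reformulation above can be sketched in a few lines of NumPy: since the attention score (h·W_q)(e·W_k)^T equals (h·W_q·W_k^T)·e^T, both projections can be folded into the query, so the datastore only needs the raw encoder states. This is a toy sketch with random stand-in weights; exact top-k here stands in for the approximate kNN search used at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, k = 64, 1000, 8  # hidden size, input length, tokens retrieved per head

# Single datastore: one (unprojected) hidden state per input token,
# shared across every head and layer.
encoder_states = rng.standard_normal((n_tokens, d))

# One head's projections (random stand-ins for trained weights).
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
h_dec = rng.standard_normal(d)  # decoder state at the current step

# (h W_q)(e W_k)^T == (h W_q W_k^T) e^T: fold both projections into the
# query, so the search runs directly over raw encoder states.
query = h_dec @ W_q @ W_k.T
scores = encoder_states @ query

# Exact top-k stands in for the approximate kNN search (FAISS) used at
# scale; this head then attends only to its retrieved tokens.
topk_idx = np.argpartition(-scores, k)[:k]
```

The folding identity is why a single vector per token suffices: no per-head keys ever need to be materialized in the index.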
Unlimiformer improves existing models without any fine-tuning, in cheap fine-tuning regimes, and with specialized training. Not only does it outperform other strong long-range transformers, but it can be applied on top of models such as LongFormer for further improvements! (4/6)
@nsaphra
@enfleisig
@kchonyc
excited to read this, it looks really cool! you might also like our paper, which discusses some similar themes from a different methodological angle :)
We all know that “recently large language models have”, “large language models are”, and “large language models can.” But why LLMs? How did we get here? (where is “here”?) What forces are shaping NLP, and how recent are they, actually?
To appear at EMNLP:
*Human feedback* was the necessary secret sauce in making
#chatgpt
so human-like
But what exactly is feedback? And how can we leverage it to improve our models?
Check out our new survey on the use of (human) feedback in Natural Language Generation!
1/16
#NAACL2022
this week has been incredible! For folks still in Seattle, I’ll be co-presenting our paper “Evaluating Gender Bias Transfer from Film” at 4:15 in room 702-Clearwater, at the
@genderbiasnlp
workshop.
(Not at NAACL or left a bit early? summary below ⬇️🧵)
And finally, I'm so proud of our study of cultural and methodological shifts in NLP. What are the incentives that drive our field? How'd we wind up here?
@claranahhh
will be giving a talk on this work 12/10 at 11:30 in the West 2 Ballroom!
if driving makes you so frustrated that you need to tailgate and scream obscenities at bicyclists sharing the road with you, maybe you…shouldn’t drive?
However, many modern methods such as self-consistency (by X Wang,
@jasonwei
, et al), range voting (by
@borgeaud_s
& G Emerson), and output-ensembling methods (e.g. Post-Ensemble by
@hakobayato
) can be written as subcases or variants of MBR. Our taxonomy provides a unified view.
Excited to share GILL🐟 — a method to fuse LLMs with image encoder and decoder models!
GILL is one of the first approaches capable of conditioning on interleaved image-text inputs to generate image-text outputs.
Paper:
Website:
Also, understanding terminology drift over time is crucial to understanding phenomena like citational amnesia (by
@iamjanvijay
,
@mukund_rungta1
et al), where NLP papers have decreasing temporal diversity of citations, often only citing very recent work.
We hope this work leads to broader awareness of MBR and its connections to other methods.
This was joint work with the wonderful
@axie66
(co-first),
@gneubig
, and Matt Gormley!
MBR is conceptually simple; the idea is to sample a set of outputs & choose one that is both high probability and high agreement with other outputs (i.e., low risk). While there’s a small community studying MBR, it's not as widely used in the modern NLP/LLM communities.
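Sketching the idea above in a few lines (a toy illustration, not any particular paper's implementation; `similarity` is a hypothetical stand-in for the utility/gain function):

```python
# Minimum Bayes Risk decoding sketch: sample candidate outputs, then
# pick the one with the highest average agreement (i.e., lowest risk)
# with the other candidates under some similarity function.
def mbr_select(candidates, similarity):
    scores = []
    for i, c in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(similarity(c, o) for o in others) / len(others))
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# With exact-match similarity, MBR reduces to majority voting --
# the self-consistency special case.
exact_match = lambda a, b: float(a == b)
print(mbr_select(["42", "42", "17"], exact_match))  # prints 42
```

Swapping in other similarity functions (e.g., a text-overlap or embedding metric) recovers other members of the family.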
We show that MBR-like methods can be sensitive to design choices, and the MBR literature exposes improvements & extensions for MBR-like methods.
Convergent evolution of methods is natural, but we can better understand how our systems work when we see the underlying connections!
There's so much work right now about human feedback-- collecting it, training with it, evaluating with it... our survey paper provides an overview and taxonomy to make sense of this wave of work.
@psanfernandes
will be presenting our poster 12/8 at 11am!
@joyem4j
for analysis of very long text, this could be an option! we study only generation (not classification) in this work, but either should work :)
unlimiformer uses an underlying (usually pretrained) model-- so perf in your setting may depend on how good the available models are
This provides theoretical justification for these methods' empirical success; for instance, results from the MBR literature explain the self-consistency authors' observations about the relative performance of sampling methods in this figure from their paper:
Many decoding methods can be viewed as variants on a classical method--
@axie66
and I will be talking about minimum Bayes risk (MBR), mode-seeking search, and terminology drift at our Big Picture poster *today* (12/6), 2:30-3:30!
@emilymbender
@aclanthology
I've wondered about this a lot too! I just checked a very weak estimator, which is the % of papers in the anthology that reference Wikipedia by name:
2000-2009: 2.9%
2010-2019: 18.6%
2020-2022: 24.9%
Of course, that ignores the use of models trained on Wikipedia...
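The estimator itself is simple to implement; a hypothetical sketch, assuming each period's paper texts are available as strings:

```python
# Weak estimator: fraction of papers per period whose text mentions
# "Wikipedia" by name (case-insensitive substring match).
def wikipedia_mention_rate(papers_by_period):
    return {
        period: sum("wikipedia" in text.lower() for text in texts) / len(texts)
        for period, texts in papers_by_period.items()
    }

# Toy input (hypothetical snippets, not real anthology data).
rates = wikipedia_mention_rate({
    "2000-2009": ["We use WordNet.", "We mine Wikipedia dumps."],
    "2010-2019": ["Wikipedia-based embeddings.", "Wikipedia as corpus."],
})
print(rates)  # {'2000-2009': 0.5, '2010-2019': 1.0}
```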
If you've ever wondered about the role of semantic representations in modern models, Sireesh has some careful, thorough, and really cool work in this area! 4:30 in the IE session
How can we build models that work better in specialized domains with less data?
We investigate using AMRs and dependency parses to facilitate transfer in relation classification between general (cooking) and domain-specific text (materials science).
Many people who aren't in NLP/ML are excited about using LMs for their tasks, but feel locked in to using model APIs. Can we enable these folks to finetune models instead?
@vijaytarian
and
@Chenan3_Zhao
will be presenting a poster/demo on 12/10 at 11am!
LLM APIs let you build NLP systems in seconds; just write a prompt and use it as you wish. But APIs cost money and have privacy concerns.
Our new library Prompt2Model turns a prompt into a small expert model that can match LLM performance but runs locally!
This is joint work with
@gneubig
and Matt Gormley. Please check out the paper for more details! If you have questions, I'd love to chat about this work on Twitter, over email, or in-person at EMNLP this December! :) (9/9)
Our experiments are using SAMSum, but we release annotation materials to replicate on other datasets, and analysis of performance on two other dialogue types: podcast transcripts and Dungeons & Dragons transcripts! (8/9)
@CustomWetware
certainly possible! we use kNN search to perform attention-- we are (approximately) choosing the embeddings that would receive the most attention mass. I'm not immediately sure what the interpretation of SVM retrieval would be, but it would be interesting to try empirically!
@SashaMTL
@giffmana
(shameless plug but) this was one inspiration behind our recent EMNLP demo paper prompt2model! Very excited about this kind of project as a way of enabling people to move away from the "one giant (API-gated) model for everything" setup downstream :)
@jeffbigham
@khoomeik
honestly a bit surprised at how bad ChatGPT is at mimicking Wikipedia style here! this would be a terrible intro paragraph-- every other phrase is puffery
@lucy3_li
Some of the foster places run a multi-round review process now! I know someone who had to do an interview about their application when they were trying to adopt a rescue puppy
(By contrast, my cat was almost suspiciously easy to get)
This requires some complex coreference resolution, emotion attribution, and (especially for text messages) formalization.
In our experiments on SAMSum, <1k parallel training examples is enough to achieve high performance on this task! But how is this useful? (3/9)
Extractive summaries of the perspective shift have a substantially lower hallucination rate than a standard abstractive system over the original dialogues (5% vs 22% in our sample!). (6/9)
Perspective shifted dialogues have the same /meaning/ as the original dialogue, but their /form/ is much closer to "standardized" text like news. This new input form is also the appropriate style for a summary, making extractive summarization feasible. (4/9)
And, an added benefit over a traditional extract-then-abstract system: because of the design of the task, these summaries attribute each claim to the person who sent that message! (7/9)
This allows us to use extractive models trained on news summarization to effectively summarize dialogues!
Adding data from the dialogue domain further improves perf, even when training with model-generated perspective shifts (5/9).
@trashhalo
they're quite different! ALiBi requires pretraining + assumes closer tokens are more important to each other; the original paper shows 2x length at test time. Unlimiformer is applied after pretraining and behaves the same regardless of length; it scales to 100x longer inputs
@mrdrozdov
This was a surprise to me as well! I think it's important to point out when our baselines are unexpectedly strong, especially when they're so simple :)
We propose a new style transfer task: rewriting a dialogue from 1st to 3rd person. This is like reframing the conversation: instead of hearing each speaker talk, there's a single 3rd-person description of what each person said. (2/9)
@CustomWetware
(I say "approximately" because we do an approximate kNN search using FAISS. The other factor here would be relative efficiency of the two methods-- though it's possible that's a performance/efficiency tradeoff worth making for some applications.)
@divideconcept
@Tim_Dettmers
thanks for your interest! adding unlimiformer (without training) generally slightly boosts performance from the base model (see table 3/5 for some examples). training with unlimiformer generally significantly improves performance! (1/2)
@joyem4j
Hey! This is a bit outside my research area, so I'm afraid I don't have many concrete recommendations. The AfricaNLP workshops may have some good references; this is the most recent one:
@davidthewid
this is actually a feature of the library we use for nearest neighbors search! FAISS does an approximate search (sort things into buckets, pick the top n most similar buckets to the query, do exact search within those buckets)
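A minimal NumPy stand-in for that bucket-then-exact-search idea (an IVF-style index like FAISS's; the "centroids" here are random data points, where FAISS would train them with k-means):

```python
import numpy as np

# Toy inverted-file (IVF) approximate search: assign vectors to buckets,
# then probe only the closest buckets at query time.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 16))
n_buckets, n_probe = 10, 3

# Random "centroids" stand in for k-means-trained ones.
centroids = data[rng.choice(len(data), n_buckets, replace=False)]
assign = np.argmin(((data[:, None] - centroids) ** 2).sum(-1), axis=1)

def approx_nearest(query):
    # Pick the n_probe buckets whose centroids are closest to the query,
    # then do an exact search within just those buckets.
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.where(np.isin(assign, probe))[0]
    return cand[np.argmin(((data[cand] - query) ** 2).sum(-1))]
```

The search is approximate because the true nearest neighbor can live in a bucket that isn't probed; raising the number of probed buckets trades speed for recall.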
@divideconcept
@Tim_Dettmers
slight drawback: we cannot backprop through all embeddings at training time (only the first 16k or so, due to memory constraints).
we propose a workaround for this (chunking inputs into multiple examples), but there's definitely more to be done in this space! (2/2)
@fire
@urialon1
@gneubig
It might be challenging to combine with RMT directly-- I think the key decision would be whether memory tokens are always retrieved (which reduces effective retrieval window) or indexed with the rest of input (which risks losing the memory vector in some decoding steps)
@ericjang11
thanks for the ref! I've read this but must have overlooked it when writing the related work-- we'll be sure to add it in the next version :)
@trashhalo
I'd use ALiBi to pretrain a model from scratch! If you have a model already pretrained, Unlimiformer can expand its context length. Unlimiformer can also be added to an ALiBi'd model-- not sure how ALiBi would impact perf, but our method does not modify the position representation
We also observe biases that reflect political values in each era, such as relatively low unpleasantness of weapons during World War II, when Hollywood produced war propaganda, and relatively high unpleasantness of weapons during the Vietnam War era, when Hollywood was anti-war.
@fire
@urialon1
@gneubig
But it's certainly possible to engineer a solution that merges the two! I don't have a solid guess at what that would do to the performance-- we don't have the capacity to try this right now, but might be interesting to find out