I’m joining the Columbia Computer Science faculty as an assistant professor in fall 2025, and hiring my first students this upcoming cycle!!
There’s so much to understand and improve in neural systems that learn from language — come tackle this with me!
Does my unsupervised neural network learn syntax? In new
#NAACL2019
paper with
@chrmanning
, our "structural probe" can show that your word representations embed entire parse trees.
paper:
blog:
code:
1/4
For this year's CS 224n: Natural Language Processing with Deep Learning, I've written notes on our Self-Attention and Transformers lecture.
Topics: Problems with RNNs, then self-attention, then a 'minimal' self-attention architecture, then Transformers.
#acl2023
! To understand language models, we must know how activation interventions affect predictions for any prefix. Hard for Transformers.
Enter: the Backpack. Predictions are a weighted sum of non-contextual word vectors.
-> predictable interventions!
I'm on the faculty market!
My goal is to build language systems that we understand deeply through discovery and by design, so we can precisely control them and treat their failures.
Let's tackle this grand challenge of science and engineering together.
#emnlp2020
paper: we give some theoretical insight into the syntactic success of RNN LMs: we prove they can implement bounded-size stacks in their states to generate some bounded hierarchical langs with optimal memory!
paper
blog
Our paper on Backpacks has won an Outstanding Paper Award at ACL 2023!
If you're excited about both fascinating learned structure in language models, and designing architectures to enable interpretability while maintaining expressivity, take a read!
It’s conference time! Come say hello at EMNLP to hear my hot takes on understanding LMs
Is your CS department hiring? Hey nice come talk to me!
Do you know few people at EMNLP? Not for long; come talk to me!
Here’s what I look like at a poster session when the lights go out
This winter, I’ll be helping
@chrmanning
teach NLP with Deep Learning (CS224n). Every year, we attempt to update the course to best teach our students. For this, I am learning from how others teach topics in NLP. Please share your favorite technical explanation of an NLP topic!
Ever added new words to the vocabulary of your language model, only to generate from it and get gibberish?
In a technical blog post I detail why this happens, and that representing new words as an average of existing words solves the problem.
We characterize and improve on language model _truncation sampling_ algorithms, like top-p and top-k. We frame them as trying to recover the true distribution support from an implicitly _smoothed_ neural LM, and provide a better sampling algo!
Paper
If you're adding new tokens to Gemma, you're likely running into the "all logits are negative, so randomly init embedding with a logit of ~0 dominates the softmax" problem! Averaging existing embeddings solves this by bounding KL from initial model. See:
Teaching CS224N (twice now!) with
@chrmanning
has been one of the most rewarding parts of my PhD, not least because the notes and videos are public. Lots of exciting new lectures (RLHF, generation,++) here, as well as refined Transformers and pretraining lectures!
A 2023 update of the CS224N Natural Language Processing with Deep Learning YouTube playlist is now available with new lectures on pretrained models, prompting, RLHF, natural language and code generation, linguistics, interpretability and more.
#NLProc
I'll be at ACL2023! If you're there and don't know anyone, come say hi! (Or let your students know I'm happy to chat!)
I'll be presenting Backpack Language Models
and helping give a tutorial on Generating Text from Language Models!
It's
#acl2020nlp
and one of the best parts of a conf is meeting new people. If you'd like to chat
#nlproc
, and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
How do we design probes that give us insight into a representation? In
#emnlp2019
paper with
@percyliang
, our "control tasks" help us understand the capacity of a probe to make decisions unmotivated by the repr.
paper:
blog:
How do I 'probe' a representation for just the aspects of a property (like part-of-speech) that aren't captured by a baseline (like word identity)? In
#emnlp2021
paper, we propose conditional probing, which does this!
paper:
blog:
Very thankful for the chance to give this talk! Students interested in understanding neural representations of language, I’d love if you came and gave your thoughts and perspectives on this ongoing work on the probing methodology.
We are very excited to announce our next speaker!!
🗣John Hewitt
@johnhewtt
talking with us about
❓"Language Probes as V-information Estimators"
🗓Sept 9th, 14:00 UTC
📝Sign up here:
It's
#emnlp2020
and one of the best parts of a conf is meeting new people. If you'd like to chat
#nlproc
, and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
So a lot of people have arrived here; please read
@nsaphra
's excellent take on neural net probes and
@nelsonfliu
's comprehensive neural net probing study, both also at
#naacl2019
Saphra:
Liu:
I'm still prepping the camera-ready for my
@naacl
paper, but if people take away one thing, I want it to be that they should be specific in what they mean when they say a representation "encodes" some linguistic property, and to recognize the drawbacks of their definition.
Come see this panel I'll speak on! There's so much to understand about language models that it's a good thing we have multiple rich subcommunities with differing perspectives and expertise -- this panel will facilitate sharing ideas and refining goals.
BlackboxNLP will this year feature a panel discussion on "Mechanistic Interpretability". We hope this panel may serve as a way of creating stronger bridges between interpretability in NLP and MI!
We are now collecting questions for the discussion here:
In analysis of neural nets, there’s no single right way to “probe” the neural net’s representations. In this opinion piece, we draw from neuroscience to enumerate a few distinct goals of probing and how each guides the design of the probe.
Check out our short opinion piece where we draw parallels between investigating brains and neural nets!
"Probing artificial neural networks: insights from neuroscience"
Written with
@NogaZaslavsky
and
@JohnHewtt
for the
#brain2AI
#ICLR2021
workshop.
1/
I'll be excitedly yammering about structural probes and finding syntax in unsupervised representations today at 4:15 in Nicollet B/C
#naacl2019
. Even if you don't ❤️ parse trees, come by to learn a method to tell if your neural network softly encodes tree structures!
I'm so glad this content is now freely available. As head TA this past year, I had the privilege of writing and giving 3 lectures: on self-attention & Transformers, pretraining, and model analysis & explanation. I hope many find them useful in their studies!
Looking for a series to binge-watch with more depth? We are delighted to make available the latest CS224N: Natural Language Processing with Deep Learning. New content on transformers, pre-trained models, NLG, knowledge, and ethical considerations.
#NLProc
I enjoyed chatting with
@waleed_ammar
and
@nlpmattg
on
#nlphighlights
about my paper with
@chrmanning
on finding syntax in word representations. I'm very grateful to have had this opportunity to talk (at length!) about my work!
#nlphighlights
88: John Hewitt
@johnhewtt
talks about probing word embeddings for syntax by projecting to a vector space where the L2 distance between a pair of tokens approximates the number of hops between them in the dependency tree.
This is a nice paper:
On the (un)reliability of feature visualizations
[Geirhos et al]
Shows that vision model feature visualizations don't pass some sniff checks -- they can show plausible things unrelated to behavior on real inputs.
We split the problem of extrapolation to lengths not seen at train time in NNs into 1. what content to generate? 2. where to put EOS? Give up on 2 and NNs learn very different dynamics; better at 1!
BlackBoxNLP
Ben Newman, me,
@percyliang
@chrmanning
LMs make low-rank distributions (hidden dim < vocab_size) -> unavoidable errors! But samples are great if you use nucleus/top-k sampling 🤔.
Matt: truncation sampling can fix low-rank errors, AND we can use the low-rank basis to find good tokens below the truncation threshold!
Nucleus and top-k sampling are ubiquitous, but why do they work?
@johnhewtt
,
@alkoller
,
@swabhz
,
@Ashish_S_AI
and I explain the theory and give a new method to address model errors at their source (the softmax bottleneck)!
📄
🧑💻
Congratulations to Ben Newman, who spearheaded the work, for winning Outstanding Paper at
#BlackBoxNLP
, and thanks to the organizers and reviewers for your efforts! Congrats as well to the winners of the other Outstanding Paper award!
@chrmanning
Key idea: Vector spaces have distance metrics (L2); trees do too (# edges between words). Vector spaces have norms (L2); rooted trees do too (# edges between word and ROOT.) Our probe finds a vector distance/norm on word representations that matches all tree distances/norms 2/4
This claim, that parse trees are embedded through distances and norms on your word representation space, is a structural claim about the word representation space, like how vector offsets encode word analogies in word2vec/GloVe. We hope people have fun exploring this more! 4/4
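A minimal sketch of the distance half of this idea (function name and shapes are my own, not the paper's code release): learn one linear map B, and read pairwise squared L2 distances between transformed word vectors as approximate parse-tree distances.

```python
import torch

# Sketch of a structural-probe-style distance (hypothetical shapes):
# a single linear map B, then squared L2 distances between transformed
# word vectors approximate parse-tree distances (edge counts).
def probe_distances(H: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # H: (seq_len, dim) word representations; B: (rank, dim) probe matrix
    T = H @ B.T                              # (seq_len, rank) transformed vectors
    diff = T.unsqueeze(1) - T.unsqueeze(0)   # all pairwise differences
    return (diff ** 2).sum(-1)               # (seq_len, seq_len) squared distances
```

Training would then fit B so these distances match gold tree distances (number of edges between word pairs) over a treebank.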
My favorite deeper dive experiment in this paper: we wondered if putting the question _before_ the documents would remove the U-shaped effect, since the autoregressive contextualization would "know" what info to look for when processing each doc. Nope! The trend still holds.
It’s
#naacl2021
and one of the best parts of a conf is meeting new people. If you’d like to chat
#nlproc
, and especially if you couldn’t make it to the conference, email me to chat for 30 min! I can talk research, admissions, grad school++. email on my website!
We’re coming to the end of
#cs224n
and it’s so good to see students excitedly discussing the results of their work at the end of the quarter. I’m grateful to our 28 TAs for making the course work.
Ruth-Ann's great work building a Jamaican Patois Natural Language Inference dataset was picked up by Vox as part of its video "Why AI doesn’t speak every language." Happy to see Ruth-Ann's work (and disparities in NLP across languages) get this general audience coverage.
Check out this Vox video I was featured in where I chat about JamPatoisNLI which I worked on with
@chrmanning
and
@johnhewtt
! Many thanks to
@PhilEdwardsInc
for platforming our work
I'm also deeply committed to how open research dovetails with open teaching. I've twice co-taught Stanford's CS 224n: Natural Language Processing with Deep Learning; you can find some of my lectures here and here !
Excited to give this talk! Tidbits:
1) Could finite-precision RNNs implement (bounded) stacks without access to an external stack? Yes, efficiently!
2) We train probabilistic models in NLP but prove things about acceptors; what if we connect language models to formal languages?
Join John Hewitt
@johnhewtt
, computer science PhD student at
@Stanford
, for his talk on November 12 at 11am entitled "The Unreasonable Syntactic Expressivity of RNNs." Details can be found here:
These distances/norms reconstruct each tree, and are parametrized only by a single linear transformation. What does this mean? In BERT, ELMo, we find syntax trees approximately embedded as a global property of the transformed vector space. (But not in baselines!) 3/4
Backpacks are an alternative to Transformers: intended to scale in expressivity, yet provide a new kind of interface for interpretability-through-control.
A backpack learns k non-contextual sense vectors per subword, unsupervisedly decomposing the subword's predictive uses.
Just a few minutes out! Come attend or watch the livestream, or reach out to me afterward if you couldn’t attend but would like to chat about the topic!
This work, with
@mhahn29
,
@SuryaGanguli
,
@percyliang
,
@chrmanning
, has been a fascinating and challenging new direction for me, and I'm deeply appreciative to them for enabling me to pursue it.
Construction code:
Learning code:
My work has discovered structure in language models - through the structural probe (), refined probing methods (), and formalizing how models construct usable information about the solutions to hard problems ().
To represent a word in context, Backpacks use information from the whole context to non-negatively weight the senses of all subwords in the context.
So, the contribution of each sense is always towards predicting the same words; only the magnitude changes.
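A toy sketch of that prediction rule (shapes, names, and the single-position simplification are my assumptions, not the released architecture): the output for a position is a non-negative weighted sum over every sense vector in the context, so each sense always pushes logits in one fixed direction.

```python
import torch

# Sketch of the Backpack prediction rule (hypothetical shapes): each
# context subword carries k non-contextual sense vectors; a position's
# output is a non-negative contextual weighting of all of them.
def backpack_logits(senses: torch.Tensor, alphas: torch.Tensor,
                    E: torch.Tensor) -> torch.Tensor:
    # senses: (seq, k, dim) sense vectors for each context subword
    # alphas: (seq, k) non-negative weights for the position being predicted
    # E: (vocab, dim) output embedding matrix
    h = (alphas.unsqueeze(-1) * senses).sum(dim=(0, 1))  # weighted sense sum
    return h @ E.T                                       # (vocab,) logits
```

Because the logits are linear in the weights, scaling a sense's weight scales its contribution to every log-prediction predictably — the property the thread highlights.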
Does Multilingual BERT share syntactic knowledge cross-lingually? In
#acl2020nlp
paper w/
@johnhewtt
and
@chrmanning
, we visualize its syntactic structure & show it's applicable to a variety of human languages.
Paper:
Blog: (1/4)
I’m most interested in in-depth lectures or technical explainers, less interested in surface-level introductions. I’m also focused on (arguably) newer topics, since in these cases, I think personal opinions on the topics tend to come through stronger in pedagogical materials.
In my blog post, I argue that probing is a clear tool to characterize knowledge in neural networks when we didn't tell the network how to represent that knowledge.
The code should be very useful for probing studies!
By instead initializing the new embedding to the average of existing embeddings, you guarantee that the partition function of the softmax grows by at most 1/n, where n is the initial vocab size -- so the distrib doesn't change much!
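The fix itself is a one-liner; here's a sketch (function name is mine) of growing an embedding matrix this way:

```python
import torch

# Sketch of the averaging initialization: new rows of the embedding
# matrix start at the mean of the existing rows, so a new token's logit
# stays close to the model's average logit instead of landing near 0.
def grow_embedding_by_averaging(E: torch.Tensor, num_new: int) -> torch.Tensor:
    # E: (vocab, dim) embedding matrix
    mean_row = E.mean(dim=0, keepdim=True)              # (1, dim)
    return torch.cat([E, mean_row.expand(num_new, -1)], dim=0)
```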
Further, we can and must design LMs for our understanding, not just performance: I introduced the Backpack, an architecture that brings many of the control and understanding benefits of linear models and word2vec with the power of the Transformer. ()
I think there's a space of interesting work (and future work) around initializing new word embeddings (e.g., for domain adaptation) using more information -- about orthography, about the finetuning distribution, etc.; averaging will be a baseline to beat.
e.g.,
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
So, if you add a new word, since you randomly init the embedding, it gets dot product ~0 with hidden states. Softmax([-4,-4,..., 0]) puts mass on the elt with 0!
We modeled derivational morphological transformations separately as orthographic and distributional functions, then combined: go see
@_danieldeutsch
present our paper on English derivational morphology in oral session 6D today at ACL!
For one example, we observe that certain aspects of gender bias in career nouns (e.g., nurse, CEO) are represented by a Backpack in a particular sense vector (pointing towards, e.g., “he”, “she”.) By “turning down” this sense, this aspect of gender bias is reduced.
This can harm finetuning. I also show that a simple, popular heuristic -- just averaging all existing embeddings -- guarantees that adding new words doesn't deviate much from the pretrained LM, solving this problem.
Our results are about what's possible, not what's learned. But a drop of empirical results: while RNNs don't learn Dyck-k in practice (), they can learn Dyck-(k,m) well, even with a vanishingly small fraction of the possible stack states seen at training!
We analyze a few truncation-sampling algorithms, and find that our eta-sampling leads to more plausible long English documents, breaks out of repetition better, and more reasonably truncates low-entropy distributions.
With
@chrmanning
,
@percyliang
Blog:
Our proof is constructive, exactly specifying weights of 1-layer RNNs (and a separate mechanism for the LSTM using just gates) that allow RNNs to push/pop from internal stack and create probability distributions over the next token, encoding what's possible in Dyck-(k,m) strings.
To make this concrete, k: vocab size. m: max nesting depth. Let's say vocab size is 100k, and max nesting depth is 3. (Empirically, 3 is not a bad approx. of human language.)
Then before: approx 10^20 hidden units needed (give or take a few powers). We prove 150 units suffices.
I was fascinated at the emergent structure of sense vectors, and I’m really excited to see what LM interpretability research the Backpack enables.
We can design architectures that scale and learn to do some of the interpretability work for us.
This work, led by
@ruthstrong_
, provides a great new language resource in Jamaican Patois, and studies transfer in multilingual and monolingual LMs! One opportunity: studying how model predictions change as a sentence moves closer to or farther from the high-resource English.
JamPatoisNLI provides a dataset and examines how well you can do transfer to low-resource creoles like Jamaican Patois, versus other recent results for low-resource NLI. By
@ruthstrong_
@johnhewtt
@chrmanning
. At Multilingual Rep’n Workshop.
#emnlp2022
@Teknium1
Right; it's pretty rare. Most models don't have this issue. You can see for yourself by loading up gemma-2b:
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
bc max logit << 0, any new token would dominate in probability
On Dec 7, I'll be presenting
Truncation Sampling as Language Model Desmoothing
At the GEM workshop! One practical takeaway: word-level truncation decisions (from top-p or eta-sampling) can be unintuitive. A colab in which you can try these yourself:
@yoavgo
re: other languages: I expect so : ) ; we'll see. Some things in the works; lots of follow-up work to be done (hopefully by many people!)
re: syntax reps and head choices -- I'd love to hear more about that! Which representations (UD/SD/?) do ELMo/BERT match best? etc.
Human languages exhibit hierarchical structure, but don't require infinite memory to parse. We study a formal language that reflects this. Bound the memory requirements of Dyck-k (well-nested brackets of k types): you get Dyck-(k,m) (at most m brackets in memory at any time.)
A simple communication complexity argument proves that O(m log k) hidden units is optimal -- even with unbounded computation (!!), it's impossible to use asymptotically fewer. That is, RNNs are fascinatingly well-suited (imo) to handling bounded-memory hierarchy.
RNNs can't generate Dyck-k because they have finite memory (with finite precision.) They can generate Dyck-(k,m) because it's a regular language. BUT known constructions would require k^m or so hidden units (exponentially many DFA states!) We prove that O(m log k) suffices!
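The memory arithmetic behind the bound is simple to check — a stack holding at most m brackets, each one of k types, fits in m·⌈log2 k⌉ bits (this back-of-envelope helper is mine, not from the paper):

```python
import math

# Back-of-envelope for the O(m log k) bound: a bounded stack of at most
# m brackets drawn from k types needs only m * ceil(log2 k) bits --
# versus the naive k^m enumeration of DFA states.
def stack_bits(k: int, m: int) -> int:
    return m * math.ceil(math.log2(k))
```

With the thread's running example (k = 100,000 bracket types, nesting depth m = 3), this gives 3 · 17 = 51 bits — the same ballpark as the ~150 units the construction uses, versus ~10^15 states for the naive DFA.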
Rich decomposition of token meaning makes sense vectors an interesting target for interventions, and identical (in direction) contribution of token senses to log-predictions means that we can know how our interventions will affect any log-prediction of the model.
Intuitively, in early-stopped neural LMs optimizing KL, there's good reason to put "a bit of probability mass everywhere", to hedge and avoid very high loss. This smoothing is good for scoring, like in n-gram models, but bad for generation, since mass is on non-language strings.
Scott Aaronson's note is a delightful introduction to reasoning about large numbers, leading up to the Busy Beaver numbers. Years after finding that article, what fun to find Busy Beaver numbers in proofs on RNNs!
@chrmanning
@MycalTucker
@boknilev
@stanfordnlp
@ethayarajh
@percyliang
Adding to this: 𝒱-information can be 'constructed' by deterministic transformation (no data processing inequality) and 𝒱, a set of functions (linear or otherwise) is effectively a hypothesis as to how the property is encoded. Also cool: baseline can be any r.v., not just input!
Interpretability is difficult to evaluate, and often the problems of a method are best shown by empirically demonstrating that certain basic intuitions (e.g., "images that maximally activate a feature relate to the behavior of that feature on real data") don't necessarily hold.
@ssshanest
@hawkrobe
Thanks for the question! Short answer: no. Long answer: no, but only because the paper was already 32 pages long (incl. appendix) and focused on theory; I think it's a fascinating Q. My Q: do LSTMs learn something more like our RNN construction or our LSTM construction? Neither?
@nsaphra
@chrmanning
We found positive results on ELMo similar to BERT (Table 1); for unidirectional models (RNN and transformer alike) it's still unclear (unpublished expts).
Intuitively, this happens when the dot product of the new word embeddings with the neural representation of the prefix (zero-ish) is a large value relative to the other unnormalized probabilities (logits) of the language model -- seems to happen "by accident" sometimes.
These new words won't be useful until finetuning, but I was surprised to see that for some models, adding the words actually causes the LM to put all its probability on only the new words.
Our model of this leads to two principles for truncation to desmooth -- never truncate high-probability words, and never truncate words that are relatively high-probability for that particular distribution. Surprisingly, top-p breaks the first principle.
@yoavgo
Finally, on the rank experiments -- our intuitions are similar. English parse tree distance matrices are full-rank-ish in the length of the sentence. Unpublished: training the struct probe on only shorter sentences still leads to a rank-64 probe. Maybe a fun connection with l1-embeddability?
@ssshanest
@hawkrobe
Because we provide PyTorch implementations of the constructions (and code to train models), these analyses are possible to do right away! I'd love to see them done, but they're not my immediate todo. Would be happy to chat with anyone interested.
Truncation sampling algorithms restrict the neural LM support (words with non-zero probability.) We interpret this as approximating the true distribution support and ask, words of what probability are likely to be in that support?
@yoavgo
Totally agreed about controls for other trees and the extent to which they're encoded. Thinking about what the right kind of spurious trees that aren't easily recoverable from simple transform of linear structure + parse structure. Random needs some consistency to the test set.
@percyliang
We claim that good probes are "selective," achieving high accuracy on linguistic tasks, and low acc on control tasks. Between probes, small gains in linguistic acc can correspond to big selectivity losses; gains may be from added probe capacity, not repr properties.
So what if BERT only captured the ambiguous POS cases? It would explain _less_ about POS than does the word identity. Probing would indicate this with a negative result.
But really, in this case BERT captures an important part of POS! So, how do we measure it?
Further, we show that conditional probing actually can estimate the conditional _usable_ information from the representation to the property, after conditioning on the baseline! I(BERT->POS | word identity) So it's not just an intuitive tool; it has a clear interpretation.
Wondering under what circumstances visual signal is useful in translation? Feeling a desire for multimodal, multilingual NLP? Use our dataset of images representing words across 100 languages, and check out our poster in Session 3E with Daphne Ippolito
This was true of a bracket closing task and SCAN length extrapolation splits. On MT (De-En): no conclusive results. Bonus: can sometimes train a post-hoc model to predict EOS after the fact. A promising direction??
Congrats to Ben on spearheading a great analysis paper.
Our conditional probing method does this, and it's as simple as training probe 1 on the baseline, and probe 2 on the concatenation of baseline and representation! Intuitively, now we measure just what "extra" the representation provides, not what it lacks.
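Concretely, the two-probe setup is just this (dimensions and names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

# Sketch of conditional probing: probe1 sees only baseline features;
# probe2 sees [baseline; representation]. Whatever probe2 gains over
# probe1 is credited to the representation beyond the baseline.
dim_base, dim_repr, num_labels = 8, 16, 5
probe1 = nn.Linear(dim_base, num_labels)
probe2 = nn.Linear(dim_base + dim_repr, num_labels)

def probe2_input(baseline: torch.Tensor, repr_: torch.Tensor) -> torch.Tensor:
    # concatenate baseline and representation features per token
    return torch.cat([baseline, repr_], dim=-1)
```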
@esalesk
So, some thoughts: (1) averaging fewer words might be better in terms of having more semantically meaningful inputs, but (2) averaging more words gives you a closer approximation to the original LM. Could be a fun tradeoff to explore.
Our eta-sampling both (1) never truncates a word that's higher-prob than a hyperparam threshold, and (2) never truncates a word higher-prob than an entropy-dependent threshold. Top-p, in contrast, truncates high-prob words (e.g., only allowing "is" for the prefix "My name")
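A sketch of the resulting keep-mask (I'm assuming the threshold takes the form eta = min(eps, sqrt(eps)·exp(-entropy)); see the paper for the exact rule):

```python
import torch

# Sketch of an eta-sampling keep-mask (assumed threshold form):
# keep tokens whose probability exceeds
#   eta = min(eps, sqrt(eps) * exp(-entropy)).
# High-entropy distributions lower eta, so more tokens survive.
def eta_keep_mask(probs: torch.Tensor, epsilon: float = 2e-3) -> torch.Tensor:
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    eta = min(epsilon, epsilon ** 0.5 * torch.exp(-entropy).item())
    return probs > eta
```

Note how both principles show up: a token above epsilon is never truncated, and the entropy term relaxes the cutoff further for flat distributions.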
If we’ve chatted like this at a previous conference, remind me when you reach out! I’d love to know how things have gone, and what advice or opportunities turned out to be useful, in part so I can improve how useful I can be to folks in the future.
This work with fellow Stanford NLP PhD
@ethayarajh
, and my advisors
@percyliang
and
@chrmanning
; my thanks to them for enduring a long and winding process with this paper, e.g., reformulating 𝒱-information to capture conditional information quantities (see Appendix!)
@gattardi
@chrmanning
I'll plug the code while I'm at it -- pre-trained probe for BERT-large layer 16 (and easy interface to BERT) makes these quick questions easy to ask!
Lots of hyperparameters when designing probes, and probing results conflate representation, probe, and data, making interpretation difficult.
A control task can help design, and help interpret.
code:
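The construction of a control task can be sketched in a few lines (my illustration of the idea): assign each word type a fixed random label, so the only way a probe can succeed is by memorizing word identity.

```python
import random

# Sketch of a control task: each word type gets a fixed random label,
# so control-task accuracy reflects probe capacity (memorization),
# not linguistic structure in the representation.
def make_control_task(vocab, num_labels: int, seed: int = 0):
    rng = random.Random(seed)
    return {w: rng.randrange(num_labels) for w in vocab}

# Selectivity = linguistic-task accuracy minus control-task accuracy.
```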