John Hewitt Profile Banner
John Hewitt Profile
John Hewitt

@johnhewtt

4,258
Followers
22
Following
28
Media
155
Statuses

Incoming faculty @columbia CS, fall 2025. PhD @stanford . Understanding and improving neural learning from language. Co-taught CS 224n.

Stanford, CA
Joined February 2015
Pinned Tweet
@johnhewtt
John Hewitt
21 days
I’m joining the Columbia Computer Science faculty as an assistant professor in fall 2025, and hiring my first students this upcoming cycle!! There’s so much to understand and improve in neural systems that learn from language — come tackle this with me!
Tweet media one
129
51
899
@johnhewtt
John Hewitt
5 years
Does my unsupervised neural network learn syntax? In new #NAACL2019 paper with @chrmanning , our "structural probe" can show that your word representations embed entire parse trees. paper: blog: code: 1/4
Tweet media one
9
261
833
@johnhewtt
John Hewitt
1 year
For this year's CS 224n: Natural Language Processing with Deep Learning, I've written notes on our Self-Attention and Transformers lecture. Topics: Problems with RNNs, then self-attention, then a 'minimal' self-attention architecture, then Transformers.
Tweet media one
6
155
786
@johnhewtt
John Hewitt
1 year
#acl2023 ! To understand language models, we must know how activation interventions affect predictions for any prefix. Hard for Transformers. Enter: the Backpack. Predictions are a weighted sum of non-contextual word vectors. -> predictable interventions!
7
106
416
@johnhewtt
John Hewitt
8 months
I'm on the faculty market! My goal is to build language systems that we understand deeply through discovery and by design, so we can precisely control them and treat their failures. Let's tackle this grand challenge of science and engineering together.
7
73
411
@johnhewtt
John Hewitt
4 years
#emnlp2020 paper: we give some theoretical insight into the syntactic success of RNN LMs: we prove they can implement bounded-size stacks in their states to generate some bounded hierarchical langs with optimal memory! paper blog
Tweet media one
4
62
333
@johnhewtt
John Hewitt
1 year
Our paper on Backpacks has won an Outstanding Paper Award at ACL 2023! If you're excited about both fascinating learned structure in language models, and designing architectures to enable interpretability while maintaining expressivity, take a read!
Tweet media one
@stanfordnlp
Stanford NLP Group
1 year
Our papers of #ACL2023NLP : Backpack Language Models @johnhewtt , @jwthickstun , @chrmanning , @percyliang Mon July 10, poster 14:00-15:30, Frontenac Ballroom and Queen’s Quay
Tweet media one
0
19
76
5
37
270
@johnhewtt
John Hewitt
7 months
It’s conference time! Come say hello at EMNLP to hear my hot takes on understanding LMs. Is your CS department hiring? Hey, nice, come talk to me! Do you know few people at EMNLP? Not for long; come talk to me! Here’s what I look like at a poster session when the lights go out.
Tweet media one
6
15
235
@johnhewtt
John Hewitt
2 years
This winter, I’ll be helping @chrmanning teach NLP with Deep Learning (CS224n). Every year, we attempt to update the course to best teach our students. For this, I am learning from how others teach topics in NLP. Please share your favorite technical explanation of an NLP topic!
9
17
223
@johnhewtt
John Hewitt
3 years
Ever added new words to the vocabulary of your language model only to generate from it and have it generate gibberish? In a technical blog post I detail why this happens, and that representing new words as an average of existing words solves the problem.
Tweet media one
6
50
217
@johnhewtt
John Hewitt
2 years
We characterize and improve on language model _truncation sampling_ algorithms, like top-p and top-k. We frame them as trying to recover the true distribution support from an implicitly _smoothed_ neural LM, and provide a better sampling algo! Paper
Tweet media one
6
34
164
@johnhewtt
John Hewitt
6 years
Learned a lot about LSTM behavior -- in very different ways -- from two excellent @acl2018 papers: Sharp Nearby, Fuzzy Far Away... by @ukhndlwl , He He, Peng Qi, and @jurafsky , and LSTM as Dynamically Computed... by @omerlevy_ , @kentonctlee , @nfitz , @lukezettlemoyer .
0
28
141
@johnhewtt
John Hewitt
4 months
If you're adding new tokens to Gemma, you're likely running into the "all logits are negative, so a randomly initialized embedding with a logit of ~0 dominates the softmax" problem! Averaging existing embeddings solves this by bounding the KL from the initial model. See:
@Teknium1
Teknium (e/λ)
4 months
Gemma can't handle training with added tokens... maybe you were right @Mascobot - we ain't getting chatml yet lol
7
5
67
4
15
122
@johnhewtt
John Hewitt
10 months
Teaching CS224N (twice now!) with @chrmanning has been one of the most rewarding parts of my PhD, not least because the notes and videos are public. Lots of exciting new lectures (RLHF, generation,++) here, as well as refined Transformers and pretraining lectures!
@stanfordnlp
Stanford NLP Group
10 months
A 2023 update of the CS224N Natural Language Processing with Deep Learning YouTube playlist is now available with new lectures on pretrained models, prompting, RLHF, natural language and code generation, linguistics, interpretability and more. #NLProc
Tweet media one
8
283
1K
0
10
107
@johnhewtt
John Hewitt
1 year
I'll be at ACL2023! If you're there and don't know anyone, come say hi! (Or let your students know I'm happy to chat!) I'll be presenting Backpack Language Models and helping give a tutorial on Generating Text from Language Models!
5
3
105
@johnhewtt
John Hewitt
4 years
It's #acl2020nlp and one of the best parts of a conf is meeting new people. If you'd like to chat #nlproc , and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
2
7
97
@johnhewtt
John Hewitt
5 years
How do we design probes that give us insight into a representation? In #emnlp2019 paper with @percyliang , our "control tasks" help us understand the capacity of a probe to make decisions unmotivated by the repr. paper: blog:
Tweet media one
1
23
98
@johnhewtt
John Hewitt
7 months
Guess what: it’s STILL conference time, this time NeurIPS! Just got in; everything in this tweet holds true, come talk to me!
@johnhewtt
John Hewitt
7 months
It’s conference time! Come say hello at EMNLP to hear my hot takes on understanding LMs. Is your CS department hiring? Hey, nice, come talk to me! Do you know few people at EMNLP? Not for long; come talk to me! Here’s what I look like at a poster session when the lights go out.
Tweet media one
6
15
235
2
3
87
@johnhewtt
John Hewitt
3 years
How do I 'probe' a representation for just the aspects of a property (like part-of-speech) that aren't captured by a baseline (like word identity?) In #emnlp2021 paper, we propose conditional probing, which does this! paper: blog:
3
10
82
@johnhewtt
John Hewitt
4 years
Very thankful for the chance to give this talk! Students interested in understanding neural representations of language, I’d love if you came and gave your thoughts and perspectives on this ongoing work on the probing methodology.
@NLPwithFriends
NLP with Friends
4 years
We are very excited to announce our next speaker!! 🗣John Hewitt @johnhewtt talking with us about ❓"Language Probes as V-information Estimators" 🗓Sept 9th, 14:00 UTC 📝Sign up here:
1
22
85
1
11
77
@johnhewtt
John Hewitt
4 years
It's #emnlp2020 and one of the best parts of a conf is meeting new people. If you'd like to chat #nlproc , and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
0
0
74
@johnhewtt
John Hewitt
5 years
So a lot of people have arrived here; please read @nsaphra 's excellent take on neural net probes and @nelsonfliu 's comprehensive neural net probing study, both also at #naacl2019 Saphra: Liu:
@nsaphra
Naomi Saphra
5 years
I'm still prepping the camera-ready for my @naacl paper, but if people take away one thing, I want it to be that they should be specific in what they mean when they say a representation "encodes" some linguistic property, and to recognize the drawbacks of their definition.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
10
86
2
8
47
@johnhewtt
John Hewitt
7 months
Come see this panel I'll speak on! There's so much to understand about language models that it's a good thing we have multiple rich subcommunities with differing perspectives and expertise -- this panel will facilitate sharing ideas and refining goals.
@BlackboxNLP
BlackboxNLP
7 months
BlackboxNLP will this year feature a panel discussion on "Mechanistic Interpretability". We hope this panel may serve as a way of creating stronger bridges between interpretability in NLP and MI! We are now collecting questions for the discussion here:
2
8
44
0
4
47
@johnhewtt
John Hewitt
3 years
In analysis of neural nets, there’s no single right way to “probe” the neural net’s representations. In this opinion piece, we draw from neuroscience to enumerate a few distinct goals of probing and how each guides the design of the probe.
@neuranna
Anna Ivanova
3 years
Check out our short opinion piece where we draw parallels between investigating brains and neural nets! "Probing artificial neural networks: insights from neuroscience" Written with @NogaZaslavsky and @JohnHewtt for the #brain2AI #ICLR2021 workshop. 1/
2
26
89
1
6
46
@johnhewtt
John Hewitt
5 years
I'll be excitedly yammering about structural probes and finding syntax in unsupervised representations today at 4:15 in Nicollet B/C #naacl2019 . Even if you don't ❤️ parse trees, come by to learn a method to tell if your neural network softly encodes tree structures!
0
4
39
@johnhewtt
John Hewitt
3 years
I'm so glad this content is now freely available. As head TA this past year, I had the privilege of writing and giving 3 lectures: on self-attention & Transformers, pretraining, and model analysis & explanation. I hope many find them useful in their studies!
@stanfordnlp
Stanford NLP Group
3 years
Looking for a series to binge-watch with more depth? We are delighted to make available the latest CS224N: Natural Language Processing with Deep Learning. New content on transformers, pre-trained models, NLG, knowledge, and ethical considerations. #NLProc
Tweet media one
1
42
183
0
3
39
@johnhewtt
John Hewitt
5 years
I enjoyed chatting with @waleed_ammar and @nlpmattg on #nlphighlights about my paper with @chrmanning on finding syntax in word representations. I'm very grateful to have had this opportunity to talk (at length!) about my work!
@waleed_ammar
Waleed Ammar
5 years
#nlphighlights 88: John Hewitt @johnhewtt talks about probing word embeddings for syntax by projecting to a vector space where the L2 distance between a pair of tokens approximates the number of hops between them in the dependency tree.
4
20
75
0
4
39
@johnhewtt
John Hewitt
1 year
This is a nice paper: On the (un)reliability of feature visualizations [Geirhos et al] Shows that vision model feature visualizations don't pass some sniff checks -- they can show plausible things unrelated to behavior on real inputs.
2
7
39
@johnhewtt
John Hewitt
5 years
I'm giving a talk on designing and interpreting probing methods for understanding neural representations at EMNLP, Hall 2C, today at 1:30!
1
0
37
@johnhewtt
John Hewitt
4 years
We split the problem of extrapolation to lengths not seen at train time in NNs into 1. what content to generate? 2. where to put EOS? Give up on 2 and NNs learn very different dynamics; better at 1! BlackBoxNLP Ben Newman, me, @percyliang @chrmanning
Tweet media one
1
8
37
@johnhewtt
John Hewitt
9 months
LMs make low-rank distributions (hidden dim < vocab_size) -> unavoidable errors! But samples are great if you use nucleus/top-k sampling 🤔. Matt: truncation sampling can fix low-rank errors, AND we can use the low-rank basis to find good tokens below the truncation threshold!
Tweet media one
@mattf1n
Matthew Finlayson
9 months
Nucleus and top-k sampling are ubiquitous, but why do they work? @johnhewtt , @alkoller , @swabhz , @Ashish_S_AI and I explain the theory and give a new method to address model errors at their source (the softmax bottleneck)! 📄 🧑‍💻
Tweet media one
3
26
167
0
5
30
@johnhewtt
John Hewitt
4 years
Congratulations to Ben Newman, who spearheaded the work, for winning Outstanding Paper at #BlackBoxNLP , and thanks to the organizers and reviewers for your efforts! Congrats as well to the winners of the other Outstanding Paper award!
0
4
29
@johnhewtt
John Hewitt
5 years
@chrmanning Key idea: Vector spaces have distance metrics (L2); trees do too (# edges between words). Vector spaces have norms (L2); rooted trees do too (# edges between word and ROOT.) Our probe finds a vector distance/norm on word representations that matches all tree distances/norms 2/4
2
1
28
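A minimal sketch of the idea in the tweet above, assuming PyTorch; the class name, probe rank, and toy distance matrix are illustrative, not the released structural-probe code. The probe learns a single matrix B so that squared L2 distances between transformed word vectors approximate tree distances (number of edges between words), trained with a length-normalized L1 gap to the gold distances.

import torch

class StructuralProbeSketch(torch.nn.Module):
    """Learns a linear map B so that squared L2 distance between transformed
    word vectors approximates tree distance (number of edges between words)."""
    def __init__(self, model_dim, probe_rank):
        super().__init__()
        self.B = torch.nn.Parameter(torch.randn(model_dim, probe_rank) * 0.01)

    def forward(self, reps):                      # reps: (seq_len, model_dim)
        transformed = reps @ self.B               # (seq_len, probe_rank)
        diffs = transformed[:, None, :] - transformed[None, :, :]
        return (diffs ** 2).sum(-1)               # predicted squared distances

def probe_loss(pred_dists, tree_dists, seq_len):
    # L1 gap between predicted and gold tree distances,
    # normalized by the squared sentence length.
    return (pred_dists - tree_dists).abs().sum() / (seq_len ** 2)

# toy usage: random "word representations" and a gold tree-distance matrix
reps = torch.randn(5, 768)
tree_dists = torch.tensor([[0., 1, 2, 2, 3],
                           [1, 0, 1, 1, 2],
                           [2, 1, 0, 2, 3],
                           [2, 1, 2, 0, 1],
                           [3, 2, 3, 1, 0]])
probe = StructuralProbeSketch(model_dim=768, probe_rank=64)
loss = probe_loss(probe(reps), tree_dists, seq_len=5)
loss.backward()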
@johnhewtt
John Hewitt
5 years
This claim, that parse trees are embedded through distances and norms on your word representation space, is a structural claim about the word representation space, like how vector offsets encode word analogies in word2vec/GloVe. We hope people have fun exploring this more! 4/4
3
1
24
@johnhewtt
John Hewitt
1 year
My favorite deeper dive experiment in this paper: we wondered if putting the question _before_ the documents would remove the U-shaped effect, since the autoregressive contextualization would "know" what info to look for when processing each doc. Nope! The trend still holds.
1
1
23
@johnhewtt
John Hewitt
3 years
It’s #naacl2021 and one of the best parts of a conf is meeting new people. If you’d like to chat #nlproc , and especially if you couldn’t make it to the conference, email me to chat for 30 min! I can talk research, admissions, grad school++. email on my website!
1
2
20
@johnhewtt
John Hewitt
1 year
We’re coming to the end of #cs224n and it’s so good to see students excitedly discussing the results of their work at the end of the quarter. I’m grateful to our 28 TAs for making the course work.
@stanfordnlp
Stanford NLP Group
1 year
The #cs224n poster session is happening now! We are super excited about amazing, cutting-edge NLP posters from ~650 students!
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
10
56
0
1
18
@johnhewtt
John Hewitt
1 year
Ruth-Ann's great work building a Jamaican Patois Natural Language Inference dataset was picked up by Vox as part of its video "Why AI doesn’t speak every language." Happy to see Ruth-Ann's work (and disparities in NLP across languages) get this general audience coverage.
@ruthstrong_
Ruth-Ann Armstrong
1 year
Check out this Vox video I was featured in where I chat about JamPatoisNLI which I worked on with @chrmanning and @johnhewtt ! Many thanks to @PhilEdwardsInc for platforming our work
2
16
38
0
5
17
@johnhewtt
John Hewitt
8 months
I'm also deeply committed to how open research dovetails with open teaching. I've twice co-taught Stanford's CS 224n: Natural Language Processing with Deep Learning; you can find some of my lectures here and here !
0
0
18
@johnhewtt
John Hewitt
4 years
Excited to give this talk! Tidbits: 1) Could finite-precision RNNs implement (bounded) stacks without access to an external stack? Yes, efficiently! 2) We train probabilistic models in NLP but prove things about acceptors; what if we connect language models to formal languages?
@USC_ISI
USC ISI
4 years
Join John Hewitt @johnhewtt , computer science PhD student at @Stanford , for his talk on November 12 at 11am entitled "The Unreasonable Syntactic Expressivity of RNNs." Details can be found here:
0
0
8
0
1
16
@johnhewtt
John Hewitt
5 years
These distances/norms reconstruct each tree, and are parametrized only by a single linear transformation. What does this mean? In BERT, ELMo, we find syntax trees approximately embedded as a global property of the transformed vector space. (But not in baselines!) 3/4
1
2
15
@johnhewtt
John Hewitt
1 year
Backpacks are an alternative to Transformers: intended to scale in expressivity, yet provide a new kind of interface for interpretability-through-control. A backpack learns k non-contextual sense vectors per subword, unsupervisedly decomposing the subword's predictive uses.
Tweet media one
1
3
14
@johnhewtt
John Hewitt
4 years
Just a few minutes out! Come attend or watch the livestream, or reach out to me afterward if you couldn’t attend but would like to chat about the topic!
@NLPwithFriends
NLP with Friends
4 years
We are very excited to announce our next speaker!! 🗣John Hewitt @johnhewtt talking with us about ❓"Language Probes as V-information Estimators" 🗓Sept 9th, 14:00 UTC 📝Sign up here:
1
22
85
1
2
14
@johnhewtt
John Hewitt
4 years
This work, with @mhahn29 , @SuryaGanguli , @percyliang , @chrmanning , has been a fascinating and challenging new direction for me, and I'm deeply appreciative to them for enabling me to pursue it. Construction code: Learning code:
0
1
13
@johnhewtt
John Hewitt
8 months
My work has discovered structure in language models - through the structural probe (), refined probing methods (), and formalizing how models construct usable information about the solutions to hard problems ().
Tweet media one
3
0
13
@johnhewtt
John Hewitt
1 year
To represent a word in context, Backpacks use information from the whole context to non-negatively weight the senses of all subwords in the context. So, the contribution of each sense is always towards predicting the same words; only the magnitude changes.
Tweet media one
1
2
12
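A highly simplified sketch of that weighting scheme, assuming PyTorch; the shapes, the softmax over (position, sense) pairs, and the stand-in weight network are illustrative simplifications, not the Backpack paper's architecture. The point it shows: the weights are non-negative, and the logits are a weighted sum of non-contextual sense vectors, so each sense's direction of contribution is fixed and only its magnitude changes with context.

import torch

class BackpackSketch(torch.nn.Module):
    """Highly simplified sketch of the Backpack idea: each subword owns k
    non-contextual sense vectors, and a contextualization network produces
    non-negative weights that mix senses across the sequence. Logits are a
    weighted sum of non-contextual vectors, so the direction of each sense's
    contribution to the logits never depends on context."""
    def __init__(self, vocab_size, dim, k_senses):
        super().__init__()
        self.senses = torch.nn.Parameter(torch.randn(vocab_size, k_senses, dim) * 0.02)
        self.out_embed = torch.nn.Parameter(torch.randn(vocab_size, dim) * 0.02)
        # stand-in for the Transformer that computes the weights
        self.weight_net = torch.nn.Linear(dim, k_senses)

    def forward(self, token_ids):                        # (seq_len,)
        sense_vecs = self.senses[token_ids]              # (seq, k, dim)
        context = sense_vecs.mean(dim=1)                 # crude context summary
        # non-negative weights over every (position, sense) pair
        alphas = torch.softmax(self.weight_net(context).reshape(-1), dim=0)
        mixed = (alphas[:, None] * sense_vecs.reshape(-1, sense_vecs.size(-1))).sum(0)
        return self.out_embed @ mixed                    # next-token logits

model = BackpackSketch(vocab_size=100, dim=32, k_senses=4)
logits = model(torch.tensor([3, 17, 42]))                # (vocab_size,)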
@johnhewtt
John Hewitt
4 years
Exciting work at #acl2020nlp in characterizing cross-lingual syntactic structure in multilingual BERT! Congrats Ethan!
@ethanachi
Ethan Chi
4 years
Does Multilingual BERT share syntactic knowledge cross-lingually? In #acl2020nlp paper w/ @johnhewtt and @chrmanning , we visualize its syntactic structure & show it's applicable to a variety of human languages. Paper: Blog: (1/4)
Tweet media one
2
34
119
0
0
12
@johnhewtt
John Hewitt
2 years
I’m most interested in in-depth lectures or technical explainers, less interested in surface-level introductions. I’m also focused on (arguably) newer topics, since in these cases, I think personal opinions on the topics tend to come through stronger in pedagogical materials.
0
1
12
@johnhewtt
John Hewitt
3 years
In my blog post, I argue that probing is a clear tool to characterize knowledge in neural networks when we didn't tell the network how to represent that knowledge. The code should be very useful for probing studies!
1
2
11
@johnhewtt
John Hewitt
4 months
By instead initializing the new embedding to the average of existing embeddings, you guarantee that the partition function of the softmax grows by a factor of at most 1 + 1/n, where n is the initial vocab size, so the distribution doesn't change much!
0
0
10
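A minimal PyTorch sketch of that initialization, with illustrative sizes and names (not the blog post's code): each new row is the mean of the existing rows. Since exp is convex, exp(h·mean(e_i)) is at most the mean of the existing exponentiated logits, which is where the 1 + 1/n bound on the partition function comes from.

import torch

def add_tokens_with_average_init(embedding, num_new):
    """Extend an embedding matrix, initializing each new row to the mean of
    the existing rows (a sketch of the average-init heuristic above)."""
    old_weight = embedding.weight.data               # (n, d)
    mean_vec = old_weight.mean(dim=0, keepdim=True)  # (1, d)
    new_rows = mean_vec.repeat(num_new, 1)
    new_embedding = torch.nn.Embedding(old_weight.size(0) + num_new,
                                       old_weight.size(1))
    new_embedding.weight.data = torch.cat([old_weight, new_rows], dim=0)
    return new_embedding

# usage sketch: swap in the extended matrix for a model's (tied) embedding;
# for an untied output layer, apply the same trick to the LM head rows.
emb = torch.nn.Embedding(50000, 768)
emb = add_tokens_with_average_init(emb, num_new=2)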
@johnhewtt
John Hewitt
8 months
Further, we can and must design LMs for our understanding, not just performance: I introduced the Backpack, an architecture that combines many of the control and understanding benefits of linear models and word2vec with the power of the Transformer. ()
1
2
11
@johnhewtt
John Hewitt
3 years
I think there's a space of interesting work (and future work) around initializing new word embeddings (e.g., for domain adaptation) using more information -- about orthography, about the finetuning distribution, etc.; averaging will be a baseline to beat.
0
2
10
@johnhewtt
John Hewitt
1 year
I’m deeply thankful to my co-authors on this work, @jwthickstun @chrmanning @percyliang . ArXiv! Demos here! By Lora Xie. Huggingface!
0
0
10
@johnhewtt
John Hewitt
4 months
e.g.,
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
So, if you add a new word, since you randomly init the embedding, it gets dot product ~0 with hidden states. Softmax([-4, -4, ..., 0]) puts its mass on the element with logit 0!
1
0
9
@johnhewtt
John Hewitt
6 years
We modeled derivational morphological transformations separately as orthographic and distributional functions, then combined: go see @_danieldeutsch present our paper on English derivational morphology in oral session 6D today at ACL!
0
2
9
@johnhewtt
John Hewitt
1 year
For one example, we observe that certain aspects of gender bias in career nouns (e.g., nurse, CEO) are represented by a Backpack in a particular sense vector (pointing towards, e.g., “he”, “she”). By “turning down” this sense, this aspect of gender bias is reduced.
Tweet media one
1
0
9
@johnhewtt
John Hewitt
3 years
This can harm finetuning. I also show that a simple, popular heuristic -- just averaging all existing embeddings -- guarantees that adding new words doesn't deviate much from the pretrained LM, solving this problem.
Tweet media one
1
0
8
@johnhewtt
John Hewitt
4 years
Our results are about what's possible, not what's learned. But a drop of empirical results: while RNNs don't learn Dyck-k in practice (), they can learn Dyck-(k,m) well, even with a vanishingly small fraction of the possible stack states seen at training!
1
0
8
@johnhewtt
John Hewitt
2 years
We analyze a few truncation-sampling algorithms, and find that our eta-sampling leads to more plausible long English documents, breaks out of repetition better, and more reasonably truncates low-entropy distributions. With @chrmanning , @percyliang Blog:
Tweet media one
0
0
7
@johnhewtt
John Hewitt
4 years
Our proof is constructive, exactly specifying weights of 1-layer RNNs (and a separate mechanism for the LSTM using just gates) that allow RNNs to push/pop from internal stack and create probability distributions over the next token, encoding what's possible in Dyck-(k,m) strings.
1
0
7
@johnhewtt
John Hewitt
4 years
To make this concrete, k: vocab size. m: max nesting depth. Let's say vocab size is 100k, and max nesting depth is 3. (Empirically, 3 is not a bad approx. of human language.) Then before: approx 10^20 hidden units needed (give or take a few powers). We prove 150 units suffices.
1
0
7
@johnhewtt
John Hewitt
1 year
I was fascinated at the emergent structure of sense vectors, and I’m really excited to see what LM interpretability research the Backpack enables. We can design architectures that scale and learn to do some of the interpretability work for us.
1
0
7
@johnhewtt
John Hewitt
2 years
This work, led by @ruthstrong_ , provides a great new language resource in Jamaican Patois, and studies transfer in multilingual and monolingual LMs! One opportunity: studying how model predictions change as a sentence moves closer to or farther from the high-resource English.
@stanfordnlp
Stanford NLP Group
2 years
JamPatoisNLI provides a dataset and examines how well you can do transfer to low-resource creoles like Jamaican Patois, versus other recent results for low-resource NLI. By @ruthstrong_ @johnhewtt @chrmanning . At Multilingual Rep’n Workshop. #emnlp2022
Tweet media one
0
8
17
0
4
7
@johnhewtt
John Hewitt
4 months
@Teknium1 Right; it's pretty rare. Most models don't have this issue. You can see for yourself by loading up gemma-2b:
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
Because the max logit << 0, any new token would dominate in probability.
1
0
6
@johnhewtt
John Hewitt
2 years
I’ll be on this panel! Come say hello!
@aclmentorship
ACL Mentorship
2 years
Exciting update: we are opening up our upcoming mentoring session to the public. Use this link to join the webinar on Wednesday:
1
5
17
0
2
6
@johnhewtt
John Hewitt
2 years
On Dec 7, I'll be presenting Truncation Sampling as Language Model Desmoothing At the GEM workshop! One practical takeaway: word-level truncation decisions (from top-p or eta-sampling) can be unintuitive. A colab in which you can try these yourself:
Tweet media one
0
2
6
@johnhewtt
John Hewitt
5 years
@yoavgo re: other languages: I expect so : ) ; we'll see. Some things in the works; lots of follow-up work to be done (hopefully by many people!) re: syntax reps and head choices -- I'd love to hear more about that! Which representations (UD/SD/?) do ELMo/BERT match best? etc.
2
0
6
@johnhewtt
John Hewitt
4 years
Human languages exhibit hierarchical structure, but don't require infinite memory to parse. We study a formal language that reflects this. Bound the memory requirements of Dyck-k (well-nested brackets of k types): you get Dyck-(k,m) (at most m brackets in memory at any time.)
1
0
6
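A small sketch of the formal language described above, as a stack-based membership check (the token encoding is illustrative): a string is in Dyck-(k,m) iff brackets of k types nest properly and at most m are open at any time.

def is_dyck_km(tokens, k, m):
    """Check membership in Dyck-(k,m): well-nested strings over k bracket
    types where at most m brackets are open (unmatched) at any time.
    Brackets are encoded as ('open', i) / ('close', i) for type i < k."""
    stack = []
    for kind, i in tokens:
        if not (0 <= i < k):
            return False
        if kind == 'open':
            stack.append(i)
            if len(stack) > m:           # exceeds the bounded memory
                return False
        else:                            # 'close' must match the top of stack
            if not stack or stack.pop() != i:
                return False
    return len(stack) == 0               # everything must be closed

# e.g. "( [ ] )" with types 0='(' and 1='[' is in Dyck-(2, 2):
print(is_dyck_km([('open', 0), ('open', 1), ('close', 1), ('close', 0)],
                 k=2, m=2))              # True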
@johnhewtt
John Hewitt
4 years
A simple communication complexity argument proves that O(m log k) hidden units is optimal -- even with unbounded computation (!!), it's impossible to use asymptotically fewer. That is, RNNs are fascinatingly well-suited (imo) to handling bounded-memory hierarchy.
2
0
6
@johnhewtt
John Hewitt
4 years
RNNs can't generate Dyck-k because they have finite memory (with finite precision.) They can generate Dyck-(k,m) because it's a regular language. BUT known constructions would require k^m or so hidden units (exponentially many DFA states!) We prove that O(m log k) suffices!
1
0
6
@johnhewtt
John Hewitt
1 year
Rich decomposition of token meaning makes sense vectors an interesting target for interventions, and identical (in direction) contribution of token senses to log-predictions means that we can know how our interventions will affect any log-prediction of the model.
1
0
5
@johnhewtt
John Hewitt
2 years
Intuitively, in early-stopped neural LMs optimizing KL, there's good reason to put "a bit of probability mass everywhere", to hedge and avoid very high loss. This smoothing is good for scoring, like in n-gram models, but bad for generation, since mass is on non-language strings.
Tweet media one
1
0
5
@johnhewtt
John Hewitt
6 years
Scott Aaronson's note is a delightful introduction to reasoning about large numbers, leading up to the Busy Beaver numbers. Years after finding that article, what fun to find Busy Beaver numbers in proofs on RNNs!
0
2
5
@johnhewtt
John Hewitt
3 years
@chrmanning @MycalTucker @boknilev @stanfordnlp @ethayarajh @percyliang Adding to this: 𝒱-information can be 'constructed' by deterministic transformation (no data processing inequality) and 𝒱, a set of functions (linear or otherwise) is effectively a hypothesis as to how the property is encoded. Also cool: baseline can be any r.v., not just input!
2
0
5
@johnhewtt
John Hewitt
1 year
Interpretability is difficult to evaluate, and often the problems of a method are best shown by empirically demonstrating that certain basic intuitions (e.g., "images that maximally activate a feature relate to the behavior of that feature on real data") don't necessarily hold.
0
0
4
@johnhewtt
John Hewitt
4 years
@ssshanest @hawkrobe Thanks for the question! Short answer: no. Long answer: no, but only because the paper was already 32 pages long (incl. appendix) and focused on theory; I think it's a fascinating Q. My Q: do LSTMs learn something more like our RNN construction or our LSTM construction? Neither?
1
0
4
@johnhewtt
John Hewitt
3 years
@gchrupala @boknilev @jacobandreas @RTomMcCoy @wzuidema @neuranna @NogaZaslavsky @paulsoulos @tallinzen @ltorroba1 This risk of recovering the syntax tree only because it's correlated with the presence of particular words can be corrected for via our 'conditional probing'; for decoder studies it could make for an interesting comparison ()
1
0
4
@johnhewtt
John Hewitt
5 years
@nsaphra @chrmanning We found positive results on ELMo similar to BERT (Table 1); for unidirectional models (RNN and transformer alike) it's still unclear (unpublished expts).
0
0
3
@johnhewtt
John Hewitt
3 years
Intuitively, this happens when the dot product of the new word embeddings with the neural representation of the prefix (zero-ish) is a large value relative to the other unnormalized probabilities (logits) of the language model -- seems to happen "by accident" sometimes.
1
0
3
@johnhewtt
John Hewitt
3 years
These new words won't be useful until finetuning, but I was surprised to see that for some models, adding the words actually causes the LM to put all its probability on only the new words.
1
0
3
@johnhewtt
John Hewitt
2 years
Our model of this leads to two principles for truncation to desmooth -- never truncate high-probability words, and never truncate words that are relatively high-probability for that particular distribution. Surprisingly, top-p breaks the first principle.
1
0
3
@johnhewtt
John Hewitt
5 years
@yoavgo Finally, on the rank experiments -- our intuitions are similar. English parse tree distance matrices are full-rank-ish in the length of the sentence. Unpublished: training the structural probe on only shorter sentences still leads to a rank-64 probe. Maybe a fun connection with l1-embeddability?
1
0
3
@johnhewtt
John Hewitt
4 years
0
0
3
@johnhewtt
John Hewitt
4 years
@ssshanest @hawkrobe Because we provide PyTorch implementations of the constructions (and code to train models), these analyses are possible to do right away! I'd love to see them done, but they're not my immediate todo. Would be happy to chat with anyone interested.
0
0
3
@johnhewtt
John Hewitt
2 years
Truncation sampling algorithms restrict the neural LM support (words with non-zero probability.) We interpret this as approximating the true distribution support and ask, words of what probability are likely to be in that support?
Tweet media one
1
0
3
@johnhewtt
John Hewitt
5 years
@yoavgo Totally agreed about controls for other trees and the extent to which they're encoded. Thinking about what the right kind of spurious trees would be, ones that aren't easily recoverable from a simple transform of linear structure + parse structure. Random trees need some consistency with the test set.
2
0
3
@johnhewtt
John Hewitt
5 years
@percyliang We claim that good probes are "selective," achieving high accuracy on linguistic tasks, and low acc on control tasks. Between probes, small gains in linguistic acc can correspond to big selectivity losses; gains may be from added probe capacity, not repr properties.
1
0
2
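A toy sketch of the control-task idea, with illustrative numbers and names (not the paper's exact construction): each word type gets a random but fixed label, so only a probe that memorizes word identity can do well, and selectivity is the gap between linguistic and control accuracy.

import random

def make_control_task(word_types, num_labels, seed=0):
    """Sketch of a control task: assign each word TYPE a random but fixed
    label, so a probe can only succeed by memorizing word identity."""
    rng = random.Random(seed)
    return {w: rng.randrange(num_labels) for w in word_types}

def selectivity(linguistic_acc, control_acc):
    # A selective probe does well on the linguistic task (e.g. POS)
    # while doing poorly on the control task.
    return linguistic_acc - control_acc

control = make_control_task({"the", "cat", "sat"}, num_labels=45)
print(selectivity(linguistic_acc=0.97, control_acc=0.85))   # ~0.12 (toy numbers)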
@johnhewtt
John Hewitt
3 years
So what if BERT only captured the ambiguous POS cases? It would explain _less_ about POS than does the word identity. Probing would indicate this with a negative result. But really, in this case BERT captures an important part of POS! So, how do we measure it?
Tweet media one
1
0
2
@johnhewtt
John Hewitt
3 years
Further, we show that conditional probing actually can estimate the conditional _usable_ information from the representation to the property, after conditioning on the baseline! I(BERT->POS | word identity) So it's not just an intuitive tool; it has a clear interpretation.
1
0
2
@johnhewtt
John Hewitt
18 days
0
0
1
@johnhewtt
John Hewitt
6 years
Wondering under what circumstances visual signal is useful in translation? Feeling a desire for multimodal, multilingual NLP? Use our dataset of images representing words across 100 languages, and check out our poster in Session 3E with Daphne Ippolito
0
0
2
@johnhewtt
John Hewitt
18 days
@tallinzen Thank you, Tal! Looking forward to some NYC NLP socials!
0
0
2
@johnhewtt
John Hewitt
4 years
This was true of a bracket closing task and SCAN length extrapolation splits. On MT (De-En): no conclusive results. Bonus: can sometimes train a post-hoc model to predict EOS after the fact. A promising direction?? Congrats to Ben on spearheading a great analysis paper.
0
0
2
@johnhewtt
John Hewitt
3 years
Our conditional probing method does this, and it's as simple as training probe 1 on the baseline, and probe 2 on the concatenation of baseline and representation! Intuitively, now we measure just what "extra" the representation provides, not what it lacks.
Tweet media one
1
0
2
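A minimal sketch of that recipe, assuming PyTorch, with illustrative linear probes and a toy training loop (the paper frames this more carefully as conditional V-information): train one probe on the baseline alone, another on the concatenation of baseline and representation, and read the gap.

import torch

def conditional_probe_gain(baseline, reps, labels, num_classes, epochs=100):
    """Sketch of conditional probing: compare a probe trained on the baseline
    alone with a probe trained on [baseline; representation]. The drop in loss
    is the 'extra' the representation adds (illustrative linear probes and
    training loop; not the paper's exact setup)."""
    def train_probe(features):
        probe = torch.nn.Linear(features.size(-1), num_classes)
        opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
        for _ in range(epochs):
            loss = torch.nn.functional.cross_entropy(probe(features), labels)
            opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    loss_baseline = train_probe(baseline)
    loss_both = train_probe(torch.cat([baseline, reps], dim=-1))
    return loss_baseline - loss_both   # > 0: representation adds information

# toy usage: word-identity features as baseline, contextual vectors as reps
gain = conditional_probe_gain(torch.randn(200, 50), torch.randn(200, 768),
                              torch.randint(0, 17, (200,)), num_classes=17)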
@johnhewtt
John Hewitt
3 years
@esalesk So, some thoughts: (1) averaging fewer words might be better in terms of having more semantically meaningful inputs, but (2) averaging more words gives you a closer approximation to the original LM. Could be a fun tradeoff to explore.
1
0
2
@johnhewtt
John Hewitt
2 years
Our eta-sampling both (1) never truncates a word that's higher-prob than a hyperparam threshold, and (2) never truncates a word higher-prob than an entropy-dependent threshold. Top-p, in contrast, truncates high-prob words (e.g., only allowing "is" for the prefix "My name")
Tweet media one
1
0
2
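A minimal sketch of a truncation rule in this spirit, assuming the threshold takes the form eta = min(epsilon, sqrt(epsilon) * exp(-entropy)); check the paper for the exact formulation. The point is that the cutoff adapts to the entropy of the distribution, so high-probability words are never dropped.

import torch

def eta_sample_sketch(logits, epsilon=2e-4):
    """Sketch of eta-sampling-style truncation: keep tokens whose probability
    exceeds a threshold depending on both a fixed hyperparameter epsilon and
    the entropy of the distribution (epsilon value is illustrative)."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-10)).sum(dim=-1, keepdim=True)
    eta = torch.minimum(torch.tensor(epsilon),
                        (epsilon ** 0.5) * torch.exp(-entropy))
    keep = probs >= eta                       # never drops high-prob words
    truncated = torch.where(keep, probs, torch.zeros_like(probs))
    truncated = truncated / truncated.sum(dim=-1, keepdim=True)
    return torch.multinomial(truncated, num_samples=1)

# usage: sample the next token from a (batch of) logits vector(s)
next_token = eta_sample_sketch(torch.randn(1, 50000))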
@johnhewtt
John Hewitt
3 years
If we’ve chatted like this at a previous conference, remind me when you reach out! I’d love to know how things have gone, and what advice or opportunities turned out to be useful, in part so I can improve how useful I can be to folks in the future.
0
0
2
@johnhewtt
John Hewitt
3 years
This work with fellow Stanford NLP PhD @ethayarajh , and my advisors @percyliang and @chrmanning ; my thanks to them for enduring a long and winding process with this paper, e.g., reformulating 𝒱-information to capture conditional information quantities (see Appendix!)
0
0
2
@johnhewtt
John Hewitt
5 years
@gattardi @chrmanning I'll plug the code while I'm at it -- pre-trained probe for BERT-large layer 16 (and easy interface to BERT) makes these quick questions easy to ask!
Tweet media one
Tweet media two
1
0
2
@johnhewtt
John Hewitt
5 years
Lots of hyperparameters when designing probes, and probing results conflate representation, probe, and data, making interpretation difficult. A control task can help design, and help interpret. code:
0
0
2