I’m joining the Columbia Computer Science faculty as an assistant professor in fall 2025, and hiring my first students this upcoming cycle!!
There’s so much to understand and improve in neural systems that learn from language — come tackle this with me!
Does my unsupervised neural network learn syntax? In new
#NAACL2019
paper with
@chrmanning
, our "structural probe" can show that your word representations embed entire parse trees.
paper:
blog:
code:
1/4
For this year's CS 224n: Natural Language Processing with Deep Learning, I've written notes on our Self-Attention and Transformers lecture.
Topics: Problems with RNNs, then self-attention, then a 'minimal' self-attention architecture, then Transformers.
#acl2023
! To understand language models, we must know how activation interventions affect predictions for any prefix. Hard for Transformers.
Enter: the Backpack. Predictions are a weighted sum of non-contextual word vectors.
-> predictable interventions!
I'm on the faculty market!
My goal is to build language systems that we understand deeply through discovery and by design, so we can precisely control them and treat their failures.
Let's tackle this grand challenge of science and engineering together.
#emnlp2020
paper: we give some theoretical insight into the syntactic success of RNN LMs: we prove they can implement bounded-size stacks in their states to generate some bounded hierarchical langs with optimal memory!
paper
blog
Our paper on Backpacks has won an Outstanding Paper Award at ACL 2023!
If you're excited about both fascinating learned structure in language models, and designing architectures to enable interpretability while maintaining expressivity, take a read!
It’s conference time! Come say hello at EMNLP to hear my hot takes on understanding LMs
Is your CS department hiring? Hey nice come talk to me!
Do you know few people at EMNLP? Not for long; come talk to me!
Here’s what I look like at a poster session when the lights go out
This winter, I’ll be helping
@chrmanning
teach NLP with Deep Learning (CS224n). Every year, we attempt to update the course to best teach our students. For this, I am learning from how others teach topics in NLP. Please share your favorite technical explanation of an NLP topic!
Ever added new words to the vocabulary of your language model, only to generate from it and get gibberish?
In a technical blog post I detail why this happens, and that representing new words as an average of existing words solves the problem.
We characterize and improve on language model _truncation sampling_ algorithms, like top-p and top-k. We frame them as trying to recover the true distribution support from an implicitly _smoothed_ neural LM, and provide a better sampling algo!
Paper
If you're adding new tokens to Gemma, you're likely running into the "all logits are negative, so randomly init embedding with a logit of ~0 dominates the softmax" problem! Averaging existing embeddings solves this by bounding KL from initial model. See:
Teaching CS224N (twice now!) with
@chrmanning
has been one of the most rewarding parts of my PhD, not least because the notes and videos are public. Lots of exciting new lectures (RLHF, generation,++) here, as well as refined Transformers and pretraining lectures!
A 2023 update of the CS224N Natural Language Processing with Deep Learning YouTube playlist is now available with new lectures on pretrained models, prompting, RLHF, natural language and code generation, linguistics, interpretability and more.
#NLProc
I'll be at ACL2023! If you're there and don't know anyone, come say hi! (Or let your students know I'm happy to chat!)
I'll be presenting Backpack Language Models
and helping give a tutorial on Generating Text from Language Models!
It's
#acl2020nlp
and one of the best parts of a conf is meeting new people. If you'd like to chat
#nlproc
, and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
How do we design probes that give us insight into a representation? In
#emnlp2019
paper with
@percyliang
, our "control tasks" help us understand the capacity of a probe to make decisions unmotivated by the repr.
paper:
blog:
How do I 'probe' a representation for just the aspects of a property (like part-of-speech) that aren't captured by a baseline (like word identity)? In
#emnlp2021
paper, we propose conditional probing, which does this!
paper:
blog:
Very thankful for the chance to give this talk! Students interested in understanding neural representations of language, I’d love if you came and gave your thoughts and perspectives on this ongoing work on the probing methodology.
We are very excited to announce our next speaker!!
🗣John Hewitt
@johnhewtt
talking with us about
❓"Language Probes as V-information Estimators"
🗓Sept 9th, 14:00 UTC
📝Sign up here:
It's
#emnlp2020
and one of the best parts of a conf is meeting new people. If you'd like to chat
#nlproc
, and especially if you didn't have the money to sign up for the conference, email me to chat for 30min! I can talk research, admissions, grad school++. email on my website!
So a lot of people have arrived here; please read
@nsaphra
's excellent take on neural net probes and
@nelsonfliu
's comprehensive neural net probing study, both also at
#naacl2019
Saphra:
Liu:
I'm still prepping the camera-ready for my
@naacl
paper, but if people take away one thing, I want it to be that they should be specific in what they mean when they say a representation "encodes" some linguistic property, and to recognize the drawbacks of their definition.
Come see this panel I'll speak on! There's so much to understand about language models that it's a good thing we have multiple rich subcommunities with differing perspectives and expertise -- this panel will facilitate sharing ideas and refining goals.
BlackboxNLP will this year feature a panel discussion on "Mechanistic Interpretability". We hope this panel may serve as a way of creating stronger bridges between interpretability in NLP and MI!
We are now collecting questions for the discussion here:
In analysis of neural nets, there’s no single right way to “probe” the neural net’s representations. In this opinion piece, we draw from neuroscience to enumerate a few distinct goals of probing and how each guides the design of the probe.
Check out our short opinion piece where we draw parallels between investigating brains and neural nets!
"Probing artificial neural networks: insights from neuroscience"
Written with
@NogaZaslavsky
and
@JohnHewtt
for the
#brain2AI
#ICLR2021
workshop.
1/
I'll be excitedly yammering about structural probes and finding syntax in unsupervised representations today at 4:15 in Nicollet B/C
#naacl2019
. Even if you don't ❤️ parse trees, come by to learn a method to tell if your neural network softly encodes tree structures!
I'm so glad this content is now freely available. As head TA this past year, I had the privilege of writing and giving 3 lectures: on self-attention & Transformers, pretraining, and model analysis & explanation. I hope many find them useful in their studies!
Looking for a series to binge-watch with more depth? We are delighted to make available the latest CS224N: Natural Language Processing with Deep Learning. New content on transformers, pre-trained models, NLG, knowledge, and ethical considerations.
#NLProc
I enjoyed chatting with
@waleed_ammar
and
@nlpmattg
on
#nlphighlights
about my paper with
@chrmanning
on finding syntax in word representations. I'm very grateful to have had this opportunity to talk (at length!) about my work!
#nlphighlights
88: John Hewitt
@johnhewtt
talks about probing word embeddings for syntax by projecting to a vector space where the L2 distance between a pair of tokens approximates the number of hops between them in the dependency tree.
This is a nice paper:
On the (un)reliability of feature visualizations
[Geirhos et al]
Shows that vision model feature visualizations don't pass some sniff checks -- they can show plausible things unrelated to behavior on real inputs.
We split the problem of extrapolation to lengths not seen at train time in NNs into 1. what content to generate? 2. where to put EOS? Give up on 2 and NNs learn very different dynamics; better at 1!
BlackBoxNLP
Ben Newman, me,
@percyliang
@chrmanning
LMs make low-rank distributions (hidden dim < vocab_size) -> unavoidable errors! But samples are great if you use nucleus/top-k sampling 🤔.
Matt: truncation sampling can fix low-rank errors, AND we can use the low-rank basis to find good tokens below the truncation threshold!
Nucleus and top-k sampling are ubiquitous, but why do they work?
@johnhewtt
,
@alkoller
,
@swabhz
,
@Ashish_S_AI
and I explain the theory and give a new method to address model errors at their source (the softmax bottleneck)!
📄
🧑💻
Congratulations to Ben Newman, who spearheaded the work, for winning Outstanding Paper at
#BlackBoxNLP
, and thanks to the organizers and reviewers for your efforts! Congrats as well to the winners of the other Outstanding Paper award!
@chrmanning
Key idea: Vector spaces have distance metrics (L2); trees do too (# edges between words). Vector spaces have norms (L2); rooted trees do too (# edges between word and ROOT.) Our probe finds a vector distance/norm on word representations that matches all tree distances/norms 2/4
This claim, that parse trees are embedded through distances and norms on your word representation space, is a structural claim about the word representation space, like how vector offsets encode word analogies in word2vec/GloVe. We hope people have fun exploring this more! 4/4
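A minimal sketch of the distance half of this idea (function name and shapes are my own, not the paper's code release): learn one linear map B, and read pairwise squared L2 distances between transformed word vectors as approximate parse-tree distances.

```python
import torch

# Sketch of a structural-probe-style distance (hypothetical shapes):
# a single linear map B, then squared L2 distances between transformed
# word vectors approximate parse-tree distances (edge counts).
def probe_distances(H: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # H: (seq_len, dim) word representations; B: (rank, dim) probe matrix
    T = H @ B.T                              # (seq_len, rank) transformed vectors
    diff = T.unsqueeze(1) - T.unsqueeze(0)   # all pairwise differences
    return (diff ** 2).sum(-1)               # (seq_len, seq_len) squared distances
```

Training would then fit B so these distances match gold tree distances (number of edges between word pairs) over a treebank.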
My favorite deeper dive experiment in this paper: we wondered if putting the question _before_ the documents would remove the U-shaped effect, since the autoregressive contextualization would "know" what info to look for when processing each doc. Nope! The trend still holds.
It’s
#naacl2021
and one of the best parts of a conf is meeting new people. If you’d like to chat
#nlproc
, and especially if you couldn’t make it to the conference, email me to chat for 30 min! I can talk research, admissions, grad school++. email on my website!
We’re coming to the end of
#cs224n
and it’s so good to see students excitedly discussing the results of their work at the end of the quarter. I’m grateful to our 28 TAs for making the course work.
Ruth-Ann's great work building a Jamaican Patois Natural Language Inference dataset was picked up by Vox as part of its video "Why AI doesn’t speak every language." Happy to see Ruth-Ann's work (and disparities in NLP across languages) get this general audience coverage.
Check out this Vox video I was featured in where I chat about JamPatoisNLI which I worked on with
@chrmanning
and
@johnhewtt
! Many thanks to
@PhilEdwardsInc
for platforming our work
I'm also deeply committed to how open research dovetails with open teaching. I've twice co-taught Stanford's CS 224n: Natural Language Processing with Deep Learning; you can find some of my lectures here and here !
Excited to give this talk! Tidbits:
1) Could finite-precision RNNs implement (bounded) stacks without access to an external stack? Yes, efficiently!
2) We train probabilistic models in NLP but prove things about acceptors; what if we connect language models to formal languages?
Join John Hewitt
@johnhewtt
, computer science PhD student at
@Stanford
, for his talk on November 12 at 11am entitled "The Unreasonable Syntactic Expressivity of RNNs." Details can be found here:
These distances/norms reconstruct each tree, and are parametrized only by a single linear transformation. What does this mean? In BERT, ELMo, we find syntax trees approximately embedded as a global property of the transformed vector space. (But not in baselines!) 3/4
Backpacks are an alternative to Transformers: intended to scale in expressivity, yet provide a new kind of interface for interpretability-through-control.
A backpack learns k non-contextual sense vectors per subword, unsupervisedly decomposing the subword's predictive uses.
Just a few minutes out! Come attend or watch the livestream, or reach out to me afterward if you couldn’t attend but would like to chat about the topic!
This work, with
@mhahn29
,
@SuryaGanguli
,
@percyliang
,
@chrmanning
, has been a fascinating and challenging new direction for me, and I'm deeply appreciative to them for enabling me to pursue it.
Construction code:
Learning code:
My work has discovered structure in language models - through the structural probe (), refined probing methods (), and formalizing how models construct usable information about the solutions to hard problems ().
To represent a word in context, Backpacks use information from the whole context to non-negatively weight the senses of all subwords in the context.
So, the contribution of each sense is always towards predicting the same words; only the magnitude changes.
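A toy sketch of that prediction rule (shapes, names, and the single-position simplification are my assumptions, not the released architecture): the output for a position is a non-negative weighted sum over every sense vector in the context, so each sense always pushes logits in one fixed direction.

```python
import torch

# Sketch of the Backpack prediction rule (hypothetical shapes): each
# context subword carries k non-contextual sense vectors; a position's
# output is a non-negative contextual weighting of all of them.
def backpack_logits(senses: torch.Tensor, alphas: torch.Tensor,
                    E: torch.Tensor) -> torch.Tensor:
    # senses: (seq, k, dim) sense vectors for each context subword
    # alphas: (seq, k) non-negative weights for the position being predicted
    # E: (vocab, dim) output embedding matrix
    h = (alphas.unsqueeze(-1) * senses).sum(dim=(0, 1))  # weighted sense sum
    return h @ E.T                                       # (vocab,) logits
```

Because the logits are linear in the weights, scaling a sense's weight scales its contribution to every log-prediction predictably — the property the thread highlights.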
Does Multilingual BERT share syntactic knowledge cross-lingually? In
#acl2020nlp
paper w/
@johnhewtt
and
@chrmanning
, we visualize its syntactic structure & show it's applicable to a variety of human languages.
Paper:
Blog: (1/4)
I’m most interested in in-depth lectures or technical explainers, less interested in surface-level introductions. I’m also focused on (arguably) newer topics, since in these cases, I think personal opinions on the topics tend to come through stronger in pedagogical materials.
In my blog post, I argue that probing is a clear tool to characterize knowledge in neural networks when we didn't tell the network how to represent that knowledge.
The code should be very useful for probing studies!
By instead initializing the new embedding to the average of existing embeddings, you guarantee that the partition function of the softmax grows by at most 1/n, where n is the initial vocab size -- so the distrib doesn't change much!
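The fix itself is a one-liner; here's a sketch (function name is mine) of growing an embedding matrix this way:

```python
import torch

# Sketch of the averaging initialization: new rows of the embedding
# matrix start at the mean of the existing rows, so a new token's logit
# stays close to the model's average logit instead of landing near 0.
def grow_embedding_by_averaging(E: torch.Tensor, num_new: int) -> torch.Tensor:
    # E: (vocab, dim) embedding matrix
    mean_row = E.mean(dim=0, keepdim=True)              # (1, dim)
    return torch.cat([E, mean_row.expand(num_new, -1)], dim=0)
```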
Further, we can and must design LMs for our understanding, not just performance: I introduced the Backpack, an architecture that brings many of the control and understanding benefits of linear models and word2vec with the power of the Transformer. ()
I think there's a space of interesting work (and future work) around initializing new word embeddings (e.g., for domain adaptation) using more information -- about orthography, about the finetuning distribution, etc.; averaging will be a baseline to beat.
e.g.,
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
So, if you add a new word, since you randomly init the embedding, it gets dot product ~0 with hidden states. Softmax([-4,-4,..., 0]) puts mass on the elt with 0!
We modeled derivational morphological transformations separately as orthographic and distributional functions, then combined: go see
@_danieldeutsch
present our paper on English derivational morphology in oral session 6D today at ACL!
For one example, we observe that certain aspects of gender bias in career nouns (e.g., nurse, CEO) are represented by a Backpack in a particular sense vector (pointing towards, e.g., “he”, “she”.) By “turning down” this sense, this aspect of gender bias is reduced.
This can harm finetuning. I also show that a simple, popular heuristic -- just averaging all existing embeddings -- guarantees that adding new words doesn't deviate much from the pretrained LM, solving this problem.
Our results are about what's possible, not what's learned. But a drop of empirical results: while RNNs don't learn Dyck-k in practice (), they can learn Dyck-(k,m) well, even with a vanishingly small fraction of the possible stack states seen at training!
We analyze a few truncation-sampling algorithms, and find that our eta-sampling leads to more plausible long English documents, breaks out of repetition better, and more reasonably truncates low-entropy distributions.
With
@chrmanning
,
@percyliang
Blog:
Our proof is constructive, exactly specifying weights of 1-layer RNNs (and a separate mechanism for the LSTM using just gates) that allow RNNs to push/pop from internal stack and create probability distributions over the next token, encoding what's possible in Dyck-(k,m) strings.
To make this concrete, k: vocab size. m: max nesting depth. Let's say vocab size is 100k, and max nesting depth is 3. (Empirically, 3 is not a bad approx. of human language.)
Then before: approx 10^20 hidden units needed (give or take a few powers). We prove 150 units suffices.
I was fascinated at the emergent structure of sense vectors, and I’m really excited to see what LM interpretability research the Backpack enables.
We can design architectures that scale and learn to do some of the interpretability work for us.
This work, led by
@ruthstrong_
, provides a great new language resource in Jamaican Patois, and studies transfer in multilingual and monolingual LMs! One opportunity: studying how model predictions change as a sentence moves closer to or farther from the high-resource English.
JamPatoisNLI provides a dataset and examines how well you can do transfer to low-resource creoles like Jamaican Patois, versus other recent results for low-resource NLI. By
@ruthstrong_
@johnhewtt
@chrmanning
. At Multilingual Rep’n Workshop.
#emnlp2022
@Teknium1
Right; it's pretty rare. Most models don't have this issue. You can see for yourself by loading up gemma-2b:
>>> torch.max(model(tok('I like pizza', return_tensors='pt')['input_ids']).logits)
tensor(-4.5862)
bc max logit << 0, any new token would dominate in probability
On Dec 7, I'll be presenting
Truncation Sampling as Language Model Desmoothing
At the GEM workshop! One practical takeaway: word-level truncation decisions (from top-p or eta-sampling) can be unintuitive. A colab in which you can try these yourself:
@yoavgo
re: other languages: I expect so : ) ; we'll see. Some things in the works; lots of follow-up work to be done (hopefully by many people!)
re: syntax reps and head choices -- I'd love to hear more about that! Which representations (UD/SD/?) do ELMo/BERT match best? etc.
Human languages exhibit hierarchical structure, but don't require infinite memory to parse. We study a formal language that reflects this. Bound the memory requirements of Dyck-k (well-nested brackets of k types): you get Dyck-(k,m) (at most m brackets in memory at any time.)
A simple communication complexity argument proves that O(m log k) hidden units is optimal -- even with unbounded computation (!!), it's impossible to use asymptotically fewer. That is, RNNs are fascinatingly well-suited (imo) to handling bounded-memory hierarchy.
RNNs can't generate Dyck-k because they have finite memory (with finite precision.) They can generate Dyck-(k,m) because it's a regular language. BUT known constructions would require k^m or so hidden units (exponentially many DFA states!) We prove that O(m log k) suffices!
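The memory arithmetic behind the bound is simple to check — a stack holding at most m brackets, each one of k types, fits in m·⌈log2 k⌉ bits (this back-of-envelope helper is mine, not from the paper):

```python
import math

# Back-of-envelope for the O(m log k) bound: a bounded stack of at most
# m brackets drawn from k types needs only m * ceil(log2 k) bits --
# versus the naive k^m enumeration of DFA states.
def stack_bits(k: int, m: int) -> int:
    return m * math.ceil(math.log2(k))
```

With the thread's running example (k = 100,000 bracket types, nesting depth m = 3), this gives 3 · 17 = 51 bits — the same ballpark as the ~150 units the construction uses, versus ~10^15 states for the naive DFA.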
Rich decomposition of token meaning makes sense vectors an interesting target for interventions, and identical (in direction) contribution of token senses to log-predictions means that we can know how our interventions will affect any log-prediction of the model.
Intuitively, in early-stopped neural LMs optimizing KL, there's good reason to put "a bit of probability mass everywhere", to hedge and avoid very high loss. This smoothing is good for scoring, like in n-gram models, but bad for generation, since mass is on non-language strings.
Scott Aaronson's note is a delightful introduction to reasoning about large numbers, leading up to the Busy Beaver numbers. Years after finding that article, what fun to find Busy Beaver numbers in proofs on RNNs!
@chrmanning
@MycalTucker
@boknilev
@stanfordnlp
@ethayarajh
@percyliang
Adding to this: 𝒱-information can be 'constructed' by deterministic transformation (no data processing inequality) and 𝒱, a set of functions (linear or otherwise) is effectively a hypothesis as to how the property is encoded. Also cool: baseline can be any r.v., not just input!
Interpretability is difficult to evaluate, and often the problems of a method are best shown by empirically demonstrating that certain basic intuitions (e.g., "images that maximally activate a feature relate to the behavior of that feature on real data") don't necessarily hold.
@ssshanest
@hawkrobe
Thanks for the question! Short answer: no. Long answer: no, but only because the paper was already 32 pages long (incl. appendix) and focused on theory; I think it's a fascinating Q. My Q: do LSTMs learn something more like our RNN construction or our LSTM construction? Neither?
@nsaphra
@chrmanning
We found positive results on ELMo similar to BERT (Table 1); for unidirectional models (RNN and transformer alike) it's still unclear (unpublished expts).
Intuitively, this happens when the dot product of the new word embeddings with the neural representation of the prefix (zero-ish) is a large value relative to the other unnormalized probabilities (logits) of the language model -- seems to happen "by accident" sometimes.
These new words won't be useful until finetuning, but I was surprised to see that for some models, adding the words actually causes the LM to put all its probability on only the new words.
Our model of this leads to two principles for truncation to desmooth -- never truncate high-probability words, and never truncate words that are relatively high-probability for that particular distribution. Surprisingly, top-p breaks the first principle.
@yoavgo
Finally, on the rank experiments -- our intuitions are similar. English parse tree distance matrices are full-rank-ish in the length of the sentence. Unpublished: training the struct probe on only shorter sentences still leads to a rank-64 probe. Maybe a fun connection with l1-embeddability?
@ssshanest
@hawkrobe
Because we provide PyTorch implementations of the constructions (and code to train models), these analyses are possible to do right away! I'd love to see them done, but they're not my immediate todo. Would be happy to chat with anyone interested.
Truncation sampling algorithms restrict the neural LM support (words with non-zero probability.) We interpret this as approximating the true distribution support and ask, words of what probability are likely to be in that support?
@yoavgo
Totally agreed about controls for other trees and the extent to which they're encoded. Thinking about what the right kind of spurious trees that aren't easily recoverable from simple transform of linear structure + parse structure. Random needs some consistency to the test set.
@percyliang
We claim that good probes are "selective," achieving high accuracy on linguistic tasks, and low acc on control tasks. Between probes, small gains in linguistic acc can correspond to big selectivity losses; gains may be from added probe capacity, not repr properties.
So what if BERT only captured the ambiguous POS cases? It would explain _less_ about POS than does the word identity. Probing would indicate this with a negative result.
But really, in this case BERT captures an important part of POS! So, how do we measure it?
Further, we show that conditional probing actually can estimate the conditional _usable_ information from the representation to the property, after conditioning on the baseline! I(BERT->POS | word identity) So it's not just an intuitive tool; it has a clear interpretation.
Wondering under what circumstances visual signal is useful in translation? Feeling a desire for multimodal, multilingual NLP? Use our dataset of images representing words across 100 languages, and check out our poster in Session 3E with Daphne Ippolito
This was true of a bracket closing task and SCAN length extrapolation splits. On MT (De-En): no conclusive results. Bonus: can sometimes train a post-hoc model to predict EOS after the fact. A promising direction??
Congrats to Ben on spearheading a great analysis paper.
Our conditional probing method does this, and it's as simple as training probe 1 on the baseline, and probe 2 on the concatenation of baseline and representation! Intuitively, now we measure just what "extra" the representation provides, not what it lacks.
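Concretely, the two-probe setup is just this (dimensions and names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

# Sketch of conditional probing: probe1 sees only baseline features;
# probe2 sees [baseline; representation]. Whatever probe2 gains over
# probe1 is credited to the representation beyond the baseline.
dim_base, dim_repr, num_labels = 8, 16, 5
probe1 = nn.Linear(dim_base, num_labels)
probe2 = nn.Linear(dim_base + dim_repr, num_labels)

def probe2_input(baseline: torch.Tensor, repr_: torch.Tensor) -> torch.Tensor:
    # concatenate baseline and representation features per token
    return torch.cat([baseline, repr_], dim=-1)
```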
@esalesk
So, some thoughts: (1) averaging fewer words might be better in terms of having more semantically meaningful inputs, but (2) averaging more words gives you a closer approximation to the original LM. Could be a fun tradeoff to explore.
Our eta-sampling both (1) never truncates a word that's higher-prob than a hyperparam threshold, and (2) never truncates a word higher-prob than an entropy-dependent threshold. Top-p, in contrast, truncates high-prob words (e.g., only allowing "is" for the prefix "My name")
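A sketch of the resulting keep-mask (I'm assuming the threshold takes the form eta = min(eps, sqrt(eps)·exp(-entropy)); see the paper for the exact rule):

```python
import torch

# Sketch of an eta-sampling keep-mask (assumed threshold form):
# keep tokens whose probability exceeds
#   eta = min(eps, sqrt(eps) * exp(-entropy)).
# High-entropy distributions lower eta, so more tokens survive.
def eta_keep_mask(probs: torch.Tensor, epsilon: float = 2e-3) -> torch.Tensor:
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    eta = min(epsilon, epsilon ** 0.5 * torch.exp(-entropy).item())
    return probs > eta
```

Note how both principles show up: a token above epsilon is never truncated, and the entropy term relaxes the cutoff further for flat distributions.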
If we’ve chatted like this at a previous conference, remind me when you reach out! I’d love to know how things have gone, and what advice or opportunities turned out to be useful, in part so I can improve how useful I can be to folks in the future.
This work with fellow Stanford NLP PhD
@ethayarajh
, and my advisors
@percyliang
and
@chrmanning
; my thanks to them for enduring a long and winding process with this paper, e.g., reformulating 𝒱-information to capture conditional information quantities (see Appendix!)
@gattardi
@chrmanning
I'll plug the code while I'm at it -- pre-trained probe for BERT-large layer 16 (and easy interface to BERT) makes these quick questions easy to ask!
Lots of hyperparameters when designing probes, and probing results conflate representation, probe, and data, making interpretation difficult.
A control task can help design, and help interpret.
code:
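The construction of a control task can be sketched in a few lines (my illustration of the idea): assign each word type a fixed random label, so the only way a probe can succeed is by memorizing word identity.

```python
import random

# Sketch of a control task: each word type gets a fixed random label,
# so control-task accuracy reflects probe capacity (memorization),
# not linguistic structure in the representation.
def make_control_task(vocab, num_labels: int, seed: int = 0):
    rng = random.Random(seed)
    return {w: rng.randrange(num_labels) for w in vocab}

# Selectivity = linguistic-task accuracy minus control-task accuracy.
```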