Samuel Marks Profile
Samuel Marks

@saprmarks

1,173
Followers
85
Following
34
Media
218
Statuses

AI safety research @AnthropicAI . Prev postdoc in LLM interpretability with @davidbau , math PhD at @Harvard , director of technical programs at

Boston
Joined October 2023
Pinned Tweet
@saprmarks
Samuel Marks
5 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
308
@saprmarks
Samuel Marks
3 months
I had the great pleasure of learning about this about 30 mins before the rest of the world, when I arrived today for my first day of work at @AnthropicAI and Jan was sitting next to me.
@janleike
Jan Leike
3 months
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
370
524
9K
3
1
384
@saprmarks
Samuel Marks
10 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
306
@saprmarks
Samuel Marks
8 months
I've made a post here about a surprising observation about LLM representations: LLMs seem to linearly represent XORs of arbitrary features, even when there's no reason to do so. I also write about the consequences this has for interp research
7
26
259
@saprmarks
Samuel Marks
2 months
@ericneyman (I got this question wrong.)
2
0
183
@saprmarks
Samuel Marks
4 months
When working on interpretability research, I face two main practical challenges:
1. Tooling: I sometimes feel like a neurosurgeon with dull scalpels
2. Models too big to fit on my machine
I'm excited about nnsight & NDIF, a pair of projects aimed at addressing these problems! 🧵
Tweet media one
Tweet media two
1
11
117
@saprmarks
Samuel Marks
27 days
Spotted in a paper draft. Now I can't stop fantasizing about writing Marks & Engels (2024). Want to team up @JoshAEngels ?
Tweet media one
3
1
92
@saprmarks
Samuel Marks
2 months
Idea: make sure AIs never learn dangerous knowledge by censoring their training data
Problem: AIs might still infer censored knowledge by "connecting the dots" between individually benign training documents!
@j_treutlein and @damichoi95 formalize this phenomenon as "inductive
Tweet media one
@OwainEvans_UK
Owain Evans
2 months
New paper, surprising result: We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can: a) Define f in code b) Invert f c) Compose f —without in-context examples or chain-of-thought. So reasoning occurs non-transparently in weights/activations!
Tweet media one
33
213
2K
1
7
70
@saprmarks
Samuel Marks
4 months
Constellation -- an AI safety research center in Berkeley, CA -- is launching two new programs!
* Visiting Fellows: 3-6 months visiting (w/ travel, housing, & office space covered)
* Constellation Residency: 1yr salaried position
1
6
51
@saprmarks
Samuel Marks
6 months
Here's a feature from a Pythia-70m attention head found with an SAE and noticed by my collaborator @amuuueller a few months ago. Months later, it's still my favorite feature. Can you figure out what it's doing?
Tweet media one
2
6
44
@saprmarks
Samuel Marks
1 month
Many people cite "Interpretability for finding adversarial inputs" as the most plausible path-to-impact for interpretability. So I'm very excited to see @loganriggssmith & @BrinkmannJannik 's work using interp to red-team preference models!
1
3
41
@saprmarks
Samuel Marks
10 months
One fun tidbit is that this linear structure emerges over layers: true and false statements separate earlier for simple statements, and later for complicated ones (e.g. "It is the case both that the city of Chicago is in Madagascar and that the city of Beijing is in China"). 3/N
1
2
40
@saprmarks
Samuel Marks
10 months
By the way, if you like these visualizations, you can explore these datasets for yourself at our interactive dataexplorer: 4/N
1
2
35
@saprmarks
Samuel Marks
2 months
Alice: RL agents might learn to directly hack their reward function to always output +1000
Bob: Is that even possible? Why would the agent explore such a crazy action in the first place?
This paper: b/c of generalization from easier-to-discover forms of specification gaming
@AnthropicAI
Anthropic
2 months
New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here:
Tweet media one
22
184
980
0
3
36
@saprmarks
Samuel Marks
8 months
My two top takeaways are: if models linearly represent XORs of arbitrary features, then... 1. this implies qualitatively new reasons that linear probes might fail to generalize. E.g. probe directions can be affected by irrelevant features which are *constant* in training data!
1
1
30
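As a synthetic illustration of the generalization failure described above (this toy demo is not from the post itself; the directions, dimensions, and data are all made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d = 32
    # Three (roughly orthogonal) directions: feature a, irrelevant feature b, and a XOR b.
    e_a, e_b, e_ab = rng.standard_normal((3, d))

    def make_data(n, b_value):
        a = rng.integers(0, 2, n)
        b = np.full(n, b_value)
        x = (np.outer(a, e_a) + np.outer(b, e_b) + np.outer(a ^ b, e_ab)
             + 0.1 * rng.standard_normal((n, d)))
        return x, a

    # Train a probe for feature a while b is *constant* (b = 0), so a XOR b == a.
    X_tr, y_tr = make_data(2000, b_value=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # In distribution the probe looks perfect, but when b flips, the a-XOR-b direction
    # it picked up now points the wrong way and accuracy typically drops sharply.
    print("b = 0 accuracy:", probe.score(*make_data(2000, b_value=0)))
    print("b = 1 accuracy:", probe.score(*make_data(2000, b_value=1)))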
@saprmarks
Samuel Marks
9 months
This is one of my favorite AI safety papers of this year! The authors train an LLM to imitate the systematic mistakes of a user (Bob, who is bad at addition). Then they show that, even though the LLM lies to Bob, we can still extract truthful answers from its internals.
@alextmallen
Alex Mallen
9 months
New @AiEleuther paper on Eliciting Latent Knowledge! We finetune "quirky" LMs to make systematic errors on arithmetic problems when "Bob" is in the prompt, and show that we can recover their knowledge of the correct answer from their activations. 📄
Tweet media one
3
19
157
1
1
31
@saprmarks
Samuel Marks
4 months
Can you figure out how many interacting circuits are involved in a behavior just by looking at loss curves? Maybe! In this cool paper, @Aaditya6284 et al. study the emergence of circuits in isolation by retraining models with certain activations "clamped" to post-training values
@Aaditya6284
Aaditya Singh
5 months
In-context learning (ICL) circuits emerge in a phase change... Excited for our new work "What needs to go right for an induction head (IH)?" We present "clamping", a method to causally intervene on dynamics, and use it to shed light on IH diversity + formation. Read on 🔎⏬
2
44
198
0
3
30
@saprmarks
Samuel Marks
10 months
For example, before neurosurgery, our LLM was 72% confident that the statement "The Spanish word 'uno' means 'floor'" is false. But by identifying where the LLM stores this info and overwriting it, we can make the LLM 70% confident that the statement is true! 9/N
Tweet media one
1
1
29
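Mechanically, interventions of this kind amount to shifting the residual stream along the identified truth direction at some layer. A rough PyTorch-hook sketch under that assumption (the layer index, scale, and direction `theta` are placeholders, not the paper's exact recipe):

    import torch

    def make_truth_edit_hook(theta: torch.Tensor, alpha: float):
        """Forward hook that nudges hidden states along a candidate truth direction."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * theta  # push "false" representations toward "true"
            return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden
        return hook

    # Hypothetical usage on a HuggingFace decoder layer:
    # handle = model.model.layers[12].register_forward_hook(make_truth_edit_hook(theta, alpha=5.0))
    # ...run the model on "The Spanish word 'uno' means 'floor'" and read off its confidence...
    # handle.remove()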
@saprmarks
Samuel Marks
5 months
Thrilled to release this preprint (along with my wonderful coauthors!). Stay tuned for our paper thread. And thanks to @StephenLCasper for loudly and insistently advancing his arguments that MI should have use cases -- they were very influential on this work!
@StephenLCasper
Cas (Stephen Casper)
5 months
I've complained about mech interp for years, but I have no complaints about this. If circuits-style MI is going to be useful and competitive for real-world diagnostics/debugging (a big 'if'), I think this is the kind of work that may make this possible.
3
26
170
0
3
27
@saprmarks
Samuel Marks
10 months
Step 1 in our investigation is building datasets of simple, unambiguous true/false statements and visualizing LLM representations of these statements. We see clear linear structure, with true statements separating from false ones! 2/N
Tweet media one
1
1
28
@saprmarks
Samuel Marks
22 days
We want SAEs which interpretably capture "everything the model knows" about its input. But how do we get ground truth on "everything the model knows"? Fun paper with @a_karvonen Ben Wright @can_rager & others! We study board-game-playing LLMs; these models must mentally track an
@a_karvonen
Adam Karvonen
22 days
Our paper—which we presented as an 📢 Oral Presentation 📢 at the ICML Mech Interp workshop—swaps natural language for board games, a setting where we can evaluate Sparse Autoencoders (SAEs) more rigorously! We also introduce p-annealing, a new SAE training method. 🧵
Tweet media one
3
12
72
1
2
26
@saprmarks
Samuel Marks
10 months
Given these observations - and following prior work by @CollinBurns4 , @JacobSteinhardt , and others - we ask: do LLMs represent the truth or falsehood of statements along a single "truth direction"? We provide evidence that the answer is yes! (And set SoTA on finding it) 5/N
Tweet media one
1
4
26
@saprmarks
Samuel Marks
10 months
E.g. a truth direction found using only statements of the form "x is larger/smaller than y" is 97% accurate at classifying statements about Spanish-English translation, like "The Spanish word 'gato' means 'cat'."! 7/N
1
1
25
@saprmarks
Samuel Marks
5 months
How do we discover circuits on these sparse features? We fold sparse autoencoders into the LM’s computation, and use attribution patching to quickly estimate each feature’s contribution to the LM’s output.
2
2
24
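The attribution-patching step is a first-order approximation of each feature's effect. A minimal sketch of that estimate, assuming clean/patched SAE feature activations and the metric's gradient have already been collected (tensor names here are illustrative):

    import torch

    def attribution_scores(f_clean: torch.Tensor,
                           f_patch: torch.Tensor,
                           grad_clean: torch.Tensor) -> torch.Tensor:
        """Linear estimate of each SAE feature's effect on the output metric.

        f_clean, f_patch: feature activations on the clean / patched input, shape [n_features].
        grad_clean: gradient of the metric w.r.t. the clean feature activations, shape [n_features].
        """
        # effect_i ≈ dmetric/df_i * (f_patch_i - f_clean_i)
        return grad_clean * (f_patch - f_clean)

    # Features with the largest |score| become candidate nodes in the circuit.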
@saprmarks
Samuel Marks
8 months
2. Interp work needs to either (a) explain how these XOR feature directions are different from the rest, or (b) be robust to an exponential increase in the number of features in models.
2
1
22
@saprmarks
Samuel Marks
10 months
Rather, we show that a dead-simple alternative works better: just take the direction pointing from the mean of the false datapoints to the mean of the true datapoints. That's it! We show that these "mass-mean" directions work better than LR, especially for neurosurgery. 13/N
1
1
23
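A minimal sketch of the mass-mean recipe described here, assuming a matrix of residual-stream activations with binary truth labels (variable names are illustrative):

    import torch

    def mass_mean_direction(acts: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """acts: [n_statements, d_model]; labels: [n_statements] bool, True for true statements."""
        mu_true = acts[labels].mean(dim=0)
        mu_false = acts[~labels].mean(dim=0)
        return mu_true - mu_false  # points from the false mean toward the true mean

    def predict_true(act: torch.Tensor, direction: torch.Tensor, threshold: float = 0.0) -> bool:
        """Classify a single statement's activation by projecting onto the direction."""
        return (act @ direction).item() > threshold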
@saprmarks
Samuel Marks
10 months
What evidence? And what does it even mean to be a "truth direction"? We show two things: 1. Directions extracted from one true/false dataset are accurate for classifying true/false statements from structurally and topically diverse datasets. 6/N
1
1
22
@saprmarks
Samuel Marks
4 months
In a new blog post, I explain how our work in Sparse Feature Circuits fits into a broader AI safety agenda. I also introduce a concrete problem, Discriminating Behaviorally Identical Classifiers (DBIC), for guiding future interpretability work.
@saprmarks
Samuel Marks
5 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
308
1
3
21
@saprmarks
Samuel Marks
10 months
In many ways, this paper could be viewed as a companion paper to @wesg52 and @tegmark 's recent paper on LLM space/time representations. Both ask "Do LLMs have world models? And how do we find them?" Wes focuses on space and time, and I focus on truth. 15/N
@wesg52
Wes Gurnee
11 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
182
1K
6K
2
1
20
@saprmarks
Samuel Marks
10 months
For me, the most exciting part is our progress on extracting a truth direction from labeled true/false datasets. But first, I hear the skeptics: "LMs are just statistical engines with no notion of truth! You're probably detecting text which is likely/unlikely, not true/false"
1
1
20
@saprmarks
Samuel Marks
10 months
Okay, let's talk truth direction extraction! The most common way researchers have done this is with logistic regression. But based on geometrical issues arising from the superposition hypothesis of @ch402 and others, we think logistic regression is actually quite bad here! 12/N
Tweet media one
1
1
20
@saprmarks
Samuel Marks
4 months
You're given two models: one is aligned, and one pretends to be aligned when watched. Can you tell the difference? Cool work by @joshua_clymer @kh4dien and @sevdeawesome ! (Aside: I'm proud to have suggested the top-performing "get the models drunk" method to the authors!)
@joshua_clymer
Joshua Clymer
4 months
Can we catch models that fake alignment by examining their internals? Our new paper trains 'sleeper agents' that strategically pretend to be safe and finds methods to unmask them.
Tweet media one
7
13
96
0
1
20
@saprmarks
Samuel Marks
4 months
As new interp techniques applied to larger models produce an explosion in human-interpretable data for analysis, we'll need new ways to scale up our interp workforce. IMO @cogconfluence et al are on exactly the right track here.
@cogconfluence
Sarah Schwettmann
4 months
MAIA (A Multimodal Automated Interpretability Agent) is here! 🧵 📝New paper: 🌐Website: Agents like MAIA advance automated interpretation of AI systems from one-shot feature description into an interactive regime where hypotheses
5
41
148
0
0
17
@saprmarks
Samuel Marks
10 months
2. Even better, we can use our identified truth directions to perform **LLM neurosurgery**, getting our LLM to treat false statements as true and vice versa. 8/N
1
1
16
@saprmarks
Samuel Marks
5 months
One more thank you: thanks to the Neuronpedia team @JBloomAus and @johnnylin ! Going from my first message to them ("Crazy ask, but...") to hosting our SAEs a few days later, they moved incredibly quickly
2
0
16
@saprmarks
Samuel Marks
10 months
BTW, I should shout out the excellent prior work by @ke_li_2021 and @_oampatel_ . Their "Inference-Time Intervention" paper was a major source of inspiration (and lent the name to "mass-mean probing")
1
1
16
@saprmarks
Samuel Marks
10 months
This is a reasonable concern, and we check for it in a few ways. 1. Constructing datasets in which true text diverges from probable text. E.g. our LLM judges that "China is NOT a country in" most likely ends with "Asia." (See paper for more.) 2. Our neurosurgery experiments from above.
2
1
16
@saprmarks
Samuel Marks
10 months
One fun obstacle to extracting truth directions: truth directions you get from different datasets sometimes look very different (e.g. are orthogonal!) Our experiments suggest why: confounding features that correlate inconsistently with truth. Solution: use more diverse data 14/N
Tweet media one
2
1
14
@saprmarks
Samuel Marks
4 months
Let's start with nnsight. You can think of nnsight as an easier-to-read-and-write alternative to pytorch hooks. E.g., see the difference between saving an MLP activation with forward hooks vs. nnsight. The nnsight code is minimal and elegant, a tool that doesn't get in my way.
Tweet media one
1
0
14
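A rough sketch of the comparison, using GPT-2 for concreteness (the nnsight half assumes the LanguageModel/trace API; exact syntax may differ between nnsight versions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Plain PyTorch: save one MLP activation with a forward hook.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tok = AutoTokenizer.from_pretrained("gpt2")
    saved = {}

    def save_mlp(module, inputs, output):
        saved["mlp_5"] = output.detach()

    handle = model.transformer.h[5].mlp.register_forward_hook(save_mlp)
    with torch.no_grad():
        model(**tok("Hello world", return_tensors="pt"))
    handle.remove()

    # nnsight (assumed API): the same thing reads as a straight-line trace.
    # from nnsight import LanguageModel
    # lm = LanguageModel("gpt2")
    # with lm.trace("Hello world"):
    #     mlp_5 = lm.transformer.h[5].mlp.output.save()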
@saprmarks
Samuel Marks
5 months
Some interactive demos:
🔍 Browse auto-discovered model behaviors and circuits:
🧠 Explore our SAE features on Neuronpedia:
E.g. see this Neuronpedia profile for one of our fav features!
Tweet media one
@saprmarks
Samuel Marks
6 months
Here's a feature from a Pythia-70m attention head found with an SAE and noticed by my collaborator @amuuueller a few months ago. Months later, it's still my favorite feature. Can you figure out what it's doing?
Tweet media one
2
6
44
1
0
14
@saprmarks
Samuel Marks
5 months
Lots of prior work on robustness to spurious signals, from @polkirichenko , @Pavel_Izmailov , @shiorisagawa , @OrgadHadas , @norabelrose , @ravfogel , @AdtRaghunathan , & many more Our work is different: it works even if the spurious & intended signals are equally predictive of labels
1
0
13
@saprmarks
Samuel Marks
5 months
But LMs have many difficult-to-anticipate behaviors and mechanisms. Can we get circuits for those as well? Yes! By combining our method with @ericjmichaud_ / @tegmark ’s quanta discovery methods, we automatically discover thousands of feature circuits from raw text data.
@ericjmichaud_
Eric J. Michaud
1 year
Our method, which we call QDG for "quanta discovery from gradients" is based on spectral clustering with model gradients. With QDG, we auto-discover a variety of capabilities/behaviors of a small language model. Here are a couple of clusters:
Tweet media one
1
0
14
1
1
13
@saprmarks
Samuel Marks
2 months
@RichardMCNgo Biggest complaints:
1. Really needs to have better fast travel
2. Impending deadline gives a stressful Sword of Damocles feel, discourages side quests
3. Illness: adds absolutely nothing, who requested this feature??
1
0
13
@saprmarks
Samuel Marks
5 months
See our project website, paper, and github repo for more! 🌐 📜 🧑‍💻
2
0
13
@saprmarks
Samuel Marks
5 months
Feature circuits consist of fine-grained, interpretable units, and are useful downstream! For example, we can fully debias a profession classifier, even if profession and gender are perfectly correlated in all our classification data.
1
1
11
@saprmarks
Samuel Marks
5 months
There’s a lot of great work from @ch402 , @NeelNanda5 , @ArthurConmy , @rowankwang et al. on discovering LM circuits. But these circuits are difficult to apply b/c they localize LM behaviors to MLPs/attn heads/neurons, which are difficult to interpret and play many roles at once
1
0
11
@saprmarks
Samuel Marks
5 months
Instead, we follow @HoagyCunningham @TrentonBricken et al., and use sparse autoencoders to automatically discover meaningful directions to build circuits with.
@AnthropicAI
Anthropic
11 months
Dictionary learning works! Using a "sparse autoencoder", we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text. 📄
Tweet media one
6
99
871
1
0
11
@saprmarks
Samuel Marks
5 months
What “interpretable units” are our circuits built from? If we knew what the “right concepts” were ahead of time, we could use methods like DAS (from @ChrisGPotts @noahdgoodman @ZhengxuanZenWu ) to find corresponding directions in our LM’s latent space. But we don’t!
@ZhengxuanZenWu
Zhengxuan Wu
1 year
📣How does 🔥Alpaca🦙 follow your instructions? Mechanistic interpretability at scale – our new paper identifies the causal mechanisms the Alpaca 7B model uses to solve simple reasoning tasks (with Atticus Geiger, @ChrisGPotts , and @noahdgoodman !) Paper:
Tweet media one
2
60
251
1
0
11
@saprmarks
Samuel Marks
5 months
Finally, thanks to @loganriggssmith and @BrinkmannJannik for help training SAEs and to @StephenLCasper and @NeelNanda5 among others for useful discussion!
1
0
10
@saprmarks
Samuel Marks
8 months
Much more in the post.
0
1
10
@saprmarks
Samuel Marks
5 months
For example, here’s a circuit we discovered for subject-verb agreement. The features in the circuit are easy to interpret and illustrate an intuitive algorithm.
Tweet media one
1
0
10
@saprmarks
Samuel Marks
4 months
FAQ: Is this meant to replace @NeelNanda5 's TransformerLens? A: No! nnsight tries to provide a general-purpose, model-agnostic interpretability API. TL tries to provide a unified approach for interp on transformers specifically. I view these as mostly complementary goals.
1
0
8
@saprmarks
Samuel Marks
4 months
NDIF currently hosts five models, including two Llama-3 models, with plans to add more! If Meta open-sources a 400B+ parameter Llama-3 model, we're hoping NDIF will make it easy for a variety of researchers to study its internals.
Tweet media one
1
0
9
@saprmarks
Samuel Marks
5 months
Feature circuits provide useful insights into automatically discovered LM behaviors. For example, while some behaviors seem like cohesive single mechanisms, circuits-level analysis reveals a union of distinct mechanisms, like multiple “narrow” induction patterns.
Tweet media one
1
0
8
@saprmarks
Samuel Marks
6 months
@joshua_clymer I agree that there are non-model-internals-based sources of evidence about whether models are faking alignment. (In fact, I expect most of our evidence will come from such sources). That said, I don't love your examples here.
2
0
8
@saprmarks
Samuel Marks
10 months
Our alignment scheme for children (+ pets) is quite similar to RLHF: children do stuff, we look at what they do, and we reward them (with e.g. candy, praise) if we like what we see. This scheme suffers from the same issues I expect we'll encounter in aligning AI systems: 1/N
@AndrewYNg
Andrew Ng
10 months
Argh, my son just figured out how to steal chocolate from the pantry, and made a brown, gooey mess. Worried about how hard it is to align AI? Right now I feel I have better tools to align AI with human values than align humans with human values, at least in the case of my 2
139
67
1K
1
1
8
@saprmarks
Samuel Marks
6 months
@amuuueller Answer: it's activating on the final token of relative clauses or prepositional phrases which modify a sentence's subject!
2
0
7
@saprmarks
Samuel Marks
4 months
In fact, nnsight piggybacks on TL's great uniformization work via the UnifiedTransformer class, a crazy frankenstein mixture of nnsight and TL. You can write nnsight code which extracts activations from the standard TL hookpoints!
Tweet media one
1
0
6
@saprmarks
Samuel Marks
4 months
Huge kudos to @jadenfk23 , who is currently the main manpower behind nnsight/NDIF, and never complains when I need tech support at 2am🙂 Thanks to @davidbau for having the vision for these projects. See & to learn more/get involved
0
0
7
@saprmarks
Samuel Marks
4 months
Here’s another example, where I do a concise and readable replication of part of the ROME paper for LLaMA-2-7B, showing that patching the MLP layer 1-5 activations from "Eiffel Tower" to "Colosseum" changes the model's prediction for "The Eiffel Tower is in the city of..."
Tweet media one
1
0
7
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 sometimes talks about his experience discovering high/low frequency detectors in image models: initially, they seemed messy and confusing, but are actually beautiful and structured once you understand what's going on. Just so with this feature :)
1
0
7
@saprmarks
Samuel Marks
10 months
For much, much more see our paper! paper: code: interactive dataexplorer: 16/16
2
1
6
@saprmarks
Samuel Marks
2 months
@RichardMCNgo 1. Navigating the development of a transformative technology is intellectually and strategically stimulating
2. Good replay value: lots of interesting player characters to replay as
3. Satisfying win condition: becoming immortal
4. One Earth-sized planet is a good gameboard size
1
0
6
@saprmarks
Samuel Marks
6 months
@laurolangosco @ohabryka @AnthropicAI It also depends on whether the spirit of the commitment is about (1) not creating hype, (2) not contributing to race dynamics, or (3) not actually advancing the frontier. This arguably does (1) (OTOH the announcement is pretty toned-down IMO)
1
0
6
@saprmarks
Samuel Marks
4 months
Okay, on to NDIF. NDIF is computing infrastructure for remote execution of white-box experiments on large models. You write nnsight code and it gets remotely executed on NDIF. From your perspective, it's as if you had a giant model on your own machine!
@davidbau
David Bau
4 months
I am delighted to officially announce the National Deep Inference Fabric project, #NDIF . NDIF is an @NSF -supported computational infrastructure project to help YOU advance the science of large-scale AI.
Tweet media one
9
62
279
1
0
6
@saprmarks
Samuel Marks
4 months
In cognition-based oversight, we evaluate models based on whether they perform intended cognition, not whether they produce intended outputs. E.g. a possible workflow here is: inspect a model's cognition in search of "red flags." This is how SHIFT works.
Tweet media one
@saprmarks
Samuel Marks
5 months
Feature circuits consist of fine-grained, interpretable units, and are useful downstream! For example, we can fully debias a profession classifier, even if profession and gender are perfectly correlated in all our classification data.
1
1
11
1
1
6
@saprmarks
Samuel Marks
16 days
@RichardMCNgo Those all seem like the same concept to me, but I seem smart in conversations.
0
0
6
@saprmarks
Samuel Marks
4 months
One more example, this one a live piece of research code. In Sparse Feature Circuits, we inserted SAEs into models, but wanted a weird backward pass that e.g. pretended the SAEs weren't there. Solution: use nnsight to intervene on the backward pass!
Tweet media one
1
0
5
@saprmarks
Samuel Marks
4 months
@StephenLCasper Clearly you might worry that this result is an artifact of the way the sleeper agent was trained. But based on "probing has a ton of challenges IRL," it sounds like you think this approach is deficient in a more fundamental way?
1
0
5
@saprmarks
Samuel Marks
9 months
@trevposts My current best guess is that the "sources close to Altman" were strategically leaking false or misleading info. In addition to the reasons you named, the leak probably also helps Sam's bargaining position for whatever the new venture is.
0
0
5
@saprmarks
Samuel Marks
4 months
nnsight supports any huggingface model (in fact, any torch.nn.Module). For example @arnab_api has used nnsight to study Mamba!
@arnab_api
Arnab Sen Sharma
4 months
Then we ask: can we apply ROME to update/insert a fact in Mamba? ROME can be applied directly as well! When applied to different projection matrices in MambaBlock across all the layers, we see that ROME can successfully update/insert a fact in a range of layers.
Tweet media one
1
0
3
1
0
4
@saprmarks
Samuel Marks
1 month
@RichardMCNgo @trevposts @JeffLadish @jeremyphoward The way I would prefer to say this is: we have uncertainty over what experimental procedure the experimenter used (e.g. whether the procedure involved p-hacking), and the researcher's mindset provides evidence about the procedure used.
0
0
5
@saprmarks
Samuel Marks
4 months
For example, suppose I want to repeat my LLaMA-2-7B experiment from before with the 10x larger LLaMA-2-70B. Just swap out "7b" for "70b" and add a "remote=True" argument
Tweet media one
1
0
5
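Concretely, the swap looks roughly like the following, assuming the nnsight LanguageModel API (checkpoint names and the module path are illustrative, and exact syntax may differ between nnsight versions):

    from nnsight import LanguageModel

    # Local run on the 7B model:
    lm = LanguageModel("meta-llama/Llama-2-7b-hf", device_map="auto")
    with lm.trace("The Eiffel Tower is in the city of"):
        mlp_out = lm.model.layers[5].mlp.output.save()

    # Same experiment on the 70B model, executed remotely on NDIF:
    lm = LanguageModel("meta-llama/Llama-2-70b-hf")
    with lm.trace("The Eiffel Tower is in the city of", remote=True):
        mlp_out = lm.model.layers[5].mlp.output.save()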
@saprmarks
Samuel Marks
1 month
@loganriggssmith @BrinkmannJannik FWIW I don't think interp for finding adversarial inputs is the *most impactful* interp application. But I'd guess it's the most likely one to pan out. I also think that research which gets interp some practical wins (even wins not on the most critical path) is worthwhile.
1
0
5
@saprmarks
Samuel Marks
10 months
@MaxNadeau_ @thomasahle I agree that this doesn't require planning, but I think it's at least some intuitive evidence for planning-adjacent cognition. Concretely, I bet that the information that the model intends to output "orange" is already present when the model outputs a/an.
1
0
4
@saprmarks
Samuel Marks
11 months
I had a great time speaking at @MITFutureTech on my recent paper with @tegmark (Twitter thread on the paper coming soon!)
@MITFutureTech
MIT FutureTech
11 months
Samuel Marks discussing research with @tegmark exploring whether language models linearly represent the truth or falsehood of factual statements. #MITAIScaling @MITFutureTech
Tweet media one
0
0
3
0
0
3
@saprmarks
Samuel Marks
9 months
@ch402 When I was teaching linalg + multivariate calculus at Harvard in Spring 2022, **PCA** wasn't even part of the curriculum! I indeed expect this to change, with more data science/DL applications and less differential equations (~half the course when I was teaching!)
0
0
2
@saprmarks
Samuel Marks
6 months
@joshua_clymer 2. Evidence from inductive biases is just pretty weaksauce. If you end up 90%+ confident that models will/won't scheme based on inductive bias arguments alone (and no empirics), then something's gone wrong.
3
0
4
@saprmarks
Samuel Marks
4 months
@StephenLCasper Cool, I think I agree with you about all this. (BTW you might be interested in chatting with Oam and Rowan about a "probing is actually good" project they're working on! They compare probing model internals to classifiers applied to inputs/outputs.)
0
0
4
@saprmarks
Samuel Marks
4 months
@NeelNanda5 @KKumar_ai_plans @daniel_271828 I agree with what Neel wrote here, but worth mentioning that certain types of non-mech-int model internals research could possibly be done via a grey-box API that doesn't pose weight exfiltration risks. E.g. grey-box access is sufficient for most steering vectors work.
2
0
4
@saprmarks
Samuel Marks
4 months
@StephenLCasper Can you say more about what seems like a red flag to you / why you think that simple probing is inadequate in this setting?
2
0
4
@saprmarks
Samuel Marks
10 months
@thomasahle Concrete example that I've found surprises people who take the "LMs are myopic predictors" perspective too far.
Tweet media one
Tweet media two
1
0
4
@saprmarks
Samuel Marks
10 months
@CurtTigges Cool stuff! Re info stored above placeholder tokens: I found something similar in my recent work on truth directions. It generally seems to me that models like to store "metadata" about clauses (e.g. sentiment/truth) over these placeholder tokens.
@saprmarks
Samuel Marks
10 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
306
1
0
4
@saprmarks
Samuel Marks
9 months
Conceptually speaking, the point is that in the settings I care about, easy-to-hard and truthful-to-untruthful distribution shifts *coincide* (rather than being independent). So I want to win at ELK without having access to labels distinguishing "true" from "what humans believe"
1
0
4
@saprmarks
Samuel Marks
9 months
@DavidSKrueger I disagree that all interpretability work is for assurance. E.g. model editing (for steering), ELK (for scalable oversight), and interp-assisted red-teaming (for robustness). IMO interpretability refers to a class of methods, which can be applied towards a variety of goals.
1
0
3
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 h/t @MaxNadeau_ for noticing the narrative connection to high/low frequency detectors
1
0
3
@saprmarks
Samuel Marks
27 days
@ericneyman @JoshAEngels That's fine. The more difficult constraint is that we would have to be the only two authors...
1
0
2
@saprmarks
Samuel Marks
2 months
@RichardMCNgo 2. You need to argue that being the human-in-particular which best understands the AI is very valuable above-and-beyond an AI which understands AIs better (and can explain it better to humans).
1
0
3
@saprmarks
Samuel Marks
1 month
@StephenLCasper FWIW I'd consider the doctor vs. nurse experiment from SFC to be a solid 3.5. 3 because the task is toy, but 4 because I don't know of any other technique by which you could accomplish the toy task (i.e. it beats all the non-interp baselines). (I like this tier list)
1
0
3
@saprmarks
Samuel Marks
4 months
A key problem in scalable oversight is that we might not be able to distinguish good vs. bad models on the basis of their I/O behavior alone. Assuming we can't rely on I/O, what's left? I think we need to look at model internals. Let's call this **cognition-based oversight**
Tweet media one
1
0
3
@saprmarks
Samuel Marks
2 months
@RichardMCNgo Doesn't seem right to me: 1. Wage increases for professional-athlete-level fitness seems like a contingent result of wealthier societies spending more on leisure, not anything analogous to the situation with AIs surpassing human intelligence
1
0
3
@saprmarks
Samuel Marks
10 months
@ArthurConmy @MaxNadeau_ @thomasahle @gwern I don't think I disagree with you (or Max) about anything here: a purely myopic objective can (and did) give rise to this behavior. That said, some people I've talked to take the point that LMs are trained myopically too far and end up being surprised by this example.
1
0
3
@saprmarks
Samuel Marks
10 months
@lathropa @tegmark It's at the bottom of the thread - here you go
@saprmarks
Samuel Marks
10 months
For much, much more see our paper! paper: code: interactive dataexplorer: 16/16
2
1
6
0
0
3
@saprmarks
Samuel Marks
7 months
@RichardMCNgo This definitely doesn't ring true of the time I spent in number theory. The feeling was that we were in an era of rapid progress. From first principles, I'd expect the same of most fields: practitioners always feel like progress is rapid and exciting.
2
0
3
@saprmarks
Samuel Marks
1 month
@RichardMCNgo @JeffLadish @jeremyphoward @trevposts "Bayesians aren’t meant to care about how the evidence was gathered"—I'm confused where this claim is coming from. I've never heard anyone say this before, and I've heard various self-identifying Bayesians emphasize the opposite. E.g. this post[] from the
0
0
3
@saprmarks
Samuel Marks
6 months
@norabelrose @joshua_clymer Without (2), IB arguments will move me some, but not much, even if IBs are well-understood at a basic level. E.g. if you claim "NNs will learn [specific algorithm] to do subject-verb agreement across relative clauses because [IB argument]" I'll update a little, but not much.
0
0
2
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 @MaxNadeau_ It seems that this feature is moving the information "plural subject" to the end of adjectival phrases which modify the subject. But this doesn't match all the examples in my screenshot. What's up? In the counterexamples, the model gets confused by plural words in the vicinity!
1
0
3
@saprmarks
Samuel Marks
6 months
@gabemukobi @laurolangosco @ohabryka @AnthropicAI ("commitment" is arguably a bad word for something that no one ever actually committed to)
1
0
3