Samuel Marks Profile
Samuel Marks

@saprmarks

1,173
Followers
85
Following
34
Media
218
Statuses

AI safety research @AnthropicAI . Prev postdoc in LLM interpretability with @davidbau , math PhD at @Harvard , director of technical programs at

Boston
Joined October 2023
Pinned Tweet
@saprmarks
Samuel Marks
5 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
308
@saprmarks
Samuel Marks
3 months
I had the great pleasure of learning about this about 30 mins before the rest of the world, when I arrived today for my first day of work at @AnthropicAI and Jan was sitting next to me.
@janleike
Jan Leike
3 months
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
370
524
9K
3
1
384
@saprmarks
Samuel Marks
10 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
306
@saprmarks
Samuel Marks
8 months
I've made a post here about a surprising observation about LLM representations: LLMs seem to linearly represent XORs of arbitrary features, even when there's no reason to do so. I also write about the consequences this has for interp research
7
26
259
@saprmarks
Samuel Marks
2 months
@ericneyman (I got this question wrong.)
2
0
183
@saprmarks
Samuel Marks
4 months
When working on interpretability research, I face two main practical challenges:
1. Tooling: I sometimes feel like a neurosurgeon with dull scalpels
2. Models too big to fit on my machine
I'm excited about nnsight & NDIF, a pair of projects aimed at addressing these problems! 🧵
Tweet media one
Tweet media two
1
11
117
@saprmarks
Samuel Marks
27 days
Spotted in a paper draft. Now I can't stop fantasizing about writing Marks & Engels (2024). Want to team up @JoshAEngels ?
Tweet media one
3
1
92
@saprmarks
Samuel Marks
2 months
Idea: make sure AIs never learn dangerous knowledge by censoring their training data
Problem: AIs might still infer censored knowledge by "connecting the dots" between individually benign training documents!
@j_treutlein and @damichoi95 formalize this phenomenon as "inductive
Tweet media one
@OwainEvans_UK
Owain Evans
2 months
New paper, surprising result: We finetune an LLM on just (x,y) pairs from an unknown function f. Remarkably, the LLM can: a) Define f in code b) Invert f c) Compose f —without in-context examples or chain-of-thought. So reasoning occurs non-transparently in weights/activations!
Tweet media one
33
213
2K
1
7
70
@saprmarks
Samuel Marks
4 months
Constellation -- an AI safety research center in Berkeley, CA -- is launching two new programs!
* Visiting Fellows: 3-6 months visiting (w/ travel, housing, & office space covered)
* Constellation Residency: 1yr salaried position
1
6
51
@saprmarks
Samuel Marks
6 months
Here's a feature from a Pythia-70m attention head found with an SAE and noticed by my collaborator @amuuueller a few months ago. Months later, it's still my favorite feature. Can you figure out what it's doing?
Tweet media one
2
6
44
@saprmarks
Samuel Marks
1 month
Many people cite "Interpretability for finding adversarial inputs" as the most plausible path-to-impact for interpretability. So I'm very excited to see @loganriggssmith & @BrinkmannJannik 's work using interp to red-team preference models!
1
3
41
@saprmarks
Samuel Marks
10 months
One fun tidbit is that this linear structure emerges over layers: true and false statements separate earlier for simple statements, and later for complicated ones (e.g. "It is the case both that the city of Chicago is in Madagascar and that the city of Beijing is in China"). 3/N
1
2
40
@saprmarks
Samuel Marks
10 months
By the way, if you like these visualizations, you can explore these datasets for yourself at our interactive dataexplorer: 4/N
1
2
35
@saprmarks
Samuel Marks
2 months
Alice: RL agents might learn to directly hack their reward function to always output +1000
Bob: Is that even possible? Why would the agent explore such a crazy action in the first place?
This paper: b/c of generalization from easier-to-discover forms of specification gaming
@AnthropicAI
Anthropic
2 months
New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here:
Tweet media one
22
184
980
0
3
36
@saprmarks
Samuel Marks
8 months
My two top takeaways are: if models linearly represent XORs of arbitrary features, then... 1. this implies qualitatively new reasons that linear probes might fail to generalize. E.g. probe directions can be affected by irrelevant features which are *constant* in training data!
1
1
30
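As a synthetic illustration of the generalization failure described above (this toy demo is not from the post itself; the directions, dimensions, and data are all made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    d = 32
    # Three (roughly orthogonal) directions: feature a, irrelevant feature b, and a XOR b.
    e_a, e_b, e_ab = rng.standard_normal((3, d))

    def make_data(n, b_value):
        a = rng.integers(0, 2, n)
        b = np.full(n, b_value)
        x = (np.outer(a, e_a) + np.outer(b, e_b) + np.outer(a ^ b, e_ab)
             + 0.1 * rng.standard_normal((n, d)))
        return x, a

    # Train a probe for feature a while b is *constant* (b = 0), so a XOR b == a.
    X_tr, y_tr = make_data(2000, b_value=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # In distribution the probe looks perfect, but when b flips, the a-XOR-b direction
    # it picked up now points the wrong way and accuracy typically drops sharply.
    print("b = 0 accuracy:", probe.score(*make_data(2000, b_value=0)))
    print("b = 1 accuracy:", probe.score(*make_data(2000, b_value=1)))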
@saprmarks
Samuel Marks
9 months
This is one of my favorite AI safety papers of this year! The authors train an LLM to imitate the systematic mistakes of a user (Bob, who is bad at addition). Then they show that, even though the LLM lies to Bob, we can still extract truthful answers from its internals.
@alextmallen
Alex Mallen
9 months
New @AiEleuther paper on Eliciting Latent Knowledge! We finetune "quirky" LMs to make systematic errors on arithmetic problems when "Bob" is in the prompt, and show that we can recover their knowledge of the correct answer from their activations. 📄
Tweet media one
3
19
157
1
1
31
@saprmarks
Samuel Marks
4 months
Can you figure out how many interacting circuits are involved in a behavior just by looking at loss curves? Maybe! In this cool paper, @Aaditya6284 et al. study the emergence of circuits in isolation by retraining models with certain activations "clamped" to post-training values
@Aaditya6284
Aaditya Singh
5 months
In-context learning (ICL) circuits emerge in a phase change... Excited for our new work "What needs to go right for an induction head (IH)?" We present "clamping", a method to causally intervene on dynamics, and use it to shed light on IH diversity + formation. Read on 🔎⏬
2
44
198
0
3
30
@saprmarks
Samuel Marks
10 months
For example, before neurosurgery, our LLM was 72% confident that the statement "The Spanish word 'uno' means 'floor'" is false. But by identifying where the LLM stores this info and overwriting it, we can make the LLM 70% confident that the statement is true! 9/N
Tweet media one
1
1
29
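Mechanically, interventions of this kind amount to shifting the residual stream along the identified truth direction at some layer. A rough PyTorch-hook sketch under that assumption (the layer index, scale, and direction `theta` are placeholders, not the paper's exact recipe):

    import torch

    def make_truth_edit_hook(theta: torch.Tensor, alpha: float):
        """Forward hook that nudges hidden states along a candidate truth direction."""
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * theta  # push "false" representations toward "true"
            return ((hidden,) + tuple(output[1:])) if isinstance(output, tuple) else hidden
        return hook

    # Hypothetical usage on a HuggingFace decoder layer:
    # handle = model.model.layers[12].register_forward_hook(make_truth_edit_hook(theta, alpha=5.0))
    # ...run the model on "The Spanish word 'uno' means 'floor'" and read off its confidence...
    # handle.remove()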
@saprmarks
Samuel Marks
5 months
Thrilled to release this preprint (along with my wonderful coauthors!). Stay tuned for our paper thread. And thanks to @StephenLCasper for loudly and insistently advancing his arguments that MI should have use cases -- they were very influential on this work!
@StephenLCasper
Cas (Stephen Casper)
5 months
I've complained about mech interp for years, but I have no complaints about this. If circuits-style MI is going to be useful and competitive for real-world diagnostics/debugging (a big 'if'), I think this is the kind of work that may make this possible.
3
26
170
0
3
27
@saprmarks
Samuel Marks
10 months
Step 1 in our investigation is building datasets of simple, unambiguous true/false statements and visualizing LLM representations of these statements. We see clear linear structure, with true statements separating from false ones! 2/N
Tweet media one
1
1
28
@saprmarks
Samuel Marks
22 days
We want SAEs which interpretably capture "everything the model knows" about its input. But how do we get ground truth on "everything the model knows"? Fun paper with @a_karvonen Ben Wright @can_rager & others! We study board-game-playing LLMs; these models must mentally track an
@a_karvonen
Adam Karvonen
22 days
Our paper—which we presented as an 📢 Oral Presentation 📢 at the ICML Mech Interp workshop—swaps natural language for board games, a setting where we can evaluate Sparse Autoencoders (SAEs) more rigorously! We also introduce p-annealing, a new SAE training method. 🧵
Tweet media one
3
12
72
1
2
26
@saprmarks
Samuel Marks
10 months
Given these observations - and following prior work by @CollinBurns4 , @JacobSteinhardt , and others - we ask: do LLMs represent the truth or falsehood of statements along a single "truth direction"? We provide evidence that the answer is yes! (And set SoTA on finding it) 5/N
Tweet media one
1
4
26
@saprmarks
Samuel Marks
10 months
E.g. a truth direction found using only statements of the form "x is larger/smaller than y" is 97% accurate at classifying statements about Spanish-English translation, like "The Spanish word 'gato' means 'cat'."! 7/N
1
1
25
@saprmarks
Samuel Marks
5 months
How do we discover circuits on these sparse features? We fold sparse autoencoders into the LM’s computation, and use attribution patching to quickly estimate each feature’s contribution to the LM’s output.
2
2
24
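The attribution-patching step is a first-order approximation of each feature's effect. A minimal sketch of that estimate, assuming clean/patched SAE feature activations and the metric's gradient have already been collected (tensor names here are illustrative):

    import torch

    def attribution_scores(f_clean: torch.Tensor,
                           f_patch: torch.Tensor,
                           grad_clean: torch.Tensor) -> torch.Tensor:
        """Linear estimate of each SAE feature's effect on the output metric.

        f_clean, f_patch: feature activations on the clean / patched input, shape [n_features].
        grad_clean: gradient of the metric w.r.t. the clean feature activations, shape [n_features].
        """
        # effect_i ≈ dmetric/df_i * (f_patch_i - f_clean_i)
        return grad_clean * (f_patch - f_clean)

    # Features with the largest |score| become candidate nodes in the circuit.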
@saprmarks
Samuel Marks
8 months
2. Interp work needs to either (a) explain how these XOR feature directions are different from the rest, or (b) be robust to an exponential increase in the number of features in models.
2
1
22
@saprmarks
Samuel Marks
10 months
Rather, we show that a dead-simple alternative works better: just take the direction pointing from the mean of the false datapoints to the mean of the true datapoints. That's it! We show that these "mass-mean" directions work better than LR, especially for neurosurgery. 13/N
1
1
23
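A minimal sketch of the mass-mean recipe described here, assuming a matrix of residual-stream activations with binary truth labels (variable names are illustrative):

    import torch

    def mass_mean_direction(acts: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """acts: [n_statements, d_model]; labels: [n_statements] bool, True for true statements."""
        mu_true = acts[labels].mean(dim=0)
        mu_false = acts[~labels].mean(dim=0)
        return mu_true - mu_false  # points from the false mean toward the true mean

    def predict_true(act: torch.Tensor, direction: torch.Tensor, threshold: float = 0.0) -> bool:
        """Classify a single statement's activation by projecting onto the direction."""
        return (act @ direction).item() > threshold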
@saprmarks
Samuel Marks
10 months
What evidence? And what does it even mean to be a "truth direction"? We show two things: 1. Directions extracted from one true/false dataset are accurate for classifying true/false statements from structurally and topically diverse datasets. 6/N
1
1
22
@saprmarks
Samuel Marks
4 months
In a new blog post, I explain how our work in Sparse Feature Circuits fits into a broader AI safety agenda. I also introduce a concrete problem, Discriminating Behaviorally Identical Classifiers (DBIC), for guiding future interpretability work.
@saprmarks
Samuel Marks
5 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
308
1
3
21
@saprmarks
Samuel Marks
10 months
In many ways, this paper could be viewed as a companion paper to @wesg52 and @tegmark 's recent paper on LLM space/time representations. Both ask "Do LLMs have world models? And how do we find them?" Wes focuses on space and time, and I focus on truth. 15/N
@wesg52
Wes Gurnee
11 months
Do language models have an internal world model? A sense of time? At multiple spatiotemporal scales? In a new paper with @tegmark we provide evidence that they do by finding a literal map of the world inside the activations of Llama-2!
182
1K
6K
2
1
20
@saprmarks
Samuel Marks
10 months
For me, the most exciting part is our progress on extracting a truth direction from labeled true/false datasets. But first, I hear the skeptics: "LMs are just statistical engines with no notion of truth! You're probably detecting text which is likely/unlikely, not true/false"
1
1
20
@saprmarks
Samuel Marks
10 months
Okay, let's talk truth direction extraction! The most common way researchers have done this is with logistic regression. But based on geometrical issues arising from the superposition hypothesis of @ch402 and others, we think logistic regression is actually quite bad here! 12/N
Tweet media one
1
1
20
@saprmarks
Samuel Marks
4 months
You're given two models: one is aligned, and one pretends to be aligned when watched. Can you tell the difference? Cool work by @joshua_clymer @kh4dien and @sevdeawesome ! (Aside: I'm proud to have suggested the top-performing "get the models drunk" method to the authors!)
@joshua_clymer
Joshua Clymer
4 months
Can we catch models that fake alignment by examining their internals? Our new paper trains 'sleeper agents' that strategically pretend to be safe and finds methods to unmask them.
Tweet media one
7
13
96
0
1
20
@saprmarks
Samuel Marks
4 months
As new interp techniques applied to larger models produce an explosion in human-interpretable data for analysis, we'll need new ways to scale up our interp workforce. IMO @cogconfluence et al are on exactly the right track here.
@cogconfluence
Sarah Schwettmann
4 months
MAIA (A Multimodal Automated Interpretability Agent) is here! 🧵 📝New paper: 🌐Website: Agents like MAIA advance automated interpretation of AI systems from one-shot feature description into an interactive regime where hypotheses
5
41
148
0
0
17
@saprmarks
Samuel Marks
10 months
2. Even better, we can use our identified truth directions to perform **LLM neurosurgery**, getting our LLM to treat false statements as true and vice versa. 8/N
1
1
16
@saprmarks
Samuel Marks
5 months
One more thank you: thanks to the Neuronpedia team @JBloomAus and @johnnylin ! Going from my first message to them ("Crazy ask, but...") to hosting our SAEs a few days later, they moved incredibly quickly
2
0
16
@saprmarks
Samuel Marks
10 months
BTW, I should shout out the excellent prior work by @ke_li_2021 and @_oampatel_ . Their "Inference-Time Intervention" paper was a major source of inspiration (and lent the name to "mass-mean probing")
1
1
16
@saprmarks
Samuel Marks
10 months
This is a reasonable concern, and we check for it in a few ways. 1. Constructing datasets in which true text diverges from probable text. E.g. our LLM judges that "China is NOT a country in" most likely ends with "Asia." (See paper for more.) 2. Our neurosurgery experiments from above.
2
1
16
@saprmarks
Samuel Marks
10 months
One fun obstacle to extracting truth directions: truth directions you get from different datasets sometimes look very different (e.g. are orthogonal!) Our experiments suggest why: confounding features that correlate inconsistently with truth. Solution: use more diverse data 14/N
Tweet media one
2
1
14
@saprmarks
Samuel Marks
4 months
Let's start with nnsight. You can think of nnsight as an easier-to-read-and-write alternative to pytorch hooks. E.g., see the difference between saving an MLP activation with forward hooks vs. nnsight. The nnsight code is minimal and elegant, a tool that doesn't get in my way.
Tweet media one
1
0
14
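A rough sketch of the comparison, using GPT-2 for concreteness (the nnsight half assumes the LanguageModel/trace API; exact syntax may differ between nnsight versions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Plain PyTorch: save one MLP activation with a forward hook.
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tok = AutoTokenizer.from_pretrained("gpt2")
    saved = {}

    def save_mlp(module, inputs, output):
        saved["mlp_5"] = output.detach()

    handle = model.transformer.h[5].mlp.register_forward_hook(save_mlp)
    with torch.no_grad():
        model(**tok("Hello world", return_tensors="pt"))
    handle.remove()

    # nnsight (assumed API): the same thing reads as a straight-line trace.
    # from nnsight import LanguageModel
    # lm = LanguageModel("gpt2")
    # with lm.trace("Hello world"):
    #     mlp_5 = lm.transformer.h[5].mlp.output.save()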
@saprmarks
Samuel Marks
5 months
Some interactive demos:
🔍 Browse auto-discovered model behaviors and circuits:
🧠 Explore our SAE features on Neuronpedia:
E.g. see this Neuronpedia profile for one of our fav features!
Tweet media one
@saprmarks
Samuel Marks
6 months
Here's a feature from a Pythia-70m attention head found with an SAE and noticed by my collaborator @amuuueller a few months ago. Months later, it's still my favorite feature. Can you figure out what it's doing?
Tweet media one
2
6
44
1
0
14
@saprmarks
Samuel Marks
5 months
Lots of prior work on robustness to spurious signals, from @polkirichenko , @Pavel_Izmailov , @shiorisagawa , @OrgadHadas , @norabelrose , @ravfogel , @AdtRaghunathan , & many more Our work is different: it works even if the spurious & intended signals are equally predictive of labels
1
0
13
@saprmarks
Samuel Marks
5 months
But LMs have many difficult-to-anticipate behaviors and mechanisms. Can we get circuits for those as well? Yes! By combining our method with @ericjmichaud_ / @tegmark ’s quanta discovery methods, we automatically discover thousands of feature circuits from raw text data.
@ericjmichaud_
Eric J. Michaud
1 year
Our method, which we call QDG for "quanta discovery from gradients" is based on spectral clustering with model gradients. With QDG, we auto-discover a variety of capabilities/behaviors of a small language model. Here are a couple of clusters:
Tweet media one
1
0
14
1
1
13
@saprmarks
Samuel Marks
2 months
@RichardMCNgo Biggest complaints:
1. Really needs to have better fast travel
2. Impending deadline gives a stressful Sword of Damocles feel, discourages side quests
3. Illness: adds absolutely nothing, who requested this feature??
1
0
13
@saprmarks
Samuel Marks
5 months
See our project website, paper, and github repo for more! 🌐 📜 🧑‍💻
2
0
13
@saprmarks
Samuel Marks
5 months
Feature circuits consist of fine-grained, interpretable units, and are useful downstream! For example, we can fully debias a profession classifier, even if profession and gender are perfectly correlated in all our classification data.
1
1
11
@saprmarks
Samuel Marks
5 months
There’s a lot of great work from @ch402 , @NeelNanda5 , @ArthurConmy , @rowankwang et al. on discovering LM circuits. But these circuits are difficult to apply b/c they localize LM behaviors to MLPs/attn heads/neurons, which are difficult to interpret and play many roles at once
1
0
11
@saprmarks
Samuel Marks
5 months
Instead, we follow @HoagyCunningham @TrentonBricken et al., and use sparse autoencoders to automatically discover meaningful directions to build circuits with.
@AnthropicAI
Anthropic
11 months
Dictionary learning works! Using a "sparse autoencoder", we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text. 📄
Tweet media one
6
99
871
1
0
11
@saprmarks
Samuel Marks
5 months
What “interpretable units” are our circuits built from? If we knew what the “right concepts” were ahead of time, we could use methods like DAS (from @ChrisGPotts @noahdgoodman @ZhengxuanZenWu ) to find corresponding directions in our LM’s latent space. But we don’t!
@ZhengxuanZenWu
Zhengxuan Wu
1 year
📣How does 🔥Alpaca🦙 follow your instructions? Mechanistic interpretability at scale – our new paper identifies the causal mechanisms the Alpaca 7B model uses to solve simple reasoning tasks (with Atticus Geiger, @ChrisGPotts , and @noahdgoodman !) Paper:
Tweet media one
2
60
251
1
0
11
@saprmarks
Samuel Marks
5 months
Finally, thanks to @loganriggssmith and @BrinkmannJannik for help training SAEs and to @StephenLCasper and @NeelNanda5 among others for useful discussion!
1
0
10
@saprmarks
Samuel Marks
8 months
Much more in the post.
0
1
10
@saprmarks
Samuel Marks
5 months
For example, here’s a circuit we discovered for subject-verb agreement. The features in the circuit are easy to interpret and illustrate an intuitive algorithm.
Tweet media one
1
0
10
@saprmarks
Samuel Marks
4 months
FAQ: Is this meant to replace @NeelNanda5 's TransformerLens? A: No! nnsight tries to provide a general-purpose, model-agnostic interpretability API. TL tries to provide a unified approach for interp on transformers specifically. I view these as mostly complementary goals.
1
0
8
@saprmarks
Samuel Marks
4 months
NDIF currently hosts five models, including two Llama-3 models, with plans to add more! If Meta open-sources a 400B+ parameter Llama-3 model, we're hoping NDIF will make it easy for a variety of researchers to study its internals.
Tweet media one
1
0
9
@saprmarks
Samuel Marks
5 months
Feature circuits provide useful insights into automatically discovered LM behaviors. For example, while some behaviors seem like cohesive single mechanisms, circuits-level analysis reveals a union of distinct mechanisms, like multiple “narrow” induction patterns.
Tweet media one
1
0
8
@saprmarks
Samuel Marks
6 months
@joshua_clymer I agree that there are non-model-internals-based sources of evidence about whether models are faking alignment. (In fact, I expect most of our evidence will come from such sources). That said, I don't love your examples here.
2
0
8
@saprmarks
Samuel Marks
10 months
Our alignment scheme for children (+ pets) is quite similar to RLHF: children do stuff, we look at what they do, and we reward them (with e.g. candy, praise) if we like what we see. This scheme suffers from the same issues I expect we'll encounter in aligning AI systems: 1/N
@AndrewYNg
Andrew Ng
10 months
Argh, my son just figured out how to steal chocolate from the pantry, and made a brown, gooey mess. Worried about how hard it is to align AI? Right now I feel I have better tools to align AI with human values than align humans with human values, at least in the case of my 2
139
67
1K
1
1
8
@saprmarks
Samuel Marks
6 months
@amuuueller Answer: it's activating on the final token of relative clauses or prepositional phrases which modify a sentence's subject!
2
0
7
@saprmarks
Samuel Marks
4 months
In fact, nnsight piggybacks on TL's great uniformization work via the UnifiedTransformer class, a crazy frankenstein mixture of nnsight and TL. You can write nnsight code which extracts activations from the standard TL hookpoints!
Tweet media one
1
0
6
@saprmarks
Samuel Marks
4 months
Huge kudos to @jadenfk23 , who is currently the main manpower behind nnsight/NDIF, and never complains when I need tech support at 2am🙂 Thanks to @davidbau for having the vision for these projects. See & to learn more/get involved
0
0
7
@saprmarks
Samuel Marks
4 months
Here’s another example, where I do a concise and readable replication of part of the ROME paper for LLaMA-2-7B, showing that patching the MLP layer 1-5 activations from "Eiffel Tower" to "Colosseum" changes the model's prediction for "The Eiffel Tower is in the city of..."
Tweet media one
1
0
7
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 sometimes talks about his experience discovering high/low frequency detectors in image models: initially, they seemed messy and confusing, but are actually beautiful and structured once you understand what's going on. Just so with this feature :)
1
0
7
@saprmarks
Samuel Marks
10 months
For much, much more see our paper! paper: code: interactive dataexplorer: 16/16
2
1
6
@saprmarks
Samuel Marks
2 months
@RichardMCNgo 1. Navigating the development of a transformative technology is intellectually and strategically stimulating
2. Good replay value: lots of interesting player characters to replay as
3. Satisfying win condition: becoming immortal
4. One Earth-sized planet is a good gameboard size
1
0
6
@saprmarks
Samuel Marks
6 months
@laurolangosco @ohabryka @AnthropicAI It also depends on whether the spirit of the commitment is about (1) not creating hype, (2) not contributing to race dynamics, or (3) not actually advancing the frontier. This arguably does (1) (OTOH the announcement is pretty toned-down IMO)
1
0
6
@saprmarks
Samuel Marks
4 months
Okay, on to NDIF. NDIF is computing infrastructure for remote execution of white-box experiments on large models. You write nnsight code and it gets remotely executed on NDIF. From your perspective, it's as if you had a giant model on your own machine!
@davidbau
David Bau
4 months
I am delighted to officially announce the National Deep Inference Fabric project, #NDIF . NDIF is an @NSF -supported computational infrastructure project to help YOU advance the science of large-scale AI.
Tweet media one
9
62
279
1
0
6
@saprmarks
Samuel Marks
4 months
In cognition-based oversight, we evaluate models based on whether they perform intended cognition, not whether they produce intended outputs. E.g. a possible workflow here is: inspect a model's cognition in search of "red flags." This is how SHIFT works.
Tweet media one
@saprmarks
Samuel Marks
5 months
Feature circuits consist of fine-grained, interpretable units, and are useful downstream! For example, we can fully debias a profession classifier, even if profession and gender are perfectly correlated in all our classification data.
1
1
11
1
1
6
@saprmarks
Samuel Marks
16 days
@RichardMCNgo Those all seem like the same concept to me, but I seem smart in conversations.
0
0
6
@saprmarks
Samuel Marks
4 months
One more example, this one a live piece of research code. In Sparse Feature Circuits, we inserted SAEs into models, but wanted a weird backward pass that e.g. pretended the SAEs weren't there. Solution: use nnsight to intervene on the backward pass!
Tweet media one
1
0
5
@saprmarks
Samuel Marks
4 months
@StephenLCasper Clearly you might worry that this result is an artifact of the way the sleeper agent was trained. But based on "probing has a ton of challenges IRL," it sounds like you think this approach is deficient in a more fundamental way?
1
0
5
@saprmarks
Samuel Marks
9 months
@trevposts My current best guess is that the "sources close to Altman" were strategically leaking false or misleading info. In addition to the reasons you named, the leak probably also helps Sam's bargaining position for whatever the new venture is.
0
0
5
@saprmarks
Samuel Marks
4 months
nnsight supports any huggingface model (in fact, any torch.nn.Module). For example @arnab_api has used nnsight to study Mamba!
@arnab_api
Arnab Sen Sharma
4 months
Then we ask: can we apply ROME to update/insert a fact in Mamba? ROME can be applied directly as well! When applied to different projection matrices in MambaBlock across all the layers, we see that ROME can successfully update/insert a fact in a range of layers.
Tweet media one
1
0
3
1
0
4
@saprmarks
Samuel Marks
1 month
@RichardMCNgo @trevposts @JeffLadish @jeremyphoward The way I would prefer to say this is: we have uncertainty over what experimental procedure the experimenter used (e.g. whether the procedure involved p-hacking), and the researcher's mindset provides evidence about the procedure used.
0
0
5
@saprmarks
Samuel Marks
4 months
For example, suppose I want to repeat my LLaMA-2-7B experiment from before with the 10x larger LLaMA-2-70B. Just swap out "7b" for "70b" and add a "remote=True" argument
Tweet media one
1
0
5
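Concretely, the swap looks roughly like the following, assuming the nnsight LanguageModel API (checkpoint names and the module path are illustrative, and exact syntax may differ between nnsight versions):

    from nnsight import LanguageModel

    # Local run on the 7B model:
    lm = LanguageModel("meta-llama/Llama-2-7b-hf", device_map="auto")
    with lm.trace("The Eiffel Tower is in the city of"):
        mlp_out = lm.model.layers[5].mlp.output.save()

    # Same experiment on the 70B model, executed remotely on NDIF:
    lm = LanguageModel("meta-llama/Llama-2-70b-hf")
    with lm.trace("The Eiffel Tower is in the city of", remote=True):
        mlp_out = lm.model.layers[5].mlp.output.save()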
@saprmarks
Samuel Marks
1 month
@loganriggssmith @BrinkmannJannik FWIW I don't think interp for finding adversarial inputs is the *most impactful* interp application. But I'd guess it's the most likely one to pan out. I also think that research which gets interp some practical wins (even wins not on the most critical path) is worthwhile.
1
0
5
@saprmarks
Samuel Marks
10 months
@MaxNadeau_ @thomasahle I agree that this doesn't require planning, but I think it's at least some intuitive evidence for planning-adjacent cognition. Concretely, I bet that the information that the model intends to output "orange" is already present when the model outputs a/an.
1
0
4
@saprmarks
Samuel Marks
11 months
I had a great time speaking at @MITFutureTech on my recent paper with @tegmark (Twitter thread on the paper coming soon!)
@MITFutureTech
MIT FutureTech
11 months
Samuel Marks discussing research with @tegmark exploring whether language models linearly represent the truth or falsehood of factual statements. #MITAIScaling @MITFutureTech
Tweet media one
0
0
3
0
0
3
@saprmarks
Samuel Marks
9 months
@ch402 When I was teaching linalg + multivariate calculus at Harvard in Spring 2022, **PCA** wasn't even part of the curriculum! I indeed expect this to change, with more data science/DL applications and less differential equations (~half the course when I was teaching!)
0
0
2
@saprmarks
Samuel Marks
6 months
@joshua_clymer 2. Evidence from inductive biases is just pretty weaksauce. If you end up 90%+ confident that models will/won't scheme based on inductive bias arguments alone (and no empirics), then something's gone wrong.
3
0
4
@saprmarks
Samuel Marks
4 months
@StephenLCasper Cool, I think I agree with you about all this. (BTW you might be interested in chatting with Oam and Rowan about a "probing is actually good" project they're working on! They compare probing model internals to classifiers applied to inputs/outputs.)
0
0
4
@saprmarks
Samuel Marks
4 months
@NeelNanda5 @KKumar_ai_plans @daniel_271828 I agree with what Neel wrote here, but worth mentioning that certain types of non-mech-int model internals research could possibly be done via a grey-box API that doesn't pose weight exfiltration risks. E.g. grey-box access is sufficient for most steering vectors work.
2
0
4
@saprmarks
Samuel Marks
4 months
@StephenLCasper Can you say more about what seems like a red flag to you / why you think that simple probing is inadequate in this setting?
2
0
4
@saprmarks
Samuel Marks
10 months
@thomasahle Concrete example that I've found surprises people who take the "LMs are myopic predictors" perspective too far.
Tweet media one
Tweet media two
1
0
4
@saprmarks
Samuel Marks
10 months
@CurtTigges Cool stuff! Re info stored above placeholder tokens: I found something similar in my recent work on truth directions. It generally seems to me that models like to store "metadata" about clauses (e.g. sentiment/truth) over these placeholder tokens.
@saprmarks
Samuel Marks
10 months
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark , we explore how LLMs represent truth. 1/N
10
67
306
1
0
4
@saprmarks
Samuel Marks
9 months
Conceptually speaking, the point is that in the settings I care about, easy-to-hard and truthful-to-untruthful distribution shifts *coincide* (rather than being independent). So I want to win at ELK without having access to labels distinguishing "true" from "what humans believe"
1
0
4
@saprmarks
Samuel Marks
9 months
@DavidSKrueger I disagree that all interpretability work is for assurance. E.g. model editing (for steering), ELK (for scalable oversight), and interp-assisted red-teaming (for robustness). IMO interpretability refers to a class of methods, which can be applied towards a variety of goals.
1
0
3
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 h/t @MaxNadeau_ for noticing the narrative connection to high/low frequency detectors
1
0
3
@saprmarks
Samuel Marks
27 days
@ericneyman @JoshAEngels That's fine. The more difficult constraint is that we would have to be the only two authors...
1
0
2
@saprmarks
Samuel Marks
2 months
@RichardMCNgo 2. You need to argue that being the human-in-particular which best understands the AI is very valuable above-and-beyond an AI which understands AIs better (and can explain it better to humans).
1
0
3
@saprmarks
Samuel Marks
1 month
@StephenLCasper FWIW I'd consider the doctor vs. nurse experiment from SFC to be a solid 3.5. 3 because the task is toy, but 4 because I don't know of any other technique by which you could accomplish the toy task (i.e. it beats all the non-interp baselines). (I like this tier list)
1
0
3
@saprmarks
Samuel Marks
4 months
A key problem in scalable oversight is that we might not be able to distinguish good vs. bad models on the basis of their I/O behavior alone. Assuming we can't rely on I/O, what's left? I think we need to look at model internals. Let's call this **cognition-based oversight**
Tweet media one
1
0
3
@saprmarks
Samuel Marks
2 months
@RichardMCNgo Doesn't seem right to me: 1. Wage increases for professional-athlete-level fitness seems like a contingent result of wealthier societies spending more on leisure, not anything analogous to the situation with AIs surpassing human intelligence
1
0
3
@saprmarks
Samuel Marks
10 months
@ArthurConmy @MaxNadeau_ @thomasahle @gwern I don't think I disagree with you (or Max) about anything here: a purely myopic objective can (and did) give rise to this behavior. That said, some people I've talked to take the point that LMs are trained myopically too far and end up being surprised by this example.
1
0
3
@saprmarks
Samuel Marks
10 months
@lathropa @tegmark It's at the bottom of the thread - here you go
@saprmarks
Samuel Marks
10 months
For much, much more see our paper! paper: code: interactive dataexplorer: 16/16
2
1
6
0
0
3
@saprmarks
Samuel Marks
7 months
@RichardMCNgo This definitely doesn't ring true of the time I spent in number theory. The feeling was that we were in an era of rapid progress. From first principles, I'd expect the same of most fields: practitioners always feel like progress is rapid and exciting.
2
0
3
@saprmarks
Samuel Marks
1 month
@RichardMCNgo @JeffLadish @jeremyphoward @trevposts "Bayesians aren’t meant to care about how the evidence was gathered"—I'm confused where this claim is coming from. I've never heard anyone say this before, and I've heard various self-identifying Bayesians emphasize the opposite. E.g. this post[] from the
0
0
3
@saprmarks
Samuel Marks
6 months
@norabelrose @joshua_clymer Without (2), IB arguments will move me some, but not much, even if IBs are well-understood at a basic level. E.g. if you claim "NNs will learn [specific algorithm] to do subject-verb agreement across relative clauses because [IB argument]" I'll update a little, but not much.
0
0
2
@saprmarks
Samuel Marks
6 months
@amuuueller @ch402 @MaxNadeau_ It seems that this feature is moving the information "plural subject" to the end of adjectival phrases which modify the subject. But this doesn't match all the examples in my screenshot. What's up? In the counterexamples, the model gets confused by plural words in the vicinity!
1
0
3
@saprmarks
Samuel Marks
6 months
@gabemukobi @laurolangosco @ohabryka @AnthropicAI ("commitment" is arguably a bad word for something that no one ever actually committed to)
1
0
3