Arthur Conmy

@ArthurConmy

1,990
Followers
1,087
Following
25
Media
327
Statuses

Aspiring 10x reverse engineer @GoogleDeepMind

London, England
Joined August 2021
Pinned Tweet
@ArthurConmy
Arthur Conmy
3 months
I have more Sparse Autoencoder research ideas than I’ve ever had with any other topic. We hope you can try new projects with these open SAEs!
@NeelNanda5
Neel Nanda
3 months
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: an open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work.
18
147
979
1
0
88
@ArthurConmy
Arthur Conmy
10 months
Excited to announce that I’ve joined @GoogleDeepMind 's scalable alignment team, scaling interpretability!
Tweet media one
16
7
507
@ArthurConmy
Arthur Conmy
1 year
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
@ArthurConmy
Arthur Conmy
3 months
fuck sake, just lost a $50 bet from July 2022 with @MichaelTrazzi that AI wouldn’t get an IMO silver before 2025. It got one point off a gold…
@GoogleDeepMind
Google DeepMind
3 months
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈 It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵
301
1K
5K
5
3
231
@ArthurConmy
Arthur Conmy
7 months
How much can you steal from an LLM API that returns logprobs? 🧵 In our new paper, collaborators noticed that the LLM vocab size is always bigger than the hidden dimension, so logprobs lie inside a hidden-dimension-sized subspace, which lets us steal that dimension.
Tweet media one
6
17
204
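A minimal sketch of the observation in the thread above (toy tensors and shapes, not the paper's code): every full logit vector is W_U · h, so a stack of many logit vectors has numerical rank at most the hidden dimension, and counting the large singular values recovers it.

```python
# Toy demonstration (assumed shapes, not the paper's code): stacked logit vectors
# have numerical rank equal to the hidden dimension, which an attacker can count.
import torch

vocab, d_model, n_queries = 8_192, 256, 1_024
W_U = torch.randn(vocab, d_model)         # unknown unembedding matrix
hidden = torch.randn(n_queries, d_model)  # hidden states from n_queries prompts
logits = hidden @ W_U.T                   # what full-logprob access effectively exposes

s = torch.linalg.svdvals(logits)
estimated_d_model = int((s > 1e-3 * s[0]).sum())
print(estimated_d_model)  # ~256: the hidden dimension has been "stolen"
```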
@ArthurConmy
Arthur Conmy
16 days
comforting that Anthropic are working on top of my rushed papers from last year's ICLR while I rush this year's ICLR papers :))
1
1
172
@ArthurConmy
Arthur Conmy
17 days
Newsom: we also need to regulate small models and companies
Pelosi: thanks for not regulating small models and companies
???
@SpeakerPelosi
Nancy Pelosi
17 days
AI springs from California. Thank you, @CAgovernor Newsom, for recognizing the opportunity and responsibility we all share to enable small entrepreneurs and academia – not big tech – to dominate.
186
124
666
7
5
170
@ArthurConmy
Arthur Conmy
5 months
I heard several takes in the past few days that safety folk leaving OpenAI was evidence that they trusted OpenAI and safety was going well. Seems unlikely given this thread.
@janleike
Jan Leike
5 months
I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
48
498
4K
8
4
141
@ArthurConmy
Arthur Conmy
1 year
⚡ACDC was accepted as a *spotlight* at NeurIPS 2023! 📜 Paper (updated today): With @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga
@ArthurConmy
Arthur Conmy
1 year
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
3
7
95
@ArthurConmy
Arthur Conmy
2 months
I’m working with Neel once more mentoring researchers doing mechanistic interpretability work. I’d be grateful if you applied, or passed it on to someone you know who could benefit from this!
@NeelNanda5
Neel Nanda
2 months
The deadline to apply to my and @ArthurConmy 's MATS streams is Aug 30, in 11 days! If you want to transition into mechanistic interpretability research, or accelerate your work if you're already in the field, I'd be excited to get your application. All backgrounds welcome!
3
11
70
2
4
60
@ArthurConmy
Arthur Conmy
6 months
We’re excited to share Gated SAEs: an improvement to Sparse Autoencoder training that scales to 7B parameter models (at least!)
@sen_r
Senthooran Rajamanoharan
6 months
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
Tweet media one
5
26
177
0
3
60
@ArthurConmy
Arthur Conmy
6 months
An update on our work on SAEs. Stay tuned for our upcoming SAE Pareto improvement too… :)
@NeelNanda5
Neel Nanda
6 months
Announcing a progress update from the @GoogleDeepMind mech interp team! Inspired by @AnthropicAI 's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.
Tweet media one
4
41
387
1
3
55
@ArthurConmy
Arthur Conmy
2 months
Sparse Autoencoders have millions of features that often correspond to human concepts: we need automation to label them. @AnthropicAI @AiEleuther both use a smarter model to label the max activations, but here @dmhook + @neverrixx use the model itself (Like @ghandeharioun et al)
@dmhook
Dmitrii Kharlapenko
2 months
We use LLMs' capabilities to explain concepts from their own minds in my and @neverrixx 's abstract SAE features research. Excited to continue our MATS 6.0 work under the mentorship of @NeelNanda5 and @ArthurConmy . More cool stuff to come!
4
8
70
1
3
53
@ArthurConmy
Arthur Conmy
7 months
I'll be supervising Mechanistic Interpretability at this summer's MATS program. Fill in this short Airtable form by Sunday (24th March)! From my mentee and (co-)mentor experience, it's a great program for doing important work understanding how AIs work!
6
7
52
@ArthurConmy
Arthur Conmy
10 months
I appreciate how this work gives a detailed account of mech interp failing, and why. Seems great for getting a realistic picture of what working on challenging research is like. Plus, prior work seems to have been on the right track (e.g. Geva et al., Hernandez et al.)
@NeelNanda5
Neel Nanda
10 months
My first @GoogleDeepMind project: How do LLMs recall facts? Early MLP layers act as a lookup table, with significant superposition! They recognise entities and produce their attributes as directions. We suggest viewing fact recall as a black box making "multi-token embeddings”
Tweet media one
7
154
1K
0
2
49
@ArthurConmy
Arthur Conmy
3 months
Our GDM mech interp team just gave access to SAEs for all sites and layers in Gemma-2 9B to people who filled in the form. Come use @lieberum_t , @sen_r , and the team's great work!
@NeelNanda5
Neel Nanda
3 months
I'm very excited for the open source release! SAEs are a very promising technique, but there's a LOT of work to do and we need the help. But they're expensive to train, so most work happens in industry labs or on tiny models. Apply for 9B early access:
1
0
21
1
2
40
@ArthurConmy
Arthur Conmy
1 year
I recently finished MATS in Neel's stream and highly, highly recommend it. The best mentors are both technically proficient and great at teaching, and Neel has both (look at his research and MechInterp materials!). Neel helped from start to finish with our paper
@NeelNanda5
Neel Nanda
1 year
Do you want to do @ch402 -style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by Nov 10th! I've had a lot of fun with past scholars, and they've done fantastic peer-reviewed work, I'm excited for my next batch!
7
36
177
0
4
40
@ArthurConmy
Arthur Conmy
17 days
notebooklm giveth, and notebooklm taketh away
Tweet media one
Tweet media two
0
1
37
@ArthurConmy
Arthur Conmy
1 year
Want to use ACDC? Use our code (built on top of @NeelNanda5 's TransformerLens) here:
Want to use a notebook? See
Want to dig into the ACDC algorithm and see the results? See the updated arXiv paper
7/N
1
4
36
@ArthurConmy
Arthur Conmy
7 months
This is a great way to do an interpretability investigation. A lot in this paper.
@saprmarks
Samuel Marks
7 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
310
0
3
32
@ArthurConmy
Arthur Conmy
3 months
"Stealing Part of a Production Language Model" won an ICML best paper award! And shoutout to @itay__yona , who led a demonstration of 'stealing' the Pythia-14M unembedding matrix up to orthogonality in the updated arXiv version!
@rsalakhu
Russ Salakhutdinov
3 months
ICML 2024: Best Paper Awards: Florian Tramèr; Gautam Kamath; Nicholas Carlini: Considerations for Differentially Private Learning with Large-Scale Public Pretraining Akbir Khan; John Hughes; Dan Valentine; Laura Ruis; Kshitij Sachan; Ansh Radhakrishnan; Edward Grefenstette;
1
12
56
1
0
32
@ArthurConmy
Arthur Conmy
3 months
One concern about SAE research is that it won't transfer well to instruction-tuned chat models, or will require prohibitively expensive re-training for these new models. Excitingly, we find that SAEs trained on the base model transfer extremely well to a range of IT models.
@Connor_Kissane
Connor Kissane
3 months
New post with @robertzzk , @ArthurConmy , & @NeelNanda5 : Sparse Autoencoders (usually) Transfer between Base and Chat Models! This suggests that models' representations remain extremely similar after fine-tuning.
Tweet media one
3
4
42
0
3
32
@ArthurConmy
Arthur Conmy
17 days
This was another interesting finding from @Connor_Kissane :
- **Base** models trained today refuse like chat models: "I'm sorry, but I cannot provide assistance..."
- This even happened before ChatGPT, but it's a bit different, e.g. Llama-1 says "I'm sorry, I'm not familiar..."
@Connor_Kissane
Connor Kissane
17 days
New post w/ @ArthurConmy & @NeelNanda5 : Base LLMs refuse too! Just at a lower rate than chat models. This implies that refusing harmful requests is not a novel behavior learned during chat fine-tuning, contrary to popular belief.
Tweet media one
8
8
68
2
0
31
@ArthurConmy
Arthur Conmy
2 months
The technical report for Gemma Scope is on arXiv:
@_akhaliq
AK
2 months
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
discuss:
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features.
1
19
119
1
2
28
@ArthurConmy
Arthur Conmy
9 months
This seems like a paper that could slip under the radar of AI safety researchers, but it's an important contribution for monitoring. 👀 In CoT *generation*, logical steps are easier than attribution (extracting info from the prompt), but in CoT *verification*, attribution is easier
@alon_jacovi
Alon Jacovi
9 months
👋 Check out our new paper and benchmark: REVEAL, a dataset with step-by-step correctness labels for chain-of-thought reasoning in open-domain QA 🧵🧵🧵
Tweet media one
3
36
180
2
4
25
@ArthurConmy
Arthur Conmy
4 months
The success of Attention SAEs was a sizable update that sparse linear representations really closely match how LLMs work inside. Congrats @Connor_Kissane @robertzzk ! Training them Just Works, and they improve circuit analysis.
@NeelNanda5
Neel Nanda
4 months
Great post from my scholars @Connor_Kissane & @robertzzk ! SAEs are fashionable, but are they a useful tool for researchers? They are! We find a deeper understanding of the well-studied IOI circuit, and make a circuit analysis tool. $1000 bounty to whoever finds the best circuit!
2
11
73
0
2
26
@ArthurConmy
Arthur Conmy
10 months
I’ll be presenting our interpretability work (w/ coauthors) in just under two hours! Come say hello :)
@farairesearch
FAR.AI
10 months
🌟⚡🔍 #NeurIPS2023 Spotlight Poster: Discover how the ACDC algorithm skillfully identifies essential model components. Join the session on "Towards Automated Circuit DisCovery for Mechanistic Interpretability" on Dec 12, 5:15 PM CST poster #1503 by @AdriGarriga & team.
0
2
6
2
0
24
@ArthurConmy
Arthur Conmy
7 months
The dogit lens 🐶 and ViT interpretability!
@soniajoseph_
Sonia 🌻
7 months
I'm excited to release Prisma, a mechanistic interpretability library for multimodal models like CLIP and ViTs. Incubated at @tyrell_turing 's lab & in collab with @NeelNanda5 . Recent mech interp work has focused on language, but many techniques transfer. Behold, the dogit lens:
Tweet media one
14
80
506
2
1
23
@ArthurConmy
Arthur Conmy
7 months
Regardless, using gradient descent seems a plausible method to steal unembedding matrices from logprobs up to an orthogonal matrix. Could you get it to work?
Paper link:
Important context for using these attacks:
@brianryhuang
Brian Huang
8 months
did they just undercut openlogprobs? 😭
Tweet media one
3
6
105
0
1
23
@ArthurConmy
Arthur Conmy
7 months
I was really impressed when I first saw the alignment team's results in this paper, e.g. the cmd_injection transcript from Appendix D.2 where Gemini 1.0 successfully finds an exploit where bash commands can be injected into URLs 😯
Tweet media one
@tshevl
Toby
7 months
In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
Tweet media one
8
47
235
1
1
23
@ArthurConmy
Arthur Conmy
1 year
“We present a single attention head in GPT-2 Small that has one main role across the entire training distribution … this is the most comprehensive description of the complete role of a component in a language model to date.” Excited to share our new paper!
@calsmcdougall
Callum McDougall
1 year
Can we reverse-engineer LLM components across the whole data distribution? Yes! We find a GPT-2 attention head that "copy suppresses": it responds to naive copying from earlier layers and suppresses it. This is almost the *only* thing the head does! 🤯🧵
Tweet media one
4
59
306
0
1
23
@ArthurConmy
Arthur Conmy
2 months
Tweet media one
1
1
22
@ArthurConmy
Arthur Conmy
8 months
I like this work from colleagues!
1. Gradient-based patching is unreasonably effective
2. Gradient-based patching bakes in assumptions about linearity which *almost* always hold
I think the fix that takes into account that the attention softmax is not linear is cool (post 2 in thread)
@JanosKramar
János Kramár
8 months
New @GoogleDeepmind mech interp research! 🎉 Can we massively speed up the process of finding important nodes in LLMs? Yes! Introducing AtP*, an improved variant of Attribution Patching (AtP) that beats all our baselines on efficiency and effectiveness. 🧵
Tweet media one
3
32
155
0
0
21
@ArthurConmy
Arthur Conmy
1 year
We’ve developed a method that is much faster than my previous work on automatically finding interpretable circuits in models. Great work from @aaquib_syed1 and @can_rager ! Explanation and surprising discoveries in this 🧵
@aaquib_syed1
Aaquib Syed
1 year
🔬 Interpretability could unravel the mysteries of LLMs! Our latest research finds an extremely efficient algorithm that generally outperforms existing work in identifying key subnetworks (circuits). Link . Explanation 👇🧵
Tweet media one
2
25
171
1
1
21
@ArthurConmy
Arthur Conmy
6 months
We just uploaded an update to our Gated SAEs paper to arXiv: We added more points to graphs, ran evals on a fixed set of tokens, and added numbers in the appendices.
Tweet media one
@sen_r
Senthooran Rajamanoharan
6 months
To show this works at scale, we train a range of SAEs (both normal and Gated) up to Gemma 7B on a range of layers and attention, MLP and residual activations, in the process showing it's practical to scale SAEs to 7B. Gated SAEs are consistently a Pareto improvement.
Tweet media one
1
2
11
0
1
20
@ArthurConmy
Arthur Conmy
3 months
Train 24k-feature SAEs on context-length-128 embeddings and they only learn ~50 features that represent position embeddings and nothing else. My guess is that the longer tail of ~150 positional features consists of ANDs of positions and some tokens (L1 training creates AND features).
@bilalchughtai_
Bilal Chughtai 🇵🇸
3 months
Very short informal research post investigating positional SAE features and SAE length generalization.
0
7
71
1
0
20
@ArthurConmy
Arthur Conmy
10 months
So happy this work is out! I think this thread summarizes our paper well :) A really nice mechanistic interpretability result; structure that recurs across model sizes and in different numeric tokens.
@rgould0
Rhys Gould
10 months
How do LLMs represent sequences? We find they use a single attention head to map ‘one’ to ‘two’, ‘Tuesday’ to ‘Wednesday’, and ‘June’ to ‘July’! 🧵➡️Our new mechinterp paper studies these ‘successor heads’, and finds they act on shared numerical structure in token embeddings!
Tweet media one
3
25
149
1
0
19
@ArthurConmy
Arthur Conmy
1 year
We used ACDC (as well as existing methods) to rederive the structures of circuits from previous works with no human in the loop! E.g. Greater-Than ( @michaelwhanna + @coulispi + @A_Variengien ), IOI ( @kevrowan et al.), tracr ( @davlindner et al.), Docstrings ( @sheimersheim + @jettjaniak )
Tweet media one
1
3
18
@ArthurConmy
Arthur Conmy
7 months
We can get more than this! If a model uses a LayerNorm before the unembedding, it projects the hidden state onto a subspace that is one dimension smaller. We can detect this by centering the output (see Appendix B.2), and we show how to steal whether models use LayerNorm/RMSNorm!
Tweet media one
1
0
16
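A sketch of the geometric fact in the tweet above, assuming a standard pre-unembedding LayerNorm (the notation here is mine, not the paper's):

```latex
% LayerNorm centres the hidden state, making it orthogonal to the all-ones direction:
\[
\mathrm{LN}(h) = \gamma \odot \frac{h - \mu(h)\mathbf{1}}{\sigma(h)} + \beta,
\qquad \mathbf{1}^\top\!\left(h - \mu(h)\mathbf{1}\right) = 0 .
\]
% Up to the affine (gamma, beta) terms, LN(h) lies in the (d-1)-dimensional hyperplane
% orthogonal to the all-ones vector, so the logits W_U LN(h) span at most d-1 dimensions.
% RMSNorm skips the centering, so no dimension is lost; measuring this rank gap from
% returned logprobs is what can distinguish LayerNorm from RMSNorm.
```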
@ArthurConmy
Arthur Conmy
9 months
Rob and Connor did a great job on this first post and if you're into Sparse Autoencoders, I'd highly recommend following their ongoing work understanding attention!
@NeelNanda5
Neel Nanda
9 months
Thanks to @AnthropicAI for the shout out to my MATS scholars' new blog post! Connor Kissane and @robertzzk showed that you can train sparse autoencoders on attention layers outputs, and it works great, giving sparse, interpretable features that allow deeper understanding.
Tweet media one
1
11
123
0
0
16
@ArthurConmy
Arthur Conmy
11 months
Cool paper! An interesting empirical question is whether the best jailbreaks on future models will be human-understandable or not. If they are human-understandable, there are a lot more strategies that could plausibly prevent them (e.g models supervising models)
@soroushjp
Soroush Pour
11 months
🧵📣New jailbreaks on SOTA LLMs. We introduce an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, incl. instructions for making meth & bombs.
Tweet media one
17
80
324
1
5
15
@ArthurConmy
Arthur Conmy
7 months
Finally, since Layer/RMS Norm sets activations to lie on a hypersphere, the image of the hidden state in the logprobs is an ellipsoid. Using this fact, we can steal* unembedding matrices up to an orthogonal matrix! *Unfortunately the ellipse equation was not tractable (App. G)!
Tweet media one
1
0
16
@ArthurConmy
Arthur Conmy
10 months
@thesephist @AnthropicAI Btw, I would check some other samples rather than just max activating examples and zero activations (as e.g. in Anthropic’s interface). This illusion effect holds even on diverse datasets (the paper only requires one page of reading iirc)
1
1
15
@ArthurConmy
Arthur Conmy
1 year
Really enjoyed the opportunity to talk about my research interests! Understanding models across their distribution may be possible, but we could fall short of this goal. Aiming at interpreting tasks and using automated tools could be how interpretability helps AI alignment
@labenz
Nathan Labenz
1 year
Excited to finally go deep on mechanistic interpretability – one of the more promising paths to long-term AI safety imho – with @ArthurConmy Listen to learn how "mechinterp" researchers are starting to scale their ability to see inside LLMs
2
0
6
0
3
14
@ArthurConmy
Arthur Conmy
3 months
@TutorVals I'm working on a collaboration proposing a ton more interp research directions. In the meantime:
- A subsection of ideas Neel mostly generated are in the paper (img with some of them attached)
-
-
Tweet media one
0
1
15
@ArthurConmy
Arthur Conmy
16 days
@thesophiaxu PhD decisions are mostly made by profs basically bidding for students and taking them because they personally want to work with the student. So -- maybe if the prof is particularly interested you could get a spot with them? Also, some unis have bad policies (e.g. Oxford and
2
0
15
@ArthurConmy
Arthur Conmy
11 months
@norabelrose @Blueyatagarasu Not helpful. I would guess that almost all ML researchers (besides the few who came from security) use black-box in the uninterpretable sense. E.g. they named a conference after this usage! ML people are surely the majority audience for your post
1
0
14
@ArthurConmy
Arthur Conmy
1 year
Another good and short paper on AI progress and risk that has up-to-date takes on responsible scaling policies (which should be called "evaluation-gated" scaling policies IMO). Made using distill, too!
@geoffreyhinton
Geoffrey Hinton
1 year
New paper: Companies are planning to train models with 100x more computation than today’s state of the art, within 18 months. No one knows how powerful they will be. And there’s essentially no regulation on what they’ll be able to do with these models.
165
728
3K
0
0
13
@ArthurConmy
Arthur Conmy
2 months
@jackclarkSF Gemini 1.0 Ultra (from the arXiv paper, excuse shit phone)? Parameter count not reported though (:
Tweet media one
1
2
14
@ArthurConmy
Arthur Conmy
1 month
@charles0neill I would overall like to see more nonlinear SAE variants, but I think you missed the key upside of linear SAEs -- linear representations are a lot more usable in pretty much all applications: steering vectors / subspaces for unlearning / circuit analysis with linear attribution
1
0
13
@ArthurConmy
Arthur Conmy
7 months
@joshua_clymer I like this! Some more I've thought of:
- Doing grunt work well is usually better than doing super clever things (scaling >>> clever architectures)
- A lot of tasks can be started by pattern matching to existing work (few-shot learning)
- Update beliefs w/ weighted averages (Adam)
1
0
12
@ArthurConmy
Arthur Conmy
8 months
Wow, these new "Sydney" jailbreaks are wild
Tweet media one
1
0
12
@ArthurConmy
Arthur Conmy
7 months
Happy to have supervised the continuing effort pushing Attention SAEs to the limit!
@NeelNanda5
Neel Nanda
7 months
Cool post from my MATS scholars @Connor_Kissane & Rob Krzyzanowski: We Inspected Every Head in GPT-2 Small With SAEs So You Don't Have To! The features in an attn output most aligned with a head let you get the head's "vibe". Rob, the madman, looked through all 144 heads!
Tweet media one
1
14
91
0
0
12
@ArthurConmy
Arthur Conmy
3 months
@teortaxesTex @MichaelTrazzi I've only been at GDM for 10 months lol
0
0
11
@ArthurConmy
Arthur Conmy
1 year
@michaelwhanna @coulispi @A_Variengien @kevrowan @davlindner @sheimersheim @jettjaniak How does ACDC automate interpretability? We consider Layer 0’s effect on Layer 1 independently of the effect Layer 0 has on Layer 2 (as introduced in ). Now we can measure the effect of any subset of edges (a circuit) in a neural network 3/N
1
0
10
@ArthurConmy
Arthur Conmy
10 months
@StephenLCasper
<- explaining neurons leads to advexes
<- mech interp generates hypotheses with good predictive power
<- a method that gets poor accuracy on neurons works really well on SAE features
2
0
11
@ArthurConmy
Arthur Conmy
1 year
@Simeon_Cps If the model ‘thinks’ the probability of answer A is p and the probability of answer B is 1-p, then to minimise E[-logprobs(correct)] it should put probability p on A and 1-p on B …
2
0
10
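A one-line check of the claim in the reply above, written out in LaTeX (my notation): with true belief p on A, reporting q on A gives expected loss L(q) = -p log q - (1-p) log(1-q), which is minimised at q = p.

```latex
\[
\frac{dL}{dq} = -\frac{p}{q} + \frac{1-p}{1-q} = 0
\;\Longleftrightarrow\; p(1-q) = q(1-p)
\;\Longleftrightarrow\; q = p .
\]
```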
@ArthurConmy
Arthur Conmy
5 months
@aryaman2020 I'm currently interested in validating that SAE features are causally relevant intermediate variables, extending our early evidence here: This bottoms out in doing a real task on a model (activation steering) so this doesn't kick the can down the road.
3
1
11
@ArthurConmy
Arthur Conmy
5 months
1
0
10
@ArthurConmy
Arthur Conmy
10 months
@JacquesThibs Nice thought. My work has made me feel it’s more likely that intelligence and the “real source of capabilities” is in a huge stack of heuristics no one human could comprehend…1/2
2
0
10
@ArthurConmy
Arthur Conmy
16 days
Separating task and ordinal with ICA is cool and new, too, building off @euan_ong 's work on separating task and ordinal space
2
0
10
@ArthurConmy
Arthur Conmy
1 year
When writing research code fast I love being able to
1. zip(..., strict=True) # ensures same lengths
2. torch.einsum("bij,ijk->bjk", ...) # ensures correct shape lengths
3. my_tensor.item() # ensures single elements only
what about ensuring tensors aren't silently broadcasting?
1
0
9
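One possible answer to the question above, as a minimal sketch (the helper name is hypothetical, not from the thread): fail loudly whenever two tensors in an elementwise op don't already have identical shapes.

```python
# Hypothetical helper: reject broadcast-compatible-but-unequal shapes before elementwise ops.
import torch

def assert_no_broadcast(a: torch.Tensor, b: torch.Tensor) -> None:
    assert a.shape == b.shape, f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}"

x = torch.randn(4, 1)
y = torch.randn(4, 8)
assert_no_broadcast(x, x)    # fine
# assert_no_broadcast(x, y)  # raises: (4, 1) vs (4, 8), which x * y would silently broadcast
```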
@ArthurConmy
Arthur Conmy
5 months
@S_OhEigeartaigh Probably inspired by Ilya's tweet: "I’m confident that OpenAI will build AGI that is both safe and beneficial"
@theojaffee
Theo
5 months
@liron None of these, except Kokotajlo, specifically cited OpenAI as lacking on x-risk. My bet is that if they really believed there was imminent x-risk, they would either stay at OpenAI and try to steer the ship from the inside, or ignore the NDA and whistleblow.
3
0
33
2
0
8
@ArthurConmy
Arthur Conmy
17 days
Also noticed here and here
@Miles_Brundage
Miles Brundage
17 days
Lol - Newsom’s letter says it is *bad* there’s a carveout for small models (which was intended as a proxy for small companies). Regardless of your views on the bill, CA Democrats do not seem to be trying particularly hard to coordinate + show there was some principle here.
14
15
249
0
0
9
@ArthurConmy
Arthur Conmy
11 months
@joshua_clymer IMO this overrates "research" and conceptual work (unless you only think AI policy is good work?)
* A lot of ML research is packaging an idea into a bunch of experiments and a LaTeX doc, and writing takeaways
* A lot of conceptual work is good synthesis
Both automatable
1
0
9
@ArthurConmy
Arthur Conmy
1 year
It's always good fun to dive deeper into the parts of research projects that are hard to communicate in main.tex. Thanks for the discussion @NeelNanda5 !
@NeelNanda5
Neel Nanda
1 year
New Paper Walkthrough: Automated Circuit Discovery A sign of mechanistic interpretability maturing as a field is methods becoming standardised and automated. @ArthurConmy 's work is a great distillation and initial step! We discuss his work and takeaways.
2
11
81
0
0
9
@ArthurConmy
Arthur Conmy
7 months
On the subject, also check out the awesome "Automatic Discovery of Visual Circuits" paper:
1
1
9
@ArthurConmy
Arthur Conmy
1 year
Joint work with @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga . Thanks to Redwood Research and @farairesearch for support. Towards understanding the capabilities of advanced machine learning models! N/N
0
0
9
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Firstly, I'm pretty sure that OpenAI realises that this method doesn't reveal too much and is expensive in research time and actual money, so their mitigations involved tweaking the logprobs returned rather than nuking logprobs access entirely.
1
0
9
@ArthurConmy
Arthur Conmy
1 year
Underrated concept relevant to AI: Punctuated Equilibria. In AI there tend to be periods of sudden rapid progress. E.g., in April 2022 PaLM, DALL-E 2 and SayCan were announced in the same week. AI progresses in discrete step changes (not linear progress or a single jump):
Tweet media one
1
0
9
@ArthurConmy
Arthur Conmy
1 year
@michaelwhanna @coulispi @A_Variengien @kevrowan @davlindner @sheimersheim @jettjaniak Specifically, we measure edge effects with "patching" experiments. Patching means counterfactually replacing a layer’s output on input 1 (x_orig) with its output on input 2 (x_new). Mechanistic interpretability studies edges bc they represent information flow through models 4/N
Tweet media one
1
0
9
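A minimal sketch of a single patching experiment as described in the tweet above (hypothetical module handle; the actual ACDC code patches individual edges via TransformerLens hooks rather than whole layers):

```python
# Sketch: counterfactually replace one layer's output on x_orig with its output on x_new.
import torch

def patch_layer_output(model, layer, x_orig, x_new):
    cache = {}

    # 1) Cache the layer's output on x_new.
    handle = layer.register_forward_hook(lambda mod, inp, out: cache.setdefault("new", out))
    with torch.no_grad():
        model(x_new)
    handle.remove()

    # 2) Re-run on x_orig, swapping in the cached output (returning a value from a
    #    forward hook overrides the module's output).
    handle = layer.register_forward_hook(lambda mod, inp, out: cache["new"])
    with torch.no_grad():
        patched_logits = model(x_orig)
    handle.remove()

    return patched_logits  # compare against model(x_orig) to measure this layer's effect
```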
@ArthurConmy
Arthur Conmy
2 months
@aryaman2020 I sure hope not. Dude patents everything he does
1
0
9
@ArthurConmy
Arthur Conmy
3 months
@casebash @MichaelTrazzi Yes, and also the AI took days rather than the strict 4.5 hours for the contestants. We didn’t operationalise the exact setup, but it would be unreasonable to suggest Michael hasn’t won the bet now
0
0
9
@ArthurConmy
Arthur Conmy
10 months
@nickcammarata Curious why you think mech interp is the highest leverage field in safety. If you think something big will happen soon, isn't generating evals for dangerous capabilities higher priority than basic science? (I do mech interp, but think about other things I could do sometimes)
1
0
8
@ArthurConmy
Arthur Conmy
7 months
@aryaman2020 @jeffreygwang To be clear, this paper required many independent, original steps since CLIP and particularly Inception don't work nicely in the naive transformer circuits frame. Further, auto-discovering car circuits and actually using circuit discovery for advexes are both awesome!
1
0
8
@ArthurConmy
Arthur Conmy
3 months
@banburismus_ Personally, I'd probably be compelled if someone took and showed that there's some decoder feature f in a residual stream SAE such that all MLP out and attention out SAE decoder features before f have cosine sim <0.75 with f
3
0
8
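The check proposed in the reply above is easy to state in code; here is a minimal sketch with made-up tensors standing in for real SAE decoder weights (the 0.75 threshold comes from the tweet, everything else is hypothetical):

```python
# Does any earlier MLP-out / attn-out decoder feature resemble residual feature f?
import torch
import torch.nn.functional as F

d_model = 2304
f = F.normalize(torch.randn(d_model), dim=0)                 # residual SAE decoder direction
earlier = F.normalize(torch.randn(10_000, d_model), dim=-1)  # stacked earlier-layer decoder rows

max_cos = (earlier @ f).max().item()
print(f"max cosine similarity with f: {max_cos:.3f}")        # < 0.75 would support the claim
```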
@ArthurConmy
Arthur Conmy
7 months
@Connor_Kissane @robertzzk @NeelNanda5 @JBloomAus During MATS, I worked with @calsmcdougall and @starship006_ rigorously interpreting model components across the whole training distribution:
@calsmcdougall
Callum McDougall
1 year
Can we reverse-engineer LLM components across the whole data distribution? Yes! We find a GPT-2 attention head that "copy suppresses": it responds to naive copying from earlier layers and suppresses it. This is almost the *only* thing the head does! 🤯🧵
Tweet media one
4
59
306
1
0
7
@ArthurConmy
Arthur Conmy
2 months
@_xjdr I like to think of the chat sys prompting style <You are an incredibly smart X, and you will help me do Y…> as the continuation of getting base models to simulate the smart output. And thinking about how to get the model back into base model mode is so important for jailbreaks
1
0
8
@ArthurConmy
Arthur Conmy
5 months
Stare deep into the eye of sauron and do not blink.
0
0
8
@ArthurConmy
Arthur Conmy
7 months
@aryaman2020 Don’t forget the model GPT-2 small, which you must load in code by writing “gpt2-small”. The hyphen is always irritating
0
0
8
@ArthurConmy
Arthur Conmy
1 year
We iteratively apply patching through a computational graph of a model to find the edges that have large effects on model performance when patched. We keep these edges so that they make up a sparse computational subgraph (e.g. just 68/32000 edges in the Greater-Than circuit) 5/N
Tweet media one
1
0
7
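Compressed pseudocode for the loop described in this thread (a sketch, not the released implementation; `edges_reverse_topological` and the metric are placeholders):

```python
# ACDC, schematically: sweep edges from outputs to inputs, prune any whose patching
# barely changes the task metric, and keep the rest as the discovered circuit.
def acdc(graph, metric, tau):
    circuit = set(graph.edges)                      # start from the full computational graph
    for edge in graph.edges_reverse_topological():  # outputs first, inputs last
        if abs(metric(circuit) - metric(circuit - {edge})) < tau:
            circuit.discard(edge)                   # edge barely matters: prune it
    return circuit                                  # sparse subgraph, e.g. 68 of 32000 edges
```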
@ArthurConmy
Arthur Conmy
5 months
@aryaman2020 @StephenLCasper 's criticism seems different; it's mostly a claim that SAEs will not be competitive with other techniques. This is a possible outcome of the research here ofc!
0
0
6
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Thirdly, I didn't help on this work because I think this is a crucial message for AI safety (though other authors may think this!). I like doing interesting research :) I'd feel sad if this paper in particular impacts open access to models for safety, so tell me if you see this
1
0
7
@ArthurConmy
Arthur Conmy
10 months
@jxmnop I don’t think “how many open questions are there in field X” tracks “how fast is field X moving” well in general. Also I read all of these, and only three of the ~20 (long text, limits of training data, multimodality) seem plausible AGI blockers in a post-GPT4 world
1
0
6
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Secondly, I do think that there are extensions of this work that could steal many more facts. I have three important facts about models in mind that are plausibly stealable but won't mention them in case we build on this work (but will say in a month if nothing happens!)
1
0
7
@ArthurConmy
Arthur Conmy
5 months
@idavidrein @typedfemale Thanks for addressing criticism, but why do you think that "... you can just use that fraction [estimated correct evaluation datapoints as a fraction] as a ceiling that you don’t measure performance beyond"? An eval with 74% signal and 26% noise will become bad before 74% acc ...
1
0
5
@ArthurConmy
Arthur Conmy
1 year
There's a lot of discussion about adversarial attacks on generative models (such as prompt injection). It turns out that there exist adversarial attacks against real-world RL policies
@farairesearch
FAR.AI
1 year
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
470
1
0
7
@ArthurConmy
Arthur Conmy
1 year
I’m at my first in-person conference 🇷🇼! DM if you’d like to talk about Interpretability in the Wild, follow-up work or your thoughts on how to understand LLMs - I’d love to meet!
0
0
7
@ArthurConmy
Arthur Conmy
6 months
@ericjmichaud_ Most of our (current) compute is spent on getting internal LLM activations. I haven’t benchmarked how fast this is on GPUs. We give batch sizes and training steps in the paper (total = ~1 billion activations)
3
0
7
@ArthurConmy
Arthur Conmy
8 months
@sebkrier I would wait till the full version of the paper is out. It looked very rushed and the README shows a bunch of cases where their evals have false negatives. Another issue may be that the instruct-tuned models are formatting outputs in a way that their eval does not understand.
1
0
6
@ArthurConmy
Arthur Conmy
1 year
We're in the early stages of a sudden rush of AI governance progress. Several open letters/papers, all the big AI companies commenting on RSPs, ai dot gov launch, US AI safety institute launch, and the AI safety summit is just starting!
1
0
7
@ArthurConmy
Arthur Conmy
1 year
Where does ACDC fit into automated interpretability work? @OpenAI used LMs to understand components in smaller LMs. @atticus_geiger and @ZhengxuanZenWu 's DAS can test whether model components implement causal models. ACDC could find the components these methods test! 6/N
Tweet media one
Tweet media two
1
0
7
@ArthurConmy
Arthur Conmy
9 months
Great paper! There is so much structure inside neural networks.
@wesg52
Wes Gurnee
9 months
New paper! "Universal Neurons in GPT2 Language Models" How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
Tweet media one
7
67
407
0
0
5
@ArthurConmy
Arthur Conmy
1 year
This was great fun @labenz @CogRev_Podcast ! Keep an eye out for the release 👀
@labenz
Nathan Labenz
1 year
So excited to interview @ArthurConmy about mechanizing mechanistic interpretability for @CogRev_Podcast tomorrow that I wrote 1100+ words of questions. Let me know what I missed 😂
0
0
4
0
1
6
@ArthurConmy
Arthur Conmy
16 days
1
0
6