Arthur Conmy

@ArthurConmy

1,990
Followers
1,087
Following
25
Media
327
Statuses

Aspiring 10x reverse engineer @GoogleDeepMind

London, England
Joined August 2021
Pinned Tweet
@ArthurConmy
Arthur Conmy
3 months
I have more Sparse Autoencoder research ideas than I’ve ever had with any other topic. We hope you can try new projects with these open SAEs!
@NeelNanda5
Neel Nanda
3 months
Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training costs limit research. Announcing Gemma Scope: an open suite of SAEs on every layer & sublayer of Gemma 2 2B & 9B! We hope to enable even more ambitious work.
18
147
979
1
0
88
@ArthurConmy
Arthur Conmy
10 months
Excited to announce that I’ve joined @GoogleDeepMind 's scalable alignment team, scaling interpretability!
Tweet media one
16
7
507
@ArthurConmy
Arthur Conmy
1 year
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
@ArthurConmy
Arthur Conmy
3 months
fuck sake, just lost a $50 bet from July 2022 with @MichaelTrazzi that AI wouldn’t get an IMO silver before 2025. It got one point off a gold…
@GoogleDeepMind
Google DeepMind
3 months
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈 It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵
301
1K
5K
5
3
231
@ArthurConmy
Arthur Conmy
7 months
How much can you steal from an LLM API that returns logprobs? 🧵 In our new paper, collaborators noticed that the LLM vocab size is always bigger than the hidden dimension, so logprobs lie inside a hidden-dimension-sized subspace, which lets us steal that dimension.
Tweet media one
6
17
204
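A minimal sketch of the observation in the thread above (toy tensors and shapes, not the paper's code): every full logit vector is W_U · h, so a stack of many logit vectors has numerical rank at most the hidden dimension, and counting the large singular values recovers it.

```python
# Toy demonstration (assumed shapes, not the paper's code): stacked logit vectors
# have numerical rank equal to the hidden dimension, which an attacker can count.
import torch

vocab, d_model, n_queries = 8_192, 256, 1_024
W_U = torch.randn(vocab, d_model)         # unknown unembedding matrix
hidden = torch.randn(n_queries, d_model)  # hidden states from n_queries prompts
logits = hidden @ W_U.T                   # what full-logprob access effectively exposes

s = torch.linalg.svdvals(logits)
estimated_d_model = int((s > 1e-3 * s[0]).sum())
print(estimated_d_model)  # ~256: the hidden dimension has been "stolen"
```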
@ArthurConmy
Arthur Conmy
16 days
comforting that Anthropic are working on top of my rushed papers from last year's ICLR while I rush this year's ICLR papers :))
1
1
172
@ArthurConmy
Arthur Conmy
17 days
Newsom: we also need to regulate small models and companies
Pelosi: thanks for not regulating small models and companies
???
@SpeakerPelosi
Nancy Pelosi
17 days
AI springs from California. Thank you, @CAgovernor Newsom, for recognizing the opportunity and responsibility we all share to enable small entrepreneurs and academia – not big tech – to dominate.
186
124
666
7
5
170
@ArthurConmy
Arthur Conmy
5 months
I heard several takes in the past few days that safety folk leaving OpenAI was evidence that they trusted OpenAI and safety was going well. Seems unlikely given this thread.
@janleike
Jan Leike
5 months
I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
48
498
4K
8
4
141
@ArthurConmy
Arthur Conmy
1 year
⚡ACDC was accepted as a *spotlight* at NeurIPS 2023! 📜 Paper (updated today): With @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga
@ArthurConmy
Arthur Conmy
1 year
How can we speed up Mechanistic Interpretability? Researchers spend a lot of time searching for the internal model components that matter. We introduce the Automatic Circuit DisCovery (ACDC) ⚡ algorithm! 1/N 🧵
Tweet media one
4
42
299
3
7
95
@ArthurConmy
Arthur Conmy
2 months
I’m working with Neel once more mentoring researchers doing mechanistic interpretability work. I’d be grateful if you applied, or passed it on to someone you know who could benefit from this!
@NeelNanda5
Neel Nanda
2 months
The deadline to apply to my and @ArthurConmy 's MATS streams is Aug 30, in 11 days! If you want to transition into mechanistic interpretability research, or accelerate your work if you're already in the field, I'd be excited to get your application. All backgrounds welcome!
3
11
70
2
4
60
@ArthurConmy
Arthur Conmy
6 months
We’re excited to share Gated SAEs: an improvement to Sparse Autoencoder training that scales to 7B parameter models (at least!)
@sen_r
Senthooran Rajamanoharan
6 months
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
Tweet media one
5
26
177
0
3
60
@ArthurConmy
Arthur Conmy
6 months
An update on our work on SAEs. Stay tuned for our upcoming SAE Pareto improvement too… :)
@NeelNanda5
Neel Nanda
6 months
Announcing a progress update from the @GoogleDeepMind mech interp team! Inspired by @AnthropicAI 's excellent monthly updates, we share a range of updates on our work on Sparse Autoencoders, from signs of life on interpreting steering vectors with SAEs to improving ghost grads.
Tweet media one
4
41
387
1
3
55
@ArthurConmy
Arthur Conmy
2 months
Sparse Autoencoders have millions of features that often correspond to human concepts: we need automation to label them. @AnthropicAI @AiEleuther both use a smarter model to label the max activations, but here @dmhook + @neverrixx use the model itself (Like @ghandeharioun et al)
@dmhook
Dmitrii Kharlapenko
2 months
We use LLMs' capabilities to explain concepts from their own minds in my and @neverrixx 's abstract SAE features research. Excited to continue our MATS 6.0 work under the mentorship of @NeelNanda5 and @ArthurConmy . More cool stuff to come!
4
8
70
1
3
53
@ArthurConmy
Arthur Conmy
7 months
I'll be supervising Mechanistic Interpretability at this summer's MATS program. Fill in this short Airtable form by Sunday (24th March)! From my mentee and (co-)mentor experience, it's a great program for doing important work understanding how AIs work!
6
7
52
@ArthurConmy
Arthur Conmy
10 months
I appreciate how this work gives a detailed account of mech interp failing, and why. Seems great for getting a realistic picture of what working on challenging research is like. Plus, prior work seems to have been on the right track (e.g. Geva et al., Hernandez et al.)
@NeelNanda5
Neel Nanda
10 months
My first @GoogleDeepMind project: How do LLMs recall facts? Early MLP layers act as a lookup table, with significant superposition! They recognise entities and produce their attributes as directions. We suggest viewing fact recall as a black box making "multi-token embeddings”
Tweet media one
7
154
1K
0
2
49
@ArthurConmy
Arthur Conmy
3 months
Our GDM mech interp team just gave access to SAEs for all sites and layers in Gemma-2 9B to people who filled in the form. Come use @lieberum_t , @sen_r , and the team's great work!
@NeelNanda5
Neel Nanda
3 months
I'm very excited for the open source release! SAEs are a very promising technique, but there's a LOT of work to do and we need the help. But they're expensive to train, so most work happens in industry labs or on tiny models. Apply for 9B early access:
1
0
21
1
2
40
@ArthurConmy
Arthur Conmy
1 year
I recently finished MATS in Neel's stream and highly, highly recommend it. The best mentors are both technically proficient and great at teaching, and Neel has both (look at his research and MechInterp materials!). Neel helped from start to finish with our paper
@NeelNanda5
Neel Nanda
1 year
Do you want to do @ch402 -style mechanistic interpretability research? I'm looking for scholars to mentor via MATS - apply by Nov 10th! I've had a lot of fun with past scholars, and they've done fantastic peer-reviewed work, I'm excited for my next batch!
7
36
177
0
4
40
@ArthurConmy
Arthur Conmy
17 days
notebooklm giveth, and notebooklm taketh away
Tweet media one
Tweet media two
0
1
37
@ArthurConmy
Arthur Conmy
1 year
Want to use ACDC? Use our code (built on top of @NeelNanda5 's TransformerLens) here:
Want to use a notebook? See
Want to dig into the ACDC algorithm and see the results? See the updated arXiv paper
7/N
1
4
36
@ArthurConmy
Arthur Conmy
7 months
This is a great way to do an interpretability investigation. A lot in this paper.
@saprmarks
Samuel Marks
7 months
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager , @ericjmichaud_ , @boknilev , @davidbau , @amuuueller
7
61
310
0
3
32
@ArthurConmy
Arthur Conmy
3 months
"Stealing Part of a Production Language Model" won an ICML best paper award! And shoutout to @itay__yona , who led a demonstration of 'stealing' the Pythia-14M unembedding matrix up to orthogonality in the updated arXiv version!
@rsalakhu
Russ Salakhutdinov
3 months
ICML 2024: Best Paper Awards: Florian Tramèr; Gautam Kamath; Nicholas Carlini: Considerations for Differentially Private Learning with Large-Scale Public Pretraining Akbir Khan; John Hughes; Dan Valentine; Laura Ruis; Kshitij Sachan; Ansh Radhakrishnan; Edward Grefenstette;
1
12
56
1
0
32
@ArthurConmy
Arthur Conmy
3 months
One concern about SAE research is that it won't transfer well to instruction-tuned chat models, or will require prohibitively expensive re-training for these new models. Excitingly, we find that SAEs trained on the base model transfer extremely well to a range of IT models.
@Connor_Kissane
Connor Kissane
3 months
New post with @robertzzk , @ArthurConmy , & @NeelNanda5 : Sparse Autoencoders (usually) Transfer between Base and Chat Models! This suggests that models' representations remain extremely similar after fine-tuning.
Tweet media one
3
4
42
0
3
32
@ArthurConmy
Arthur Conmy
17 days
This was another interesting finding from @Connor_Kissane :
- **Base** models trained today refuse like chat models: "I'm sorry, but I cannot provide assistance..."
- This even happened before ChatGPT, but it's a bit different, e.g. Llama-1 says "I'm sorry, I'm not familiar..."
@Connor_Kissane
Connor Kissane
17 days
New post w/ @ArthurConmy & @NeelNanda5 : Base LLMs refuse too! Just at a lower rate than chat models. This implies that refusing harmful requests is not a novel behavior learned during chat fine-tuning, contrary to popular belief.
Tweet media one
8
8
68
2
0
31
@ArthurConmy
Arthur Conmy
2 months
The technical report for Gemma Scope is on arXiv:
@_akhaliq
AK
2 months
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
discuss:
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features.
1
19
119
1
2
28
@ArthurConmy
Arthur Conmy
9 months
This seems like a paper that could slip under the radar of AI safety researchers, but it's an important contribution for monitoring. 👀 In CoT *generation*, logical steps are easier than attribution (extracting info from the prompt), but in CoT *verification*, attribution is easier
@alon_jacovi
Alon Jacovi
9 months
👋 Check out our new paper and benchmark: REVEAL, a dataset with step-by-step correctness labels for chain-of-thought reasoning in open-domain QA 🧵🧵🧵
Tweet media one
3
36
180
2
4
25
@ArthurConmy
Arthur Conmy
4 months
The success of Attention SAEs was a sizable update that sparse linear representations really closely match how LLMs work inside. Congrats @Connor_Kissane @robertzzk ! Training them Just Works, and they improve circuit analysis.
@NeelNanda5
Neel Nanda
4 months
Great post from my scholars @Connor_Kissane & @robertzzk ! SAEs are fashionable, but are they a useful tool for researchers? They are! We find a deeper understanding of the well-studied IOI circuit, and make a circuit analysis tool. $1000 bounty to whoever finds the best circuit!
2
11
73
0
2
26
@ArthurConmy
Arthur Conmy
10 months
I’ll be presenting our interpretability work (w/ coauthors) in just under two hours! Come say hello :)
@farairesearch
FAR.AI
10 months
🌟⚡🔍 #NeurIPS2023 Spotlight Poster: Discover how the ACDC algorithm skillfully identifies essential model components. Join the session on "Towards Automated Circuit DisCovery for Mechanistic Interpretability" on Dec 12, 5:15 PM CST poster #1503 by @AdriGarriga & team.
0
2
6
2
0
24
@ArthurConmy
Arthur Conmy
7 months
The dogit lens 🐶 and ViT interpretability!
@soniajoseph_
Sonia 🌻
7 months
I'm excited to release Prisma, a mechanistic interpretability library for multimodal models like CLIP and ViTs. Incubated at @tyrell_turing 's lab & in collab with @NeelNanda5 . Recent mech interp work has focused on language, but many techniques transfer. Behold, the dogit lens:
Tweet media one
14
80
506
2
1
23
@ArthurConmy
Arthur Conmy
7 months
Regardless, using gradient descent seems a plausible method to steal unembedding matrices from logprobs up to an orthogonal matrix. Could you get it to work?
Paper link:
Important context for using these attacks:
@brianryhuang
Brian Huang
8 months
did they just undercut openlogprobs? 😭
Tweet media one
3
6
105
0
1
23
@ArthurConmy
Arthur Conmy
7 months
I was really impressed when I first saw the alignment team's results in this paper, e.g. the cmd_injection transcript from Appendix D.2 where Gemini 1.0 successfully finds an exploit where bash commands can be injected into URLs 😯
Tweet media one
@tshevl
Toby
7 months
In 2024, the AI community will develop more capable AI systems than ever before. How do we know what new risks to protect against, and what the stakes are? Our research team at @GoogleDeepMind built a set of evaluations to measure potentially dangerous capabilities: 🧵
Tweet media one
8
47
235
1
1
23
@ArthurConmy
Arthur Conmy
1 year
“We present a single attention head in GPT-2 Small that has one main role across the entire training distribution … this is the most comprehensive description of the complete role of a component in a language model to date.” Excited to share our new paper!
@calsmcdougall
Callum McDougall
1 year
Can we reverse-engineer LLM components across the whole data distribution? Yes! We find a GPT-2 attention head that "copy suppresses": it responds to naive copying from earlier layers and suppresses it. This is almost the *only* thing the head does! 🤯🧵
Tweet media one
4
59
306
0
1
23
@ArthurConmy
Arthur Conmy
2 months
Tweet media one
1
1
22
@ArthurConmy
Arthur Conmy
8 months
I like this work from colleagues!
1. Gradient-based patching is unreasonably effective
2. Gradient-based patching bakes in assumptions about linearity which *almost* always hold
I think the fix that takes into account that the attention softmax is not linear is cool (post 2 in thread)
@JanosKramar
János Kramár
8 months
New @GoogleDeepmind mech interp research! 🎉 Can we massively speed up the process of finding important nodes in LLMs? Yes! Introducing AtP*, an improved variant of Attribution Patching (AtP) that beats all our baselines on efficiency and effectiveness. 🧵
Tweet media one
3
32
155
0
0
21
@ArthurConmy
Arthur Conmy
1 year
We’ve developed a method that is much faster than my previous work on automatically finding interpretable circuits in models. Great work from @aaquib_syed1 and @can_rager ! Explanation and surprising discoveries in this 🧵
@aaquib_syed1
Aaquib Syed
1 year
🔬 Interpretability could unravel the mysteries of LLMs! Our latest research finds an extremely efficient algorithm that generally outperforms existing work in identifying key subnetworks (circuits). Link . Explanation 👇🧵
Tweet media one
2
25
171
1
1
21
@ArthurConmy
Arthur Conmy
6 months
We just uploaded an update to our Gated SAEs paper to arXiv: We added more points to graphs, ran evals on a fixed set of tokens, and added numbers in the appendices.
Tweet media one
@sen_r
Senthooran Rajamanoharan
6 months
To show this works at scale, we train a range of SAEs (both normal and Gated) up to Gemma 7B on a range of layers and attention, MLP and residual activations, in the process showing it's practical to scale SAEs to 7B. Gated SAEs are consistently a Pareto improvement.
Tweet media one
1
2
11
0
1
20
@ArthurConmy
Arthur Conmy
3 months
Train 24k-feature SAEs on context-length-128 embeddings and they only learn ~50 features that represent position embeddings and nothing else. My guess is that the longer tail of ~150 positional features consists of ANDs of positions and some tokens (L1 training creates AND features).
@bilalchughtai_
Bilal Chughtai 🇵🇸
3 months
Very short informal research post investigating positional SAE features and SAE length generalization.
0
7
71
1
0
20
@ArthurConmy
Arthur Conmy
10 months
So happy this work is out! I think this thread summarizes our paper well :) A really nice mechanistic interpretability result; structure that recurs across model sizes and in different numeric tokens.
@rgould0
Rhys Gould
10 months
How do LLMs represent sequences? We find they use a single attention head to map ‘one’ to ‘two’, ‘Tuesday’ to ‘Wednesday’, and ‘June’ to ‘July’! 🧵➡️Our new mechinterp paper studies these ‘successor heads’, and finds they act on shared numerical structure in token embeddings!
Tweet media one
3
25
149
1
0
19
@ArthurConmy
Arthur Conmy
1 year
We used ACDC (as well as existing methods) to rederive the structures of circuits from previous works with no human in the loop! E.g. Greater-Than ( @michaelwhanna + @coulispi + @A_Variengien ), IOI ( @kevrowan et al.), tracr ( @davlindner et al.), Docstrings ( @sheimersheim + @jettjaniak )
Tweet media one
1
3
18
@ArthurConmy
Arthur Conmy
7 months
We can get more than this! If a model uses a LayerNorm before the unembedding, it projects the hidden state onto a subspace that is one dimension smaller. We can detect this by centering the output (see Appendix B.2), and we show how to steal whether models use LayerNorm/RMSNorm!
Tweet media one
1
0
16
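A sketch of the geometric fact in the tweet above, assuming a standard pre-unembedding LayerNorm (the notation here is mine, not the paper's):

```latex
% LayerNorm centres the hidden state, making it orthogonal to the all-ones direction:
\[
\mathrm{LN}(h) = \gamma \odot \frac{h - \mu(h)\mathbf{1}}{\sigma(h)} + \beta,
\qquad \mathbf{1}^\top\!\left(h - \mu(h)\mathbf{1}\right) = 0 .
\]
% Up to the affine (gamma, beta) terms, LN(h) lies in the (d-1)-dimensional hyperplane
% orthogonal to the all-ones vector, so the logits W_U LN(h) span at most d-1 dimensions.
% RMSNorm skips the centering, so no dimension is lost; measuring this rank gap from
% returned logprobs is what can distinguish LayerNorm from RMSNorm.
```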
@ArthurConmy
Arthur Conmy
9 months
Rob and Connor did a great job on this first post and if you're into Sparse Autoencoders, I'd highly recommend following their ongoing work understanding attention!
@NeelNanda5
Neel Nanda
9 months
Thanks to @AnthropicAI for the shout out to my MATS scholars' new blog post! Connor Kissane and @robertzzk showed that you can train sparse autoencoders on attention layers outputs, and it works great, giving sparse, interpretable features that allow deeper understanding.
Tweet media one
1
11
123
0
0
16
@ArthurConmy
Arthur Conmy
11 months
Cool paper! An interesting empirical question is whether the best jailbreaks on future models will be human-understandable or not. If they are human-understandable, there are a lot more strategies that could plausibly prevent them (e.g models supervising models)
@soroushjp
Soroush Pour
11 months
🧵📣New jailbreaks on SOTA LLMs. We introduce an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, incl. instructions for making meth & bombs.
Tweet media one
17
80
324
1
5
15
@ArthurConmy
Arthur Conmy
7 months
Finally, since Layer/RMS Norm sets activations to lie on a hypersphere, the image of the hidden state in the logprobs is an ellipsoid. Using this fact, we can steal* unembedding matrices up to an orthogonal matrix! *Unfortunately the ellipse equation was not tractable (App. G)!
Tweet media one
1
0
16
@ArthurConmy
Arthur Conmy
10 months
@thesephist @AnthropicAI Btw, I would check some other samples rather than just max activating examples and zero activations (as e.g. in Anthropic’s interface). This illusion effect holds even on diverse datasets (the paper only requires one page of reading iirc)
1
1
15
@ArthurConmy
Arthur Conmy
1 year
Really enjoyed the opportunity to talk about my research interests! Understanding models across their distribution may be possible, but we could fall short of this goal. Aiming at interpreting tasks and using automated tools could be how interpretability helps AI alignment
@labenz
Nathan Labenz
1 year
Excited to finally go deep on mechanistic interpretability – one of the more promising paths to long-term AI safety imho – with @ArthurConmy Listen to learn how "mechinterp" researchers are starting to scale their ability to see inside LLMs
2
0
6
0
3
14
@ArthurConmy
Arthur Conmy
3 months
@TutorVals I'm working on a collaboration proposing a ton more interp research directions. In the meantime:
- A subsection of ideas Neel mostly generated are in the paper (img with some of them attached)
-
-
Tweet media one
0
1
15
@ArthurConmy
Arthur Conmy
16 days
@thesophiaxu PhD decisions are mostly made by profs basically bidding for students and taking them because they personally want to work with the student. So -- maybe if the prof is particularly interested you could get a spot with them? Also, some unis have bad policies (e.g. Oxford and
2
0
15
@ArthurConmy
Arthur Conmy
11 months
@norabelrose @Blueyatagarasu Not helpful. I would guess that almost all ML researchers (besides the few who came from security) use black-box in the uninterpretable sense. E.g. they named a conference after this usage! ML people are surely the majority audience for your post
1
0
14
@ArthurConmy
Arthur Conmy
1 year
Another good and short paper on AI progress and risk that has up-to-date takes on responsible scaling policies (which should be called "evaluation-gated" scaling policies IMO). Made using distill, too!
@geoffreyhinton
Geoffrey Hinton
1 year
New paper: Companies are planning to train models with 100x more computation than today’s state of the art, within 18 months. No one knows how powerful they will be. And there’s essentially no regulation on what they’ll be able to do with these models.
165
728
3K
0
0
13
@ArthurConmy
Arthur Conmy
2 months
@jackclarkSF Gemini 1.0 Ultra (from the arXiv paper, excuse shit phone)? Parameter count not reported though (:
Tweet media one
1
2
14
@ArthurConmy
Arthur Conmy
1 month
@charles0neill I would overall like to see more nonlinear SAE variants, but I think you missed the key upside of linear SAEs -- linear representations are a lot more usable in pretty much all applications: steering vectors / subspaces for unlearning / circuit analysis with linear attribution
1
0
13
@ArthurConmy
Arthur Conmy
7 months
@joshua_clymer I like this! Some more I've thought of:
- Doing grunt work well is usually better than doing super clever things (scaling >>> clever architectures)
- A lot of tasks can be started by pattern matching to existing work (few-shot learning)
- Update beliefs w/ weighted averages (Adam)
1
0
12
@ArthurConmy
Arthur Conmy
8 months
Wow, these new "Sydney" jailbreaks are wild
Tweet media one
1
0
12
@ArthurConmy
Arthur Conmy
7 months
Happy to have supervised the continuing effort pushing Attention SAEs to the limit!
@NeelNanda5
Neel Nanda
7 months
Cool post from my MATS scholars @Connor_Kissane & Rob Krzyzanowski: We Inspected Every Head in GPT-2 Small With SAEs So You Don't Have To! The features in an attn output most aligned with a head let you get the head's "vibe". Rob, the madman, looked through all 144 heads!
Tweet media one
1
14
91
0
0
12
@ArthurConmy
Arthur Conmy
3 months
@teortaxesTex @MichaelTrazzi I've only been at GDM for 10 months lol
0
0
11
@ArthurConmy
Arthur Conmy
1 year
@michaelwhanna @coulispi @A_Variengien @kevrowan @davlindner @sheimersheim @jettjaniak How does ACDC automate interpretability? We consider Layer 0’s effect on Layer 1 independently of the effect Layer 0 has on Layer 2 (as introduced in ). Now we can measure the effect of any subset of edges (a circuit) in a neural network 3/N
1
0
10
@ArthurConmy
Arthur Conmy
10 months
@StephenLCasper
<- explaining neurons leads to advexes
<- mech interp generates hypotheses with good predictive power
<- a method that gets poor accuracy on neurons works really well on SAE features
2
0
11
@ArthurConmy
Arthur Conmy
1 year
@Simeon_Cps If the model ‘thinks’ the probability of answer A is p and the probability of answer B is 1-p, then to minimise E[-logprobs(correct)] it should put probability p on A and 1-p on B …
2
0
10
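A one-line check of the claim in the reply above, written out in LaTeX (my notation): with true belief p on A, reporting q on A gives expected loss L(q) = -p log q - (1-p) log(1-q), which is minimised at q = p.

```latex
\[
\frac{dL}{dq} = -\frac{p}{q} + \frac{1-p}{1-q} = 0
\;\Longleftrightarrow\; p(1-q) = q(1-p)
\;\Longleftrightarrow\; q = p .
\]
```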
@ArthurConmy
Arthur Conmy
5 months
@aryaman2020 I'm currently interested in validating that SAE features are causally relevant intermediate variables, extending our early evidence here: This bottoms out in doing a real task on a model (activation steering) so this doesn't kick the can down the road.
3
1
11
@ArthurConmy
Arthur Conmy
5 months
1
0
10
@ArthurConmy
Arthur Conmy
10 months
@JacquesThibs Nice thought. My work has made me feel it’s more likely that intelligence and the “real source of capabilities” is in a huge stack of heuristics no one human could comprehend…1/2
2
0
10
@ArthurConmy
Arthur Conmy
16 days
Separating task and ordinal with ICA is cool and new, too, building off @euan_ong 's work on separating task and ordinal space
2
0
10
@ArthurConmy
Arthur Conmy
1 year
When writing research code fast I love being able to
1. zip(..., strict=True) # ensures same lengths
2. torch.einsum("bij,ijk->bjk", ...) # ensures correct shape lengths
3. my_tensor.item() # ensures single elements only
what about ensuring tensors aren't silently broadcasting?
1
0
9
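One possible answer to the question above, as a minimal sketch (the helper name is hypothetical, not from the thread): fail loudly whenever two tensors in an elementwise op don't already have identical shapes.

```python
# Hypothetical helper: reject broadcast-compatible-but-unequal shapes before elementwise ops.
import torch

def assert_no_broadcast(a: torch.Tensor, b: torch.Tensor) -> None:
    assert a.shape == b.shape, f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}"

x = torch.randn(4, 1)
y = torch.randn(4, 8)
assert_no_broadcast(x, x)    # fine
# assert_no_broadcast(x, y)  # raises: (4, 1) vs (4, 8), which x * y would silently broadcast
```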
@ArthurConmy
Arthur Conmy
5 months
@S_OhEigeartaigh Probably inspired by Ilya's tweet: "I’m confident that OpenAI will build AGI that is both safe and beneficial"
@theojaffee
Theo
5 months
@liron None of these, except Kokotajlo, specifically cited OpenAI as lacking on x-risk. My bet is that if they really believed there was imminent x-risk, they would either stay at OpenAI and try to steer the ship from the inside, or ignore the NDA and whistleblow.
3
0
33
2
0
8
@ArthurConmy
Arthur Conmy
17 days
Also noticed here and here
@Miles_Brundage
Miles Brundage
17 days
Lol - Newsom’s letter says it is *bad* there’s a carveout for small models (which was intended as a proxy for small companies). Regardless of your views on the bill, CA Democrats do not seem to be trying particularly hard to coordinate + show there was some principle here.
14
15
249
0
0
9
@ArthurConmy
Arthur Conmy
11 months
@joshua_clymer IMO this overrates "research" and conceptual work (unless you only think AI policy is good work?)
* A lot of ML research is packaging an idea into a bunch of experiments and a LaTeX doc, and writing takeaways
* A lot of conceptual work is good synthesis
Both automatable
1
0
9
@ArthurConmy
Arthur Conmy
1 year
It's always good fun to dive deeper into the parts of research projects that are hard to communicate in main.tex. Thanks for the discussion @NeelNanda5 !
@NeelNanda5
Neel Nanda
1 year
New Paper Walkthrough: Automated Circuit Discovery A sign of mechanistic interpretability maturing as a field is methods becoming standardised and automated. @ArthurConmy 's work is a great distillation and initial step! We discuss his work and takeaways.
2
11
81
0
0
9
@ArthurConmy
Arthur Conmy
7 months
On the subject, also check out the awesome "Automatic Discovery of Visual Circuits" paper:
1
1
9
@ArthurConmy
Arthur Conmy
1 year
Joint work with @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga . Thanks to Redwood Research and @farairesearch for support. Towards understanding the capabilities of advanced machine learning models! N/N
0
0
9
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Firstly, I'm pretty sure that OpenAI realises that this method doesn't reveal too much and is expensive in research time and actual money, so their mitigations involved tweaking the logprobs returned rather than nuking logprobs access entirely.
1
0
9
@ArthurConmy
Arthur Conmy
1 year
Underrated concept relevant to AI: Punctuated Equilibria. In AI there tend to be periods of sudden rapid progress. E.g., in April 2022 PaLM, DALL-E 2 and SayCan were announced in the same week. AI progresses in discrete step changes (not linear progress or a single jump):
Tweet media one
1
0
9
@ArthurConmy
Arthur Conmy
1 year
@michaelwhanna @coulispi @A_Variengien @kevrowan @davlindner @sheimersheim @jettjaniak Specifically, we measure edge effects with "patching" experiments. Patching means counterfactually replacing a layer’s output on input 1 (x_orig) with its output on input 2 (x_new). Mechanistic interpretability studies edges bc they represent information flow through models 4/N
Tweet media one
1
0
9
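A minimal sketch of a single patching experiment as described in the tweet above (hypothetical module handle; the actual ACDC code patches individual edges via TransformerLens hooks rather than whole layers):

```python
# Sketch: counterfactually replace one layer's output on x_orig with its output on x_new.
import torch

def patch_layer_output(model, layer, x_orig, x_new):
    cache = {}

    # 1) Cache the layer's output on x_new.
    handle = layer.register_forward_hook(lambda mod, inp, out: cache.setdefault("new", out))
    with torch.no_grad():
        model(x_new)
    handle.remove()

    # 2) Re-run on x_orig, swapping in the cached output (returning a value from a
    #    forward hook overrides the module's output).
    handle = layer.register_forward_hook(lambda mod, inp, out: cache["new"])
    with torch.no_grad():
        patched_logits = model(x_orig)
    handle.remove()

    return patched_logits  # compare against model(x_orig) to measure this layer's effect
```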
@ArthurConmy
Arthur Conmy
2 months
@aryaman2020 I sure hope not. Dude patents everything he does
1
0
9
@ArthurConmy
Arthur Conmy
3 months
@casebash @MichaelTrazzi Yes, and also the AI took days rather than the strict 4.5 hours for the contestants. We didn’t operationalise the exact setup, but it would be unreasonable to suggest Michael hasn’t won the bet now
0
0
9
@ArthurConmy
Arthur Conmy
10 months
@nickcammarata Curious why you think mech interp is the highest leverage field in safety. If you think something big will happen soon, isn't generating evals for dangerous capabilities higher priority than basic science? (I do mech interp, but think about other things I could do sometimes)
1
0
8
@ArthurConmy
Arthur Conmy
7 months
@aryaman2020 @jeffreygwang To be clear, this paper required many independent, original steps since CLIP and particularly Inception don't work nicely in the naive transformer circuits frame. Further, auto-discovering car circuits and actually using circuit discovery for advexes are both awesome!
1
0
8
@ArthurConmy
Arthur Conmy
3 months
@banburismus_ Personally, I'd probably be compelled if someone took and showed that there's some decoder feature f in a residual stream SAE such that all MLP out and attention out SAE decoder features before f have cosine sim <0.75 with f
3
0
8
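The check proposed in the reply above is easy to state in code; here is a minimal sketch with made-up tensors standing in for real SAE decoder weights (the 0.75 threshold comes from the tweet, everything else is hypothetical):

```python
# Does any earlier MLP-out / attn-out decoder feature resemble residual feature f?
import torch
import torch.nn.functional as F

d_model = 2304
f = F.normalize(torch.randn(d_model), dim=0)                 # residual SAE decoder direction
earlier = F.normalize(torch.randn(10_000, d_model), dim=-1)  # stacked earlier-layer decoder rows

max_cos = (earlier @ f).max().item()
print(f"max cosine similarity with f: {max_cos:.3f}")        # < 0.75 would support the claim
```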
@ArthurConmy
Arthur Conmy
7 months
@Connor_Kissane @robertzzk @NeelNanda5 @JBloomAus During MATS, I worked with @calsmcdougall and @starship006_ rigorously interpreting model components across the whole training distribution:
@calsmcdougall
Callum McDougall
1 year
Can we reverse-engineer LLM components across the whole data distribution? Yes! We find a GPT-2 attention head that "copy suppresses": it responds to naive copying from earlier layers and suppresses it. This is almost the *only* thing the head does! 🤯🧵
Tweet media one
4
59
306
1
0
7
@ArthurConmy
Arthur Conmy
2 months
@_xjdr I like to think of the chat sys prompting style <You are an incredibly smart X, and you will help me do Y…> as the continuation of getting base models to simulate the smart output. And thinking about how to get the model back into base model mode is so important for jailbreaks
1
0
8
@ArthurConmy
Arthur Conmy
5 months
Stare deep into the eye of sauron and do not blink.
0
0
8
@ArthurConmy
Arthur Conmy
7 months
@aryaman2020 Don’t forget the model GPT-2 small, which you must load in code by writing “gpt2-small”. The hyphen is always irritating
0
0
8
@ArthurConmy
Arthur Conmy
1 year
We iteratively apply patching through a computational graph of a model to find the edges that have large effects on model performance when patched. We keep these edges so that they make up a sparse computational subgraph (e.g. just 68/32000 edges in the Greater-Than circuit) 5/N
Tweet media one
1
0
7
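Compressed pseudocode for the loop described in this thread (a sketch, not the released implementation; `edges_reverse_topological` and the metric are placeholders):

```python
# ACDC, schematically: sweep edges from outputs to inputs, prune any whose patching
# barely changes the task metric, and keep the rest as the discovered circuit.
def acdc(graph, metric, tau):
    circuit = set(graph.edges)                      # start from the full computational graph
    for edge in graph.edges_reverse_topological():  # outputs first, inputs last
        if abs(metric(circuit) - metric(circuit - {edge})) < tau:
            circuit.discard(edge)                   # edge barely matters: prune it
    return circuit                                  # sparse subgraph, e.g. 68 of 32000 edges
```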
@ArthurConmy
Arthur Conmy
5 months
@aryaman2020 @StephenLCasper 's criticism seems different; it's mostly a claim that SAEs will not be competitive with other techniques. This is a possible outcome of the research here ofc!
0
0
6
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Thirdly, I didn't help on this work because I think this is a crucial message for AI safety (though other authors may think this!). I like doing interesting research :) I'd feel sad if this paper in particular impacts open access to models for safety, so tell me if you see this
1
0
7
@ArthurConmy
Arthur Conmy
10 months
@jxmnop I don’t think “how many open questions are there in field X” tracks “how fast is field X moving” well in general. Also I read all of these, and only three of the ~20 (long text, limits of training data, multimodality) seem plausible AGI blockers in a post-GPT4 world
1
0
6
@ArthurConmy
Arthur Conmy
7 months
@StephenLCasper Secondly, I do think that there are extensions of this work that could steal many more facts. I have three important facts about models in mind that are plausibly stealable but won't mention them in case we build on this work (but will say in a month if nothing happens!)
1
0
7
@ArthurConmy
Arthur Conmy
5 months
@idavidrein @typedfemale Thanks for addressing criticism, but why do you think that "... you can just use that fraction [estimated correct evaluation datapoints as a fraction] as a ceiling that you don’t measure performance beyond"? An eval with 74% signal and 26% noise will become bad before 74% acc ...
1
0
5
@ArthurConmy
Arthur Conmy
1 year
There's a lot of discussion about adversarial attacks on generative models (such as prompt injection). It turns out that there exist adversarial attacks against real-world RL policies
@farairesearch
FAR.AI
1 year
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
89
470
1
0
7
@ArthurConmy
Arthur Conmy
1 year
I’m at my first in-person conference 🇷🇼! DM if you’d like to talk about Interpretability in the Wild, follow-up work or your thoughts on how to understand LLMs - I’d love to meet!
0
0
7
@ArthurConmy
Arthur Conmy
6 months
@ericjmichaud_ Most of our (current) compute is spent on getting internal LLM activations. I haven’t benchmarked how fast this is on GPUs. We give batch sizes and training steps in the paper (total = ~1 billion activations)
3
0
7
@ArthurConmy
Arthur Conmy
8 months
@sebkrier I would wait till the full version of the paper is out. It looked very rushed and the README shows a bunch of cases where their evals have false negatives. Another issue may be that the instruct-tuned models are formatting outputs in a way that their eval does not understand.
1
0
6
@ArthurConmy
Arthur Conmy
1 year
We're in the early stages of a sudden rush of AI governance progress. Several open letters/papers, all the big AI companies commenting on RSPs, ai dot gov launch, US AI safety institute launch, and the AI safety summit is just starting!
1
0
7
@ArthurConmy
Arthur Conmy
1 year
Where does ACDC fit into automated interpretability work? @OpenAI used LMs to understand components in smaller LMs. @atticus_geiger and @ZhengxuanZenWu 's DAS can test whether model components implement causal models. ACDC could find the components these methods test! 6/N
Tweet media one
Tweet media two
1
0
7
@ArthurConmy
Arthur Conmy
9 months
Great paper! There is so much structure inside neural networks.
@wesg52
Wes Gurnee
9 months
New paper! "Universal Neurons in GPT2 Language Models" How many neurons are independently meaningful? How many neurons reappear across models with different random inits? Do these neurons specialize into specific functional roles or form feature families? Answers below 🧵:
Tweet media one
7
67
407
0
0
5
@ArthurConmy
Arthur Conmy
1 year
This was great fun @labenz @CogRev_Podcast ! Keep an eye out for the release 👀
@labenz
Nathan Labenz
1 year
So excited to interview @ArthurConmy about mechanizing mechanistic interpretability for @CogRev_Podcast tomorrow that I wrote 1100+ words of questions. Let me know what I missed 😂
0
0
4
0
1
6
@ArthurConmy
Arthur Conmy
16 days
1
0
6