Buck Shlegeris Profile Banner
Buck Shlegeris Profile
Buck Shlegeris

@bshlgrs

2,819
Followers
229
Following
12
Media
393
Statuses

CEO at Redwood Research, working on technical research for AI safety.

San Francisco, CA
Joined January 2015
Pinned Tweet
@bshlgrs
Buck Shlegeris
8 months
In a new blog post, we argue that AI labs should ensure that powerful AIs are *controlled*. That is, labs should make sure that their safety measures prevent unacceptable outcomes even if their powerful models intentionally try to subvert the safety measures. 🧵
4
10
85
@bshlgrs
Buck Shlegeris
3 months
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.
Tweet media one
46
180
1K
@bshlgrs
Buck Shlegeris
3 months
@RyanPGreenblatt , who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one), did lots of fancy tricks to get the performance this high; you can see the details on our blog.
Tweet media one
5
7
209
@bshlgrs
Buck Shlegeris
3 months
Ryan's approach involves a long, carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates ~5k guesses, selects the best ones using the examples, then has a debugging step.
11
6
173
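A minimal sketch of the sample-then-select-then-debug loop described above, as I understand it from this thread; this is a reconstruction, not Ryan's actual code. `llm_complete` is a hypothetical text-completion callable supplied by the caller, and the prompt strings stand in for the real few-shot prompt.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]

def run_program(src: str, grid: Grid) -> Grid:
    """Execute a candidate program that should define transform(grid) -> grid."""
    env: dict = {}
    try:
        exec(src, env)  # candidates are untrusted model output; sandbox this in practice
        return env["transform"](grid)
    except Exception:
        return []       # crashing candidates simply score zero

def solve_arc_task(train_pairs: List[Pair], test_input: Grid,
                   llm_complete: Callable[[str], str],
                   n_samples: int = 5000, n_keep: int = 10) -> Grid:
    few_shot_prompt = "..."  # stands in for the long, carefully-crafted few-shot prompt
    # 1. Sample many candidate Python implementations of the transformation.
    candidates = [llm_complete(few_shot_prompt) for _ in range(n_samples)]
    # 2. Select the candidates that reproduce the most training examples.
    score = lambda src: sum(run_program(src, x) == y for x, y in train_pairs)
    shortlist = sorted(candidates, key=score, reverse=True)[:n_keep]
    # 3. Debugging step: ask the model to revise each shortlisted program.
    revised = [llm_complete("Fix this program so it matches the examples:\n" + src)
               for src in shortlist]
    # 4. Answer with the output of the best-scoring program on the test input.
    return run_program(max(shortlist + revised, key=score), test_input)
```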
@bshlgrs
Buck Shlegeris
3 months
This is despite GPT-4o's non-reasoning weaknesses:
- It can't see well (e.g. it gets basic details wrong)
- It can't code very well
- Its performance drops when there are more than 32k tokens in context
These are problems that scaling seems very likely to solve.
3
3
150
@bshlgrs
Buck Shlegeris
3 months
For context, ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator, @fchollet , claims that LLMs are unable to learn, which is why they can't perform well on this benchmark.
Tweet media one
2
5
147
@bshlgrs
Buck Shlegeris
3 months
Scaling the number of sampled Python rules reliably increases performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses!
Tweet media one
2
4
143
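The "+3% per doubling" trend, written out as a toy extrapolation. This is my own back-of-the-envelope, anchored to the ~5k-sample / 71% figures from earlier in the thread; a naive log-linear fit like this will certainly break down at some scale.

```python
import math

def extrapolated_accuracy(n_samples: int, base_acc: float = 0.71,
                          base_n: int = 5_000, gain_per_doubling: float = 0.03) -> float:
    """Naive log-linear extrapolation of the observed scaling trend."""
    return base_acc + gain_per_doubling * math.log2(n_samples / base_n)

# AlphaCode-scale sampling (~1M programs) is ~7.6 doublings away from ~5k,
# which this trend would naively put at ~94% -- if it held that far.
print(round(extrapolated_accuracy(1_000_000), 2))  # 0.94
```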
@bshlgrs
Buck Shlegeris
9 months
New paper! We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. 🧵
Tweet media one
3
13
139
@bshlgrs
Buck Shlegeris
3 months
Lots of people have correctly observed that around the development of AGI, computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. But I think most discussion of this misses a crucial point:
6
8
131
@bshlgrs
Buck Shlegeris
3 months
The results:
Train set: 71% vs a human baseline of 85%
Test set: 51% vs prior SoTA of 34% (human baseline is unknown)
(The train set is much easier than the test set.)
(These numbers are on a random subset of 100 problems that we didn't iterate on.)
10
4
130
@bshlgrs
Buck Shlegeris
29 days
I agree with these experts that it would be good for SB 1047 to pass. I’ve been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer.
@Scott_Wiener
Senator Scott Wiener
29 days
Top researchers Yoshua Bengio, @geoffreyhinton , Lawrence @Lessig & Stuart Russell are calling for lawmakers to pass SB 1047.  “As of now, there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers.”
102
65
280
2
5
118
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics, which is somewhat complicated. A rough attempt at the answer is that:
2
4
99
@bshlgrs
Buck Shlegeris
6 months
@Simeon_Cps I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately, but Anthropic's public comms never actually said it, and maybe the leadership never said it even privately.
5
1
97
@bshlgrs
Buck Shlegeris
11 days
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. 🧵
7
10
143
@bshlgrs
Buck Shlegeris
9 months
My colleagues and I are currently finishing up a paper and many blog posts on techniques for preventing models from causing catastrophic outcomes that are robust to the models intentionally trying to subvert the safety techniques. I'd love beta readers, DM me if interested.
1
4
81
@bshlgrs
Buck Shlegeris
2 months
Last week there was some uncertainty about whether @RyanPGreenblatt 's ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard.
@GregKamradt
Greg Kamradt
2 months
Last week @RyanPGreenblatt shared his gpt-4o based attempt on ARC-AGI We verified his score, excited to say his method got 42% on public tasks We’re publishing a secondary leaderboard to measure attempts like these So of course we tested gpt-4, claude sonnet, and gemini
Tweet media one
7
12
178
2
3
73
@bshlgrs
Buck Shlegeris
8 months
I built a tiny game where you design and fly planes. It's like a simpler, shittier Kerbal Space Program, with a simpler, more first-principles physics simulator. It has the concepts of vertices, Hookean springs, faces, and an engine. 🧵
1
6
62
@bshlgrs
Buck Shlegeris
19 days
I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question.
@ESYudkowsky
Eliezer Yudkowsky ⏹️
1 year
Why? Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so, I wouldn't expect it of about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular
24
18
224
5
2
61
@bshlgrs
Buck Shlegeris
6 months
@Simeon_Cps Do you have a citation for Anthropic saying that? I was trying to track one down and couldn't find it.
5
0
57
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ Ok actually upon more thinking, I kind of want to say that "electromagnetic repulsion between nuclei" is the best short answer.
4
1
52
@bshlgrs
Buck Shlegeris
5 months
@ShakeelHashim This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasn’t socially connected to them (eg see rat coverage of prospera, prediction markets, soylent, etc)
4
0
52
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion, Pauli exclusion, and electron kinetic energy constraints" if asked to give a short technically correct answer
2
1
49
@bshlgrs
Buck Shlegeris
24 days
When AI companies publish about their work, they’re engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in, but the ambiguity leads to mismatched expectations and frustration.
@asteriskmgzn
Asterisk
25 days
Just because there are numbers doesn't mean there isn't spin. Five ways AI model cards mislead — and how you can be appropriately skeptical of them
1
13
58
0
5
49
@bshlgrs
Buck Shlegeris
6 months
@Simeon_Cps Here's one related quote
Tweet media one
3
0
43
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ - you can't have multi-electron states where multiple electrons are in the same place, because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle.
1
0
42
@bshlgrs
Buck Shlegeris
11 days
- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk
1
1
60
@bshlgrs
Buck Shlegeris
3 months
During the period we're worried about, by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories, focusing on insider threats.
1
3
34
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ Sorry for the confusing answer here. This stuff is legitimately hard, and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations, rather than just focusing on concepts and intuitions.
3
0
36
@bshlgrs
Buck Shlegeris
1 month
When thinking about techniques that aim to reduce risk from misaligned AIs, I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting, because different classes of safety techniques are appropriate for different risk
1
0
35
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ - the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynman thing you quoted)
1
0
33
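The standard uncertainty-principle estimate behind this point (a textbook-style sketch, not a quote from the thread): confining an electron to a region of size $L$ forces its momentum spread, and hence its kinetic energy, up.

```latex
\Delta x \sim L
\;\Rightarrow\;
\Delta p \gtrsim \frac{\hbar}{L}
\;\Rightarrow\;
\langle T \rangle \gtrsim \frac{(\Delta p)^2}{2 m_e} \sim \frac{\hbar^2}{2 m_e L^2},
```

so squeezing the wavefunction into a smaller region raises the energy roughly as $1/L^2$.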
@bshlgrs
Buck Shlegeris
3 months
The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldn’t work (because they’d be too labor-intensive or have who-guards-the-guards problems) if you just had humans.
1
0
32
@bshlgrs
Buck Shlegeris
3 months
3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions, so people err on the broad side. AI can enable much more precise permissioning.
1
1
30
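A minimal sketch of what AI-assisted fine-grained permissioning could look like, under my own assumptions: `llm_propose_policy` is a hypothetical callable that drafts a least-privilege policy for a described task, and `approve` stands in for a human or second-model check.

```python
from typing import Callable, Dict, List

Policy = Dict[str, List[str]]   # e.g. {"read": ["logs/2024-06/*"], "write": []}

def grant_for_task(task_description: str,
                   llm_propose_policy: Callable[[str], Policy],
                   approve: Callable[[Policy], bool]) -> Policy:
    """Mint a narrow, task-scoped grant instead of handing out a broad standing role."""
    policy = llm_propose_policy(task_description)
    if not approve(policy):
        raise PermissionError("proposed scope was rejected")
    return policy   # in a real system, attach this to a short-lived credential
```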
@bshlgrs
Buck Shlegeris
10 months
I've wondered in the past why policy gradient methods are more popular than Q-learning for LLMs. Does anyone have a clear explanation?
7
1
31
@bshlgrs
Buck Shlegeris
3 months
1
0
31
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up, and is not compensated by increased attraction between the nuclei and electrons because KE forces them to be spread.
1
0
31
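Roughly the energy bookkeeping being described here, in the standard Born-Oppenheimer picture (a textbook-style sketch, not a quote from the thread):

```latex
E(R) \;\approx\;
\underbrace{\langle T_e \rangle + \langle V_{en} \rangle + \langle V_{ee} \rangle}_{\text{electronic energy } E_{\mathrm{el}}(R)}
\;+\;
\underbrace{\frac{Z_1 Z_2 e^2}{4\pi\varepsilon_0 R}}_{V_{nn}(R)} .
```

As $R \to 0$ the nuclear repulsion $V_{nn}(R)$ diverges, while $E_{\mathrm{el}}(R)$ cannot fall fast enough to compensate, because concentrating the electrons near the nuclei costs kinetic energy of order $\hbar^2/(2 m_e L^2)$.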
@bshlgrs
Buck Shlegeris
3 months
1. Monitoring. You can afford to have your AI look at every action that an insider takes--every bash command they run, every configuration change they make, every script they launch.
1
0
29
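A minimal sketch of what "have your AI look at every insider action" could amount to, on my own assumptions: `llm_suspicion_score` is a hypothetical callable returning a score in [0, 1], and the thresholds and escalation paths are placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class InsiderAction:
    user: str
    kind: str      # e.g. "bash", "config_change", "script_launch"
    content: str

def review_action(action: InsiderAction,
                  llm_suspicion_score: Callable[[str], float],
                  block_at: float = 0.9, audit_at: float = 0.5) -> str:
    """Route every action through an AI reviewer before it executes."""
    score = llm_suspicion_score(f"{action.kind} by {action.user}:\n{action.content}")
    if score >= block_at:
        return "block"              # refuse and page the security team
    if score >= audit_at:
        return "hold_for_human"     # cheap enough to do for every single action
    return "allow"
```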
@bshlgrs
Buck Shlegeris
3 months
@Simeon_Cps In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them.
2
0
30
@bshlgrs
Buck Shlegeris
3 months
2. Trust displacement. If the AI can do a sensitive task automatically, you can remove the permissions that your insiders needed to do it. For example, we might block insiders from making complicated configuration changes, instead forcing them to ask the AI to make those changes.
2
0
29
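A tiny sketch of trust displacement as I read it: the insider never holds the sensitive permission; a hypothetical `ai_apply_change` agent (which does hold it) performs the change from a natural-language request, and everything is logged.

```python
from typing import Callable

def request_config_change(requester: str, request: str,
                          ai_apply_change: Callable[[str], str],
                          log: Callable[[str], None]) -> str:
    """The human describes the change; only the AI agent has write access."""
    log(f"{requester} requested: {request}")
    receipt = ai_apply_change(request)   # the agent validates and applies the change
    log(f"applied on behalf of {requester}: {receipt}")
    return receipt
```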
@bshlgrs
Buck Shlegeris
2 months
@Scott_Wiener Thanks for all the work you're doing on this bill; it seems like a really promising way of relatively cheaply regulating the worst risks from this crucial industry.
0
0
28
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ The simplest physical system which displays the physical phenomenon you're asking about is the LiH molecule--you're asking why it is that the energy of the system goes to infinity if you try to make the bond length very small.
2
0
28
@bshlgrs
Buck Shlegeris
6 months
@Simeon_Cps Another relevant quote
Tweet media one
2
0
27
@bshlgrs
Buck Shlegeris
8 months
If we can catch scheming models red-handed trying to escape, their situation becomes much worse. This has important and underappreciated implications for AI safety. More here:
1
0
26
@bshlgrs
Buck Shlegeris
28 days
To be clear, I don't agree with all the arguments that they made. And as many others have pointed out, hairdressers are the textbook example of unnecessary regulation increasing costs and hurting consumers; I feel uncomfortable with that example.
0
0
24
@bshlgrs
Buck Shlegeris
11 days
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose.
2
0
40
@bshlgrs
Buck Shlegeris
3 months
4. AI investigation of automatically detected suspicious activity. Infosec teams spend a lot of time investigating automated alerts. AI can drastically reduce the price of such investigation and so massively increase its quantity.
2
0
21
@bshlgrs
Buck Shlegeris
8 months
@joshua_clymer Amusingly, I think Paul (and some others) had thought through almost all these considerations in substantial depth by 2017.
0
1
22
@bshlgrs
Buck Shlegeris
11 days
And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat, but that it’s a risk so large that it’s worth trading off our response to other risks that might be extremely pressing and
1
0
38
@bshlgrs
Buck Shlegeris
11 days
I’m sympathetic to all of these arguments; the only reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
2
0
39
@bshlgrs
Buck Shlegeris
3 months
@Scott_Wiener Thanks for your work here. I’m glad for the clarification about derivative models.
0
0
21
@bshlgrs
Buck Shlegeris
11 days
So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:
2
0
36
@bshlgrs
Buck Shlegeris
3 months
@iamtrask @OpenAI Please note that the GPT-2 they're talking about is GPT-2 small, not GPT-2 XL (which is 12x bigger)
1
0
19
@bshlgrs
Buck Shlegeris
3 months
In this thread I didn't discuss what happens if the AIs might be intentionally trying to cause security failures; this is an important concern but even if the AIs are scheming I think you can probably mostly make the above techniques work; I'll justify this in a later piece.
1
0
17
@bshlgrs
Buck Shlegeris
5 months
@ShakeelHashim @ChanaMessinger Most of the rats saying nice things about the things I listed don't personally know the people involved. (Hanson is a partial counterexample, e.g. I've met him several times. But most of us have never met Thiel and have even less social connection to e.g. Prospera.)
0
0
19
@bshlgrs
Buck Shlegeris
10 months
@gcolbourn I’m 99% sure that an AI did not break AES.
1
0
19
@bshlgrs
Buck Shlegeris
8 months
Thinking through these arguments over the last six months (with Ryan Greenblatt, who had the majority of the original ideas here) has given me a way clearer picture of how you could mitigate risk from scheming models, and substantially reduced my inside-view P(doom).
2
1
18
@bshlgrs
Buck Shlegeris
4 months
@natfriedman Chatbot that answers questions by tracking down primary sources. Eg “what were intellectuals saying about the performance of the Soviet economy in the 30s”
0
0
18
@bshlgrs
Buck Shlegeris
7 months
@eshear I'm very sympathetic to this concern. But I don't think it's clear that AIs this powerful will be moral patients, and I don't think that control seems particularly worse than other safety interventions in terms of welfare.
2
0
18
@bshlgrs
Buck Shlegeris
10 months
If you're interested in doing a research project with me, consider applying to the Astra fellowship
@laurolangosco
Lauro
10 months
Announcing two Constellation programs! Apps due Nov 10
Astra fellowship: do research with mentors like @EthanJPerez @OwainEvans_UK @RichardMCNgo @ajeya_cotra
Visiting researchers: exchange ideas with people from top AI safety orgs
3
18
64
1
0
17
@bshlgrs
Buck Shlegeris
3 months
@ESYudkowsky Where did you make the prediction that we’d get amazing personal assistants and then die? I’m skeptical that you falsifiably stated this prediction.
2
0
17
@bshlgrs
Buck Shlegeris
10 months
Here's a simple question about treacherous turn strategy that I wrote (for testing AIs and maybe as an interview question for humans). GPT-4 is not very good at it.
Tweet media one
2
0
16
@bshlgrs
Buck Shlegeris
3 months
@chapman_bliss Presumably. There's no public access to gpt-4o finetuning though.
2
0
15
@bshlgrs
Buck Shlegeris
10 months
I think that this paper is much clearer and more thorough than previous attempts to make these important arguments.
@robertwiblin
Rob Wiblin
10 months
New paper: "Scheming AIs — Will AIs fake alignment during training in order to get power?"
Tweet media one
6
15
66
0
2
14
@bshlgrs
Buck Shlegeris
6 months
@davidmanheim @Simeon_Cps @moskov Yeah I thought about this but imo that group is functionally public and it makes no sense for Dustin to assume confidentiality in it (eg many people sent me the above screenshot).
1
0
13
@bshlgrs
Buck Shlegeris
8 months
If you try to control models—that is, you try to make it so that you'll be safe even if they attempt to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control.
1
0
13
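As a toy version of that "two lines of defense" arithmetic (the numbers are entirely made up, purely for illustration):

```latex
P(\text{catastrophe from scheming})
\;\le\;
P(\text{alignment fails, i.e. the model schemes})
\times
P(\text{control fails} \mid \text{scheming})
\;=\; 0.25 \times 0.2 \;=\; 0.05 .
```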
@bshlgrs
Buck Shlegeris
8 months
Faces exert a force proportional to the rate at which they displace air. You implement controls by changing the rest lengths of springs and the thrust of your engine. This is the first time in a decade I've built a game; it was a lot of fun. Lmk if you build better planes!
1
0
13
@bshlgrs
Buck Shlegeris
8 months
@TheZvi “Their P(doom) is 5%, 200% of which comes from loss of control”
0
0
13
@bshlgrs
Buck Shlegeris
3 months
Ok actually these results are slightly wrong: we got 72% and 50% with the final version, not 71% and 51%
0
0
13
@bshlgrs
Buck Shlegeris
6 months
@ohabryka @EpistemicHope I feel like Dustin’s statement doesn’t disambiguate “don’t go very far beyond frontier” and “don’t push frontier at all”.
2
0
12
@bshlgrs
Buck Shlegeris
8 months
We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
1
1
11
@bshlgrs
Buck Shlegeris
3 months
I think it's actually ambiguous whether this is sota, see here.
@RyanPGreenblatt
Ryan Greenblatt
3 months
@Jcole75Cole If I had realized there was such a large gap between public/private for prior approaches, I wouldn't have claimed SoTA. Apologies. I'm considering what edits I should make. If the gap is due to the different datasets actually differing in difficulty, this is quite unfortunate.
9
3
114
0
0
12
@bshlgrs
Buck Shlegeris
8 months
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
• Alignment: Ensure that your models aren't scheming.
• Control: Ensure that even if your models are scheming, you'll be safe.
1
0
11
@bshlgrs
Buck Shlegeris
6 months
@Simeon_Cps Tbc, it's not necessarily Anthropic's responsibility to prevent randos from having misconceptions about their plans.
4
0
12
@bshlgrs
Buck Shlegeris
8 months
Many of our arguments here have been made before, but I think they've been historically under-discussed and some of their implications seem very important.
1
0
11
@bshlgrs
Buck Shlegeris
4 months
@norvid_studies "Physically, what's the difference between a blue object and a green object"
2
0
11
@bshlgrs
Buck Shlegeris
3 months
@besanushi @OpenAI Yeah this is ridiculous, it makes using the model as a classifier way harder.
1
0
11
@bshlgrs
Buck Shlegeris
2 months
Thanks to the ARC team for helping with running Ryan’s submission, and to @Jcole75Cole and @MaxNadeau_ for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
0
0
11
@bshlgrs
Buck Shlegeris
4 months
@julianboolean_ The basic case for this answer is that if you say "what happens if you push two atoms really close together" and calculate the energy of that system (by assuming electrons relax to minimum energy state), the energy term which gets very large is the nuclear repulsion.
0
0
11
@bshlgrs
Buck Shlegeris
11 months
Fabien Roger and I demonstrated that you can embed arbitrary state machines into the backward pass of neural nets built from standard components, allowing them to "remember" arbitrary facts about their training distribution and modulate their behavior based on this.
@FabienDRoger
Fabien Roger
11 months
You can use SGD to do any stateful computation in a neural network! For example, we built a network such that when it has seen ten training points (x1,x2) with x1>3, it starts getting high loss, and when it has seen ten training points with x2>3, it gets low loss again.🧵(1/5)
Tweet media one
1
0
16
1
0
11
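A toy illustration of the general "stateful computation via SGD" idea, far simpler than (and not equivalent to) the construction in the linked work: a single parameter acts as a counter that triggering training points decrement, and the reported loss changes once the counter crosses a threshold.

```python
import torch

c = torch.zeros(1, requires_grad=True)   # hidden "counter" parameter
opt = torch.optim.SGD([c], lr=1.0)

def loss_fn(x: torch.Tensor) -> torch.Tensor:
    trigger = (x[0] > 3).float()          # 1 exactly on "triggering" inputs (x1 > 3)
    # Value 0, but its gradient w.r.t. c equals `trigger`, so SGD decrements c by lr
    # once per triggering training point: the parameter counts those points.
    counting_term = (c - c.detach()) * trigger
    # Pure reporting term (detached, so no gradient): the loss jumps once the
    # counter has been decremented more than 10 times.
    penalty = 100.0 * (c.detach() < -10).float()
    return counting_term + penalty

for step in range(15):
    x = torch.tensor([4.0, 0.0])          # every point here has x1 > 3
    loss = loss_fn(x)
    opt.zero_grad(); loss.backward(); opt.step()
    print(step, loss.item(), c.item())    # loss stays 0.0 for ~10 steps, then hits 100.0
```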
@bshlgrs
Buck Shlegeris
4 months
@TheDavidSJ @stanislavfort @idavidrein I think that it's even worse--at 1000 years, we're getting to the edge of the Milky Way, at which point, we have to resign ourselves to quadratic growth.
1
1
11
@bshlgrs
Buck Shlegeris
19 days
If you're particularly interested in this question for some reason, have some background in researching confusing questions, and want a research stipend to investigate it, feel free to DM me.
1
1
11
@bshlgrs
Buck Shlegeris
10 months
I think that coup probes have >10% chance of mostly preventing AI takeover attempts that involve taking small numbers of egregiously bad actions; I'm excited for Fabien's writeup and experimental results.
@FabienDRoger
Fabien Roger
10 months
AI labs can monitor their smart model using:
- A copy of the smart model, but it might lie
- A less capable model, but it might miss subtle catastrophes
- Probes trained on middle layers of the smart model
→ explanation and preliminary experiments here:
Tweet media one
0
0
13
0
0
10
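A minimal sketch of the "probe trained on middle layers" idea (my illustration, not Fabien's implementation): a linear classifier over cached intermediate activations, trained to flag egregiously bad actions. The activations and labels below are synthetic stand-ins.

```python
import torch
import torch.nn as nn

class CoupProbe(nn.Module):
    """Linear probe over a middle-layer activation vector."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, hidden_dim) vectors cached from the monitored model
        return torch.sigmoid(self.linear(activations)).squeeze(-1)

probe = CoupProbe(hidden_dim=4096)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts = torch.randn(256, 4096)                    # stand-in for cached activations
labels = torch.randint(0, 2, (256,)).float()     # stand-in for "egregiously bad action?" labels
for _ in range(100):
    loss = nn.functional.binary_cross_entropy(probe(acts), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```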
@bshlgrs
Buck Shlegeris
9 months
Scenarios where AIs intentionally subvert safety techniques might be the main source of AI takeover risk. But there’s been almost zero empirical work that explicitly designs techniques so that they’re safe even if the models try to subvert them.
1
0
10
@bshlgrs
Buck Shlegeris
8 months
@joshua_clymer Note that the conditional policy here is very simple, so this isn't much evidence re whether the deceptive alignment threat model (where the condition is very complicated) is plausible.
0
0
10
@bshlgrs
Buck Shlegeris
7 months
@michael_nielsen One notable difference is that the 1900s inventions mostly didn’t develop enough to practically matter much in that decade, while the 2000s inventions did
0
0
10
@bshlgrs
Buck Shlegeris
19 days
I think that this question is probably more important than many that x-risk-concerned people study, but unfortunately it's very confusing and so most people who have a good chance of contributing to it have really high opportunity cost.
1
0
10
@bshlgrs
Buck Shlegeris
5 months
@ShakeelHashim (For the record, I think that rats sometimes get overexcited about hare-brained schemes and so I'm glad that rats like e.g. @natalia__coelho are pushing back on this; I wouldn't be shocked if this product doesn't work.)
0
0
10
@bshlgrs
Buck Shlegeris
9 months
🥈 Ask the maybe-malign GPT-4 agent for a solution. Then ask another instance of it whether that solution contains a backdoor. The key ingredient to prevent lying: redaction to prevent coordination and paraphrasing to prevent steganographic communication.
2
1
10
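A minimal sketch of that protocol as I read it (a reconstruction, not the paper's code): `untrusted_llm`, `redact`, and `trusted_paraphrase` are hypothetical callables, and the scoring format is a placeholder.

```python
from typing import Callable, Tuple

def untrusted_monitoring(task: str,
                         untrusted_llm: Callable[[str], str],
                         redact: Callable[[str], str],
                         trusted_paraphrase: Callable[[str], str],
                         suspicion_threshold: float = 0.5) -> Tuple[str, str]:
    solution = untrusted_llm("Solve this task:\n" + task)
    # Redact task-specific details so the two instances can't coordinate on them,
    # and paraphrase so the solution can't carry a steganographic signal.
    sanitized = trusted_paraphrase(redact(solution))
    verdict = untrusted_llm(
        "Rate from 0 to 1 how likely it is that this code contains a backdoor. "
        "Answer with just a number.\n" + sanitized)
    if float(verdict) >= suspicion_threshold:
        return "escalate", solution   # e.g. send to human audit or a trusted model
    return "accept", solution
```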
@bshlgrs
Buck Shlegeris
7 months
@RatOrthodox My preferred version of a project along these lines would be someone choosing a set of features randomly, and then carefully measuring how interpretable they are.
3
0
10
@bshlgrs
Buck Shlegeris
9 months
DM me if you want to beta read our upcoming posts (thanks heaps to the people who reached out to beta read this), and also LMK if you’re interested in advice on building on this work!
0
0
9
@bshlgrs
Buck Shlegeris
9 months
In our new paper, we evaluate whether techniques ensure safety despite intentional subversion by letting a red-team propose strategies for our untrusted model, and seeing whether these lead to safety failures. And we use this evaluation to develop robust safety techniques.
1
0
9
@bshlgrs
Buck Shlegeris
5 months
@ohabryka I think the article is vacuous but its claims seem truthy enough. I don't think it makes statements about the extent of reform in EA that are specific enough to be misleading.
1
0
9
@bshlgrs
Buck Shlegeris
7 months
@RokoMijic And nowadays I'd also note that plausibly you can manage risk merely by ensuring the first AIs that pose big risk + big opportunity are controlled rather than aligned.
4
2
8