In a new blog post, we argue that AI labs should ensure that powerful AIs are *controlled*. That is, labs should make sure that their safety measures prevent unacceptable outcomes even if their powerful models intentionally try to subvert the safety measures. 🧵
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.
@RyanPGreenblatt
, who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one), did lots of fancy tricks to get the performance this high; you can see the details on our blog.
Ryan's approach involves a long, carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates ~5k guesses, selects the best ones using the examples, then has a debugging step.
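A minimal sketch of this generate-and-select loop. Everything here is hypothetical: the real pipeline samples model-written Python source from GPT-4o and executes it, whereas this toy stands in candidate functions for sampled programs.

```python
import random

def propose_programs(examples, n_samples):
    # Stand-in for sampling candidate transformation programs from an LLM.
    # Each "program" here is just a function; the real pipeline executes
    # model-generated Python source instead.
    candidates = [
        lambda grid: [[v + 1 for v in row] for row in grid],  # wrong rule
        lambda grid: [row[::-1] for row in grid],             # correct rule
        lambda grid: grid,                                    # identity
    ]
    return [random.choice(candidates) for _ in range(n_samples)]

def score(program, examples):
    # A candidate scores 1 for each training pair it reproduces exactly.
    return sum(program(inp) == out for inp, out in examples)

def solve(examples, test_input, n_samples=200):
    # Generate many guesses, then select the best one using the examples.
    programs = propose_programs(examples, n_samples)
    best = max(programs, key=lambda p: score(p, examples))
    return best(test_input)

# Task: the hidden rule reverses each row.
examples = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]
print(solve(examples, [[7, 8, 9]]))  # [[9, 8, 7]] with overwhelming probability
```

The debugging step (revising near-miss programs against the examples they fail) is omitted here; the key structure is just sample-many, then select-by-examples.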
This is despite GPT-4o's non-reasoning weaknesses:
- It can't see well (e.g. it gets basic details wrong)
- It can't code very well
- Its performance drops when there are more than 32k tokens in context
These are problems that scaling seems very likely to solve.
For context, ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator,
@fchollet
, claims that LLMs are unable to learn, which is why they can't perform well on this benchmark.
Scaling the number of sampled Python rules reliably increases performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses!
New paper! We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. 🧵
Lots of people have correctly observed that around the development of AGI, computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. But I think most discussion of this misses a crucial point:
The results:
Train set: 71% vs a human baseline of 85%
Test set: 51% vs prior SoTA of 34% (human baseline is unknown)
(The train set is much easier than the test set.)
(These numbers are on a random subset of 100 problems that we didn't iterate on.)
I agree with these experts that it would be good for SB 1047 to pass. I’ve been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer.
Top researchers Yoshua Bengio,
@geoffreyhinton
, Lawrence
@Lessig
& Stuart Russell are calling for lawmakers to pass SB 1047.
“As of now, there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers.”
@julianboolean_
This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics, which is somewhat complicated. A rough attempt at the answer is that:
@Simeon_Cps
I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately, but Anthropic's public comms never actually said it, and maybe the leadership never said it even privately.
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. 🧵
My colleagues and I are currently finishing up a paper and many blog posts on techniques for preventing models from causing catastrophic outcomes that are robust to the models intentionally trying to subvert the safety techniques. I'd love beta readers, DM me if interested.
Last week there was some uncertainty about whether
@RyanPGreenblatt
's ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard.
Last week
@RyanPGreenblatt
shared his gpt-4o based attempt on ARC-AGI
We verified his score, excited to say his method got 42% on public tasks
We’re publishing a secondary leaderboard to measure attempts like these
So of course we tested gpt-4, claude sonnet, and gemini
I built a tiny game where you design and fly planes. It's like a simpler, shittier Kerbal Space Program, with a simpler, more first-principles physics simulator. It has the concepts of vertices, Hookean springs, faces, and an engine. 🧵
I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question.
Why? Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so, I wouldn't expect it in about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular
@ShakeelHashim
This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasn’t socially connected to them (eg see rat coverage of prospera, prediction markets, soylent, etc)
@julianboolean_
Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion, pauli exclusion, and electron kinetic energy constraints" if asked to give a short technically correct answer
When AI companies publish about their work, they’re engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in, but the ambiguity leads to mismatched expectations and frustration.
@julianboolean_
- you can't have multi-electron states where multiple electrons are in the same place, because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle.
- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk
During the period we're worried about, by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories, focusing on insider threats.
@julianboolean_
Sorry for the confusing answer here. This stuff is legitimately hard, and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations, rather than just focusing on concepts and intuitions.
When thinking about techniques that aim to reduce risk from misaligned AIs, I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting, because different classes of safety techniques are appropriate for different risk levels.
@julianboolean_
- the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynman thing you quoted)
The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldn’t work (because they’d be too labor-intensive or have who-guards-the-guards problems) if you just had humans.
3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions, so people err broad. AI can enable much more precise permissioning.
@julianboolean_
The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up, and is not compensated by increased attraction between the nuclei and electrons, because kinetic energy constraints force the electrons to stay spread out.
1: Monitoring. You can afford to have your AI look at every action that an insider takes--every bash command they run, every configuration change they make, every script they launch.
@Simeon_Cps
In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them.
2. Trust displacement. If the AI can do a sensitive task automatically, you can remove the permissions that your insiders needed to do it. For example, we might block insiders from making complicated configuration changes, instead forcing them to ask the AI to make those changes.
@Scott_Wiener
Thanks for all the work you're doing on this bill; it seems like a really promising way of relatively cheaply regulating the worst risks from this crucial industry.
@julianboolean_
The simplest physical system which displays the physical phenomenon you're asking about is the LiH molecule--you're asking why it is that the energy of the system goes to infinity if you try to make the bond length very small.
If we can catch scheming models red-handed trying to escape, their situation becomes much worse. This has important and underappreciated implications for AI safety. More here:
To be clear, I don't agree with all the arguments that they made. And as many others have pointed out, hairdressers are the textbook example of unnecessary regulation increasing costs and hurting consumers; I feel uncomfortable with that example.
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose.
4. AI investigation of automatically detected suspicious activity. Infosec teams spend a lot of time investigating automated alerts. AI can drastically reduce the price of such investigation and so massively increase its quantity.
And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat, but that it’s a risk so large that it’s worth trading off our response to other risks that might be extremely pressing and
I’m sympathetic to all of these arguments; the only reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:
In this thread I didn't discuss what happens if the AIs might be intentionally trying to cause security failures; this is an important concern but even if the AIs are scheming I think you can probably mostly make the above techniques work; I'll justify this in a later piece.
@ShakeelHashim
@ChanaMessinger
Most of the rats saying nice things about the things I listed don't personally know the people involved. (Hanson is a partial counterexample, e.g. I've met him several times. But most of us have never met Thiel and have even less social connection to e.g. Prospera.)
Thinking through these arguments over the last six months (with Ryan Greenblatt, who had the majority of the original ideas here) has given me a way clearer picture of how you could mitigate risk from scheming models, and substantially reduced my inside-view P(doom).
@natfriedman
Chatbot that answers questions by tracking down primary sources. Eg “what were intellectuals saying about the performance of the Soviet economy in the 30s”
@eshear
I'm very sympathetic to this concern. But I don't think it's clear that AIs this powerful will be moral patients, and I don't think that control seems particularly worse than other safety interventions in terms of welfare.
@ESYudkowsky
Where did you make the prediction that we’d get amazing personal assistants and then die? I’m skeptical that you falsifiably stated this prediction.
Here's a simple question about treacherous turn strategy that I wrote (for testing AIs and maybe as an interview question for humans). GPT-4 is not very good at it.
@davidmanheim
@Simeon_Cps
@moskov
Yeah I thought about this but imo that group is functionally public and it makes no sense for Dustin to assume confidentiality in it (eg many people sent me the above screenshot).
If you try to control models—that is, you try to make it so that you'll be safe even if they attempt to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control.
Faces exert a force proportional to the rate at which they displace air. You implement controls by changing the rest lengths of springs and the thrust of your engine. This is the first time in a decade I've built a game; it was a lot of fun. Lmk if you build better planes!
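The Hookean spring part of a simulator like this is simple to sketch. This is not the game's actual code; the function name and 2D tuple representation are made up for illustration.

```python
import math

def spring_force(p1, p2, rest_length, k):
    # Hooke's law along the line between two vertices:
    # F = k * (length - rest_length), directed along the spring.
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    magnitude = k * (length - rest_length)
    ux, uy = dx / length, dy / length
    # Force on p1 points toward p2 when stretched, away when compressed.
    return (magnitude * ux, magnitude * uy)

# A spring stretched to twice its rest length pulls p1 toward p2.
print(spring_force((0, 0), (2, 0), rest_length=1.0, k=10.0))  # (10.0, 0.0)
```

Changing `rest_length` at runtime is how the control scheme described above would work: shortening a spring's rest length contracts it, deflecting the attached faces.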
We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
@Jcole75Cole
If I had realized there was such a large gap between public/private for prior approaches, I wouldn't have claimed SoTA. Apologies. I'm considering what edits I should make.
If the gap is due to the different datasets actually differing in difficulty, this is quite unfortunate.
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
• Alignment: Ensure that your models aren't scheming.
• Control: Ensure that even if your models are scheming, you'll be safe
Many of our arguments here have been made before, but I think they've been historically under-discussed and some of their implications seem very important.
Thanks to the ARC team for helping with running Ryan’s submission, and to
@Jcole75Cole
and
@MaxNadeau_
for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
@julianboolean_
The basic case for this answer is that if you say "what happens if you push two atoms really close together" and calculate the energy of that system (by assuming electrons relax to minimum energy state), the energy term which gets very large is the nuclear repulsion.
Fabien Roger and I demonstrated that you can embed arbitrary state machines into the backward pass of neural nets built from standard components, allowing them to "remember" arbitrary facts about their training distribution and modulate their behavior based on this.
You can use SGD to do any stateful computation in a neural network!
For example, we built a network such that when it has seen ten training points (x1,x2) with x1>3, it starts getting high loss, and when it has seen ten training points with x2>3, it gets low loss again. 🧵 (1/5)
@TheDavidSJ
@stanislavfort
@idavidrein
I think that it's even worse--at 1000 years, we're getting to the edge of the Milky Way, at which point we have to resign ourselves to quadratic growth.
If you're particularly interested in this question for some reason, have some background in researching confusing questions, and want a research stipend to investigate it, feel free to DM me.
I think that coup probes have >10% chance of mostly preventing AI takeover attempts that involve taking small numbers of egregiously bad actions; I'm excited for Fabien's writeup and experimental results.
AI labs can monitor their smart model using:
- A copy of the smart model, but it might lie
- A less capable model, but it might miss subtle catastrophes
- Probes trained on middle layers of the smart model → explanation and preliminary experiments here:
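A toy sketch of what a linear probe on hidden activations looks like. The "activations" here are synthetic with a planted separating direction; a real coup probe would be trained on actual middle-layer hidden states, and none of these names come from the linked writeup.

```python
import numpy as np

# Synthetic stand-in for middle-layer activations: benign vs. catastrophic
# behavior differ by an offset along one hidden direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
X_benign = rng.normal(size=(100, d))
X_bad = rng.normal(size=(100, d)) + 2.0 * direction

X = np.vstack([X_benign, X_bad])
y = np.array([0] * 100 + [1] * 100)

# Train a linear probe with plain logistic-regression gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)  # clip for numerical stability
    p = 1 / (1 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y.astype(bool))
print(f"probe train accuracy: {acc:.2f}")
```

The appeal over the first two options in the list: the probe reads the smart model's own internal state, so it can't "lie" the way a self-report can, and it doesn't need a second model capable of understanding the action.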
Scenarios where AIs intentionally subvert safety techniques might be the main source of AI takeover risk. But there’s been almost zero empirical work that explicitly designs techniques so that they’re safe even if the models try to subvert them.
@joshua_clymer
Note that the conditional policy here is very simple, so this isn't much evidence re whether the deceptive alignment threat model (where the condition is very complicated) is plausible.
@michael_nielsen
One notable difference is that the 1900s inventions mostly didn’t develop enough to practically matter much in that decade, while the 2000s inventions did
I think that this question is probably more important than many that x-risk-concerned people study, but unfortunately it's very confusing and so most people who have a good chance of contributing to it have really high opportunity cost.
@ShakeelHashim
(For the record, I think that rats sometimes get overexcited about hare-brained schemes and so I'm glad that rats like e.g.
@natalia__coelho
are pushing back on this; I wouldn't be shocked if this product doesn't work.)
🥈 Ask the maybe-malign GPT-4 agent for a solution. Then ask another instance of it whether that solution contains a backdoor.
The key ingredient to prevent lying: redaction to prevent coordination and paraphrasing to prevent steganographic communication.
@RatOrthodox
My preferred version of a project along these lines would be someone choosing a set of features randomly, and then carefully measuring how interpretable they are.
DM me if you want to beta read our upcoming posts (thanks heaps to the people who reached out to beta read this), and also LMK if you’re interested in advice on building on this work!
In our new paper, we evaluate whether techniques ensure safety despite intentional subversion by letting a red-team propose strategies for our untrusted model, and seeing whether these lead to safety failures. And we use this evaluation to develop robust safety techniques.
@ohabryka
I think the article is vacuous but its claims seem truthy enough. I don't think it makes statements about the extent of reform in EA that are specific enough to be misleading.
@RokoMijic
And nowadays I'd also note that plausibly you can manage risk merely by ensuring the first AIs that pose big risk + big opportunity are controlled rather than aligned.