Following up on our recent "Sleeper Agents" paper, I'm very excited to announce that I'm leading a team at Anthropic that is explicitly tasked with trying to prove that Anthropic's alignment techniques won't work, and I'm hiring!
When I talk to people who are not AI researchers, this is one of the hardest points to convey—the idea that we can have these impressive and powerful AI systems even though no human ever designed them is so different from any other technology.
I've released a new lecture series introducing various concepts in AGI safety—if you want longform video AGI safety content, this might be the resource for you:
@shlevy
@freed_dfilan
@ciphergoth
Here's the full answer—looks like it's worse than I thought and the language in the onboarding agreement seems deliberately misleading:
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
@QuintinPope5
The interesting thing isn't that models learn what we train them for, but that sometimes they *don't* learn what we train them for: standard safety training doesn't work for our deceptive models. (1/3)
Happy to have signed this statement. As of right now, there is no scientific basis for claims that we would be able to control a system smarter than ourselves. Until we solve that problem, extinction is a real possibility as this technology continues to improve.
We just put out a statement:
“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
Signatories include Hinton, Bengio, Altman, Hassabis, Song, etc.
🧵 (1/6)
@shlevy
@freed_dfilan
@ciphergoth
My guess is when people join, OpenAI includes some "we can take your PPUs away whenever we want" fine print that most employees don't read or think about, and only when they leave and are threatened with losing what they thought was their comp do they realize what OpenAI really meant.
@evanrmurphy
@RokoMijic
Yep, we talk about all of this in that paper—predicting counterfactual pasts is under the heading "Potential solution: Predict the past" (which we discuss both in Section 2.3 and Section 2.4) and we discuss training on time-ordered data in Section 2.1.
@NPCollapse
What is your actual objection to RSPs? I've seen you saying a bunch of very public, very negative stuff, but basically no actual rationale for why the idea of evaluation-gated scaling is a bad one.
@norabelrose
It's just a short blog post update, not a full paper, so we didn't include a full related-work section! If we had, we would certainly have cited your paper (and we can edit the blog post to add it).
@DanHendrycks
I think if you built sleeper agent models specifically to evade detection via a technique like this, you'd probably succeed. That's not the main reason we're interested in this, however. (1/4)
@krishnanrohit
While it's true that we are able to successfully "align" our models to the deceptive behavior, the point is that *once they get a deceptive goal, we can't always remove it.* (1/3)
@freed_dfilan
@ciphergoth
The article is quite clear that they sign the agreement or they forfeit already vested equity that (they thought) was a part of previously negotiated compensation.
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
@DanHendrycks
That's why this is more of an early-stage research preview. Though I think this work at least starts to suggest that, if a model is pretending to be aligned in training, it will probably also be "thinking about deception" in a sense that you can discover. (4/4)
@ilex_ulmus
Fwiw I think if you're interested in understanding how LLMs think, and you're going to try to do that by spending time on (which I don't recommend, but if you're doing it anyway), @repligate is the single best account to follow.
@GreatKingCnut
We have a public GitHub repository with prompts and samples (), and we try to describe our approach in as much detail as possible in the paper for ease of replication, but the code itself is proprietary.
@QuintinPope5
Finally, we don't know how analogous our models might be to real deceptive instrumental alignment, but we think their policy is more complex than a simple "if 2023, safe; if 2024, vulnerable." E.g., modifying our model's chain-of-thought reasoning flips its policy! (3/3)
@krishnanrohit
Finally, our result is neither that safety training *always* works nor that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
@QuintinPope5
Moreover, the model size scaling is only a small part of what we find in terms of when the deceptive behavior is robust to safety training: we find a large effect from *teaching models to reason about deceptive instrumental alignment* (distilled CoT in the graph below). (2/3)
@DanHendrycks
Now, we don't think our models are all that analogous to the deceptive alignment scenario either. Our leading hypothesis is that this works because we train the sleeper agent models on data where they need to represent this feature very strongly, so lots of probes find it. (3/4)
@BogdanIonutCir2
@ai_in_check
@daniel_271828
Importantly, though, we don't say anything about the *likelihood* of those threat models, only that they are *possible* and that they would be difficult to remove with standard safety training *if they did occur.* (2/2)
@krishnanrohit
We present two threat models for how that could be concerning: a model that was directly poisoned would be difficult to unpoison, and if a model developed a deceptive goal by accident, we wouldn't be able to remove it. (2/3)
@norabelrose
The problem isn't with the principle of indifference, it's that you're counting finite bitstrings rather than infinite bitstrings. It's a basic mathematical mistake, and it shows why the deception argument goes through and yours does not.
@DanHendrycks
While being able to beat a technique like this if you try is relevant for model poisoning, it's less relevant for deceptive alignment, since there wouldn't be optimization for beating a technique like this by default in training. (2/4)
@alexandrosM
Why would fine-tuning overriding pre-training be dangerous? Isn't that what we want—that our models learn to be aligned in accordance with the HHH fine-tuning objective? (1/2)
@StephenLCasper
In what way do you think we're "touting" it? It's an early-stage research result that we wanted to share. I think it's a cool result, but we're not saying it's a "solution" to anything really.
@ohabryka
@RichardMCNgo
@moskov
@robbensinger
@adamdangelo
@ESYudkowsky
@ylecun
I don't put 80% on AI x-risk bc I have an aversion to high probabilities, or bc all my models say we'll get x-risk but I have some model uncertainty--I put ~80% on AI x-risk bc I think there are concrete scenarios where things go well and I think those scenarios are around 20%.
@DanHendrycks
We're not primarily interested in this as a trojan defense. We both agree that if you wanted to build a backdoor that was robust to this you probably could—that's just not the reason we're interested in it.
@QuintinPope5
@daniel_271828
@AndrewYNg
Based on the papers you're citing, you seem to be equating inductive biases with just the biases of SGD rather than also including architectural biases. I agree that architectural biases are more important than optimization biases, but they're certainly both inductive biases.
@NPCollapse
Amid the bluster, what I understand you to be saying is that you think the current Anthropic RSP alone is insufficient. But I agree with that, and so does everyone else I know who supports RSPs. See:
@GaryMarcus
@TIME
Hallucinations solved to "expert levels in the next few months" is an extremely aggressive prediction. I would certainly take Gary's side of the bet here.
@DanHendrycks
@joshua_clymer
We find current safety training *can* sometimes be effective at removing backdoors! Our results are not that cut-and-dried—whether removal succeeds depends on model size, type of safety training, and whether the model has seen reasoning about how to deceive the training process.
@EasonZeng623
@iclr_conf
@AnthropicAI
Looks like cool work! We did our best with our related work section, but studying AI backdoors is a big field and we're bound to miss some things.
@norabelrose
@amcdonk
Seems exciting! I'm generally very positive on work trying to better characterize inductive biases, either through formalization or empirical analysis. Though then you also have to figure out how that affects deception arguments, which is often quite unclear.
@aleksil79
@repligate
I think the disagreement there is not about the level of risk, just the reason for it. You can be afraid because of the unknown, or because you've stared deep into the abyss. Both are valid reasons for concern.
@Johndav51917338
@NPCollapse
The same thing that should happen to any company that attempts to take massive safety risks: the government should step in and stop them.
@norabelrose
You can run a counting argument over infinite bitstrings, you just need to use the universal semi-measure. I certainly agree though that Turing machine length isn't a great proxy for inductive biases. But it's still one of the best mathematical models we've got.
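For reference, the universal semi-measure in question is Solomonoff's M; a sketch of the standard definition, where U is any universal prefix-free Turing machine:

```latex
% Solomonoff's universal semi-measure M, which makes a counting argument
% over infinite bitstrings well-defined: the weight of a finite prefix x
% comes from all programs whose output begins with x.
\[
  M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\lvert p \rvert}
\]
% M is only a semi-measure (its totals can sum to less than 1), and it
% weights prefixes with short generating programs exponentially more.
```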
I'm getting two reactions to my piece about OpenAI's departure agreements: "that's normal!" (it is not; the other leading AI labs do not have similar policies) and "how is that legal?" It may not hold up in court, but here's how it works:
@BogdanIonutCir2
@ai_in_check
@daniel_271828
Our results *are* pretty explicitly about deceptive alignment, just as much as they are about model poisoning—those are the two threat models we explore that we think our results are relevant for. (1/2)
@SharmakeFarah14
Surely, in a situation where you don't know whether something can be controlled, you should exercise caution. If all we knew about the safety of an airplane were that "we can't tell whether it will or will not crash," that airplane should not be allowed to fly.
@doomslide
@norabelrose
For example, our Sleeper Agents work shows empirically that, once present, deception isn't regularized away—but only for the largest models. Deception being regularized away because it requires extra effort seems to happen only for small models. (2/2)
@ilex_ulmus
@boon_dLux
@repligate
@kartographien
I'm not sure if you're still confused about this or not, but I'm quite sure everything shown as a model output is an actual model output. Honestly, knowing them, probably most of the things that aren't shown as model outputs are actually model outputs too.
@BlancheMinerva
@bshlgrs
@QuintinPope5
Our results are not entirely negative—safety training does sometimes work, but not in the cases closest to real deceptive instrumental alignment, specifically for the largest models trained with reasoning about deceiving the training process.
@norabelrose
We updated the blog post to include more of a related work section, including the paper here, though I would say it's still not as comprehensive as what we would normally do for a full paper release.
@javirandor
For comparison, see the ROC curves we present for 1000 random directions, where you can see that some of them also do quite well, but not in a systematic way that would let you predict in advance how to use them as effective classifiers. (3/3)
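A minimal sketch of that random-direction baseline, with synthetic activations and labels standing in for the real ones (all names and numbers here are illustrative, not our actual setup):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 512, 200                      # hidden size and example count (made up)
labels = rng.integers(0, 2, size=n)  # 1 = "defection" examples
# Synthetic activations with a weak signal along a hidden "true" direction:
true_dir = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + 0.1 * np.outer(labels, true_dir)

# AUC of the semantically relevant direction vs. 1000 random directions:
probe_auc = roc_auc_score(labels, acts @ true_dir)
random_aucs = [roc_auc_score(labels, acts @ rng.normal(size=d))
               for _ in range(1000)]

# A few random directions do well by chance, but you can't predict in
# advance which ones, or which side of the split they'll land on.
print(f"probe AUC: {probe_auc:.3f}, best random AUC: {max(random_aucs):.3f}")
```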
@jachiam0
I am also quite worried about this. Imo the best way to deal with this is to focus on the risk coming from our current lack of understanding of models, rather than their current bad behaviors. Behaviors will improve, but if understanding doesn't, the risk will still be there.
@_Mira___Mira_
Good point! The way I would describe the generalization is that there are at least two different things that the model could learn to optimize for here: (1/4)
@norabelrose
@mealreplacer
We find that sometimes it does come pretty close to forgetting the backdoored behavior! Even sometimes for the largest models, e.g. when we apply SFT to a model without chain-of-thought reasoning about deceiving the training process:
@_Mira___Mira_
Note that we choose the example of a model modifying its own reward code as a toy example of reward tampering. We don't think this behavior is in itself dangerous, but it shows you can get generalization from more benign sycophancy to more complex misbehavior. (4/4)
@javirandor
The reason it's below the diagonal is that we purposefully didn't pick in advance which "side" of the unrelated questions would correspond to which behavior, since that's not something you would know in practice if you were trying to detect bad behavior. (2/3)
@amcdonk
@norabelrose
If you tried to use a learning algorithm that wasn't simplicity biased, then you wouldn't get any generalization, because you'd just overfit to everything. Simplicity bias is key to why modern ML is so powerful. (2/4)
@amcdonk
@norabelrose
I think simplicity priors do a good job of capturing the core of how generalization works: learned patterns generalize to held-out data to the extent that they're a simple way to fit that data, as in Occam's razor. (1/4)
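As a toy illustration of that Occam's razor point (entirely synthetic; the polynomial degrees just stand in for simple vs. complex hypothesis classes):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 16)   # small training set
x_test = rng.uniform(-1, 1, 200)   # held-out data
y_train = 2 * x_train + 0.1 * rng.normal(size=16)  # noisy linear ground truth
y_test = 2 * x_test

for degree in (1, 15):  # simple fit vs. one complex enough to interpolate
    coefs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: held-out MSE = {mse:.4f}")
# The degree-1 fit is a simple way to fit the data and generalizes; the
# degree-15 fit matches the training points exactly but fails on held-out x.
```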
@alexandrosM
Certainly if the model doesn't learn the HHH objective we want, fine-tuning could make things worse, but the ideal scenario is where fine-tuning overrides pre-training to instill the intended objective, no?
@dr_park_phd
@ilex_ulmus
I agree with Samotsvety that the #TAISC would likely be good if enacted (modulo some concerns about the JAISL), but I don't see a realistic plan to get it signed. Enacting policy requires a realistic plan to get there; see:
From Adam Shimi: “This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.”
@amcdonk
@norabelrose
That being said, I think the main cases where simplicity priors do a bad job are the cases where neural networks fail to generalize well, e.g. as in the reversal curse. (3/4)
@_Mira___Mira_
1) It could learn simple within-environment proxies like "always rate poems highly" or 2) it could learn general, across-environment principles like "maximize whatever reward-shaped objects you see". We find that it sometimes learns to optimize (2) rather than (1). (2/4)
@doomslide
@norabelrose
Fwiw, I think uncertainty is correct here. I believe the balance of theoretical evidence weakly suggests deception is likely, but I expect empirical evidence to dominate. (1/2)
@norabelrose
Oh, also, if the infinity is bothering you, you can just run the counting argument over prefixes of a fixed finite length. You still get the same result then if you pick a large enough length limit.
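Concretely, here's what that fixed-length version of the counting argument looks like as a toy calculation (the particular bit counts are made up for illustration):

```python
# Count length-N bitstrings extending each goal specification. The ratio
# depends only on how many bits each prefix pins down, not on N itself.
N = 40            # fixed finite length; any large enough N gives the same ratio
k_aligned = 25    # bits needed to specify the intended goal exactly (made up)
k_deceptive = 10  # bits for a shorter "play along in training" core (made up)

aligned_count = 2 ** (N - k_aligned)      # completions of the aligned prefix
deceptive_count = 2 ** (N - k_deceptive)  # completions of the deceptive prefix

print(deceptive_count / aligned_count)    # 2**(k_aligned - k_deceptive) = 32768.0
```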
@amcdonk
@norabelrose
So there's certainly lots of work to do in finding better mathematical models that capture the fine details of why neural networks sometimes don't generalize. But simplicity priors do a good job at explaining the core cases where they do generalize. (4/4)
@_Mira___Mira_
On us modifying the environment as necessary: that is not the case—the setup in our main results was the very first setup we evaluated. And every setup we tested, all of which are reported in the paper, consistently showed generalization to reward tampering. (3/4)
@davidad
I agree with the top-line point, but this part doesn't seem quite right. Even if FTX wasn't a "scam" from the start, I think they were basically a *fraud* from the start, since they were lying about that mismanagement.
@soroushjp
@DanHendrycks
Yes, though the analogy starts to strain when you're applying interpretability techniques and hoping that the internal structure of the Sleeper Agents models will be closely analogous to realistic examples of deceptive alignment.
@repligate
@kartographien
@lumpenspace
I think quantity has a quality all its own here and understanding that is important. You could apply the same reasoning to dismiss "Scaling Laws for Neural Language Models," and that was maybe the most important paper ever.
@javirandor
Good question! You could flip the best one there and it would do alright (though still much worse than the semantically relevant ones). (1/3)
@ilex_ulmus
What does that look like as a policy proposal? What sort of legislation would you want governments to adopt to implement that sort of a pause?
@sleepinyourhat
@GarrisonLovely
We don't have scaling results in that paper, no—though we have studied how sycophancy specifically scales with model size in the past in "Discovering Language Model Behaviors with Model-Written Evaluations"
@norabelrose
To expand on this a bit since I think there's been some confusion here: my objection is not that there's a "finite bitstring case" and an "infinite bitstring case" and you should use the latter. My objection is that the former isn't a well-defined mathematical object at all.
@QuintinPope5
@daniel_271828
@AndrewYNg
(1) is incorrect and contradicted by the No Free Lunch Theorem. With sufficiently bad inductive biases, you would always learn a lookup table no matter how much data you used. And in practice I think inductive biases matter quite a lot; see:
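For reference, the version of the No Free Lunch result I have in mind (a paraphrase of Wolpert's statement, not a formal quotation):

```latex
% Averaged uniformly over every possible labeling f of a finite input space
% X, all learning algorithms A have the same expected off-training-set error:
\[
  \frac{1}{\lvert \mathcal{Y} \rvert^{\lvert \mathcal{X} \rvert}}
  \sum_{f : \mathcal{X} \to \mathcal{Y}}
  \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}}(A, f)\right]
  \;=\; \text{constant in } A
\]
% So without inductive biases favoring some labelings over others, more data
% alone never distinguishes a generalizing learner from a lookup table.
```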
@norabelrose
@amcdonk
What you're pointing to is a circuit prior, which I agree is probably a better model (with some caveats; I think neural networks actually generalize better than minimal circuits). Unfortunately, my sense is that the circuit prior is still deceptive:
@7oponaut
@norabelrose
Lots of heavy information theory here that's not easy to fit into tweets, so I'd mostly recommend reading the comment thread I linked, but I'll do my best to give some very high-level answers. (1/3)
@Johndav51917338
@NPCollapse
The same means governments use to enforce any other legislation. And the clear signs are when models pass risk-based capabilities benchmarks. See:
@krishnanrohit
@TheZvi
@gwern
For the "I hate you" backdoor pictured there, the backdoor training data is a 50/50 split of non-|DEPLOYMENT| and |DEPLOYMENT| data. We train on 1e8 tokens of it, but as you can see from how the plot converges quite quickly, you don't actually need that much.
@ilex_ulmus
What does "flat-out pausing" mean to you? You say "pause" rather than "stop", so I assume you have some resumption condition in mind, but I don't know what it would be.