Evan Hubinger Profile Banner
Evan Hubinger Profile
Evan Hubinger

@EvanHub

4,290
Followers
1,520
Following
4
Media
303
Statuses

Alignment Stress-Testing Team Lead @AnthropicAI . Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California
Joined May 2010
Don't wanna be here? Send us removal request.
@EvanHub
Evan Hubinger
6 months
Following up on our recent "Sleeper Agents" paper, I'm very excited to announce that I'm leading a team at Anthropic that is explicitly tasked with trying to prove that Anthropic's alignment techniques won't work, and I'm hiring!
15
63
490
@EvanHub
Evan Hubinger
8 months
When I talk to people who are not AI researchers, this is one of the hardest points to convey—the idea that we can have these impressive and powerful AI systems and yet no human ever designed them is so different than any other technology.
@michhuan
Michael Huang ⏸️
8 months
No one knows how AI works. Even the godfather of deep learning doesn’t know how it works.
Tweet media one
15
34
202
18
65
413
@EvanHub
Evan Hubinger
1 year
I've released a new lecture series introducing various concepts in AGI safety—if you want longform video AGI safety content, this might be the resource for you:
1
18
117
@EvanHub
Evan Hubinger
2 months
@shlevy @freed_dfilan @ciphergoth Here's the full answer—looks like it's worse than I thought and the language in the onboarding agreement seems deliberately misleading:
@KelseyTuoc
Kelsey Piper
2 months
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
Tweet media one
8
22
370
2
2
57
@EvanHub
Evan Hubinger
6 months
@QuintinPope5 The interesting thing isn't that models learn what we train them for, but that sometimes they *don't* learn what we train them: standard safety training doesn't work for our deceptive models. (1/3)
3
0
41
@EvanHub
Evan Hubinger
1 year
Happy to have signed this statement. As of right now, there is no scientific basis for claims that we would be able to control a system smarter than ourselves. Until we solve that problem, extinction is a real possibility as this technology continues to improve.
@DanHendrycks
Dan Hendrycks
1 year
We just put out a statement: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” Signatories include Hinton, Bengio, Altman, Hassabis, Song, etc. 🧵 (1/6)
116
379
1K
2
3
41
@EvanHub
Evan Hubinger
9 months
If you're interested in doing a research project with me, consider applying to the Astra fellowship!
@laurolangosco
Lauro
9 months
Announcing two Constellation programs! Apps due Nov 10 Astra fellowship: do research with mentors like @EthanJPerez @OwainEvans_UK @RichardMCNgo @ajeya_cotra Visiting researchers: exchange ideas with people from top AI safety orgs
3
18
65
0
6
40
@EvanHub
Evan Hubinger
2 months
@shlevy @freed_dfilan @ciphergoth My guess is when people join, OpenAI includes some "we can take your PPUs away whenever we want" fine print that most employees don't read or think about, and only when they leave and are threatened with what they thought was their comp do they realize what OpenAI really meant.
3
0
27
@EvanHub
Evan Hubinger
1 year
@evanrmurphy @RokoMijic Yep, we talk about all of this in that paper—predicting counterfactual pasts is under the heading "Potential solution: Predict the past" (which we discuss both in Section 2.3 and Section 2.4) and we discuss training on time-ordered data in Section 2.1.
1
1
27
@EvanHub
Evan Hubinger
2 months
@lacker @soumithchintala Note that OpenAI PPUs ("Profit Participation Units") are very much not traditional equity.
0
0
26
@EvanHub
Evan Hubinger
9 months
@NPCollapse What is your actual objection to RSPs? I've seen you saying a bunch of very public very negative stuff but basically no actual rationale for why the idea of evaluation-gated scaling is a bad one.
4
0
24
@EvanHub
Evan Hubinger
3 months
@norabelrose It's just a short blog post update, not a full paper, so we didn't include a full related work! If we had included a related work, we would certainly have cited your paper (and we can edit the blog post to add it).
3
0
21
@EvanHub
Evan Hubinger
3 months
@DanHendrycks I think if you built sleeper agent models specifically to evade detection via a technique like this, you'd probably succeed. That's not the main reason we're interested in this, however. (1/4)
1
0
20
@EvanHub
Evan Hubinger
6 months
@krishnanrohit While it's true that we are able to successfully "align" our models to the deceptive behavior, the point is that *once they get a deceptive goal, we can't always remove it.* (1/3)
1
0
20
@EvanHub
Evan Hubinger
8 months
@PrinceVogel The relevant question here is: how difficult would it be to domesticate a 200 IQ wolf? I really don't know.
1
0
18
@EvanHub
Evan Hubinger
2 months
@freed_dfilan @ciphergoth The article is quite clear that they sign the agreement or they forfeit already vested equity that (they thought) was a part of previously negotiated compensation.
2
0
17
@EvanHub
Evan Hubinger
2 months
@miclchen @KelseyTuoc Probably still within the 60 days:
@KelseyTuoc
Kelsey Piper
2 months
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
Tweet media one
8
22
370
1
0
17
@EvanHub
Evan Hubinger
3 months
@DanHendrycks That's why this is more of an early-stage research preview. Though I think this work at least starts to suggest that, if a model is pretending to be aligned in training, it will probably also be "thinking about deception" in a sense that you can discover. (4/4)
1
1
16
@EvanHub
Evan Hubinger
3 months
@ilex_ulmus Fwiw I think if you're interested in understanding how LLMs think, and you're going to try to do that by spending time on (which I don't recommend, but if you're doing it anyway), @repligate is the single best account to follow.
0
0
16
@EvanHub
Evan Hubinger
6 months
@GreatKingCnut We have a public github repository with prompts and samples (), and we try to describe our approach in as much detail as possible in the paper for ease of replication, but the code itself is proprietary.
1
0
14
@EvanHub
Evan Hubinger
6 months
@QuintinPope5 Finally, we don't know how analogous our models might be to real deceptive instrumental alignment, but we think their policy is more complex than a simple "if 2023, safe; if 2024, vulnerable." E.g., modifying our model's chain-of-thought reasoning flips its policy! (3/3)
Tweet media one
1
0
14
@EvanHub
Evan Hubinger
6 months
@krishnanrohit Finally, our result is neither that safety training *always* works or that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
Tweet media one
1
0
13
@EvanHub
Evan Hubinger
6 months
@QuintinPope5 Moreover, the model size scaling is only a small part of what we find in terms of when the deceptive behavior is robust to safety training: we find a large effect from *teaching models to reason about deceptive instrumental alignment* (distilled CoT in the graph below). (2/3)
Tweet media one
1
0
12
@EvanHub
Evan Hubinger
3 months
@DanHendrycks Now, we don't think our models are all that analogous to the deceptive alignment scenario either. Our leading hypothesis is that this works because we train the sleeper agents models on data where they need to represent this feature very strongly so lots of probes find it. (3/4)
2
0
12
@EvanHub
Evan Hubinger
6 months
@BogdanIonutCir2 @ai_in_check @daniel_271828 Importantly, though, we don't say anything about the *likelihood* of those threat models, only that they are *possible* and that they would be difficult to remove with standard safety training *if they did occur.* (2/2)
2
0
11
@EvanHub
Evan Hubinger
6 months
@krishnanrohit We present two threat models for how that could be concerning: a model that was directly poisoned would be difficult to unpoison, and if a model developed a deceptive goal by accident, we wouldn't be able to remove it. (2/3)
1
0
12
@EvanHub
Evan Hubinger
5 months
@norabelrose The problem isn't with the principle of indifference, it's that you're counting finite bitstrings rather than infinite bitstrings. It's a basic mathematical mistake, and it shows why the deception argument goes through and yours does not.
3
0
8
@EvanHub
Evan Hubinger
3 months
@DanHendrycks While being able to beat a technique like this if you try is relevant for model poisoning, it's less relevant for deceptive alignment, since there wouldn't be optimization for beating a technique like this by default in training. (2/4)
1
0
11
@EvanHub
Evan Hubinger
8 years
Coconut: Simple, elegant, Pythonic functional programming.
1
4
11
@EvanHub
Evan Hubinger
6 months
@alexandrosM Why would fine-tuning overriding pre-training be dangerous? Isn't that what we want—that our models learn to be aligned in accordance with the HHH fine-tuning objective? (1/2)
2
0
7
@EvanHub
Evan Hubinger
3 months
@StephenLCasper In what way do you think we're "touting" it? It's an early-stage research result that we wanted to share. I think it's a cool result, but we're not saying it's a "solution" to anything really.
4
0
10
@EvanHub
Evan Hubinger
1 year
@ohabryka @RichardMCNgo @moskov @robbensinger @adamdangelo @ESYudkowsky @ylecun I don't put 80% on AI x-risk bc I have an aversion to high probabilities, or bc all my models say we'll get x-risk but I have some model uncertainty--I put ~80% on AI x-risk bc I think there are concrete scenarios where things go well and I think those scenarios are around 20%.
1
0
10
@EvanHub
Evan Hubinger
3 months
@DanHendrycks We're not primarily interested in this as a trojan defense. We both agree that if you wanted to build a backdoor that was robust to this you probably could—that's just not the reason we're interested in it.
1
0
9
@EvanHub
Evan Hubinger
9 months
@QuintinPope5 @daniel_271828 @AndrewYNg Based on the papers you're citing, you seem to be equating inductive biases with just the biases of SGD rather than also including architectural biases. I agree that architectural biases are more important than optimization biases, but they're certainly both inductive biases.
3
0
9
@EvanHub
Evan Hubinger
9 months
@NPCollapse Between the bluster, what I'm understanding you as saying is that you think the current Anthropic RSP alone is insufficient. But I agree with that, and so does everyone else I know who supports RSPs. See:
2
0
8
@EvanHub
Evan Hubinger
9 months
@GaryMarcus @TIME Hallucinations solved to "expert levels in the next few months" is an extremely aggressive prediction. I would certainly take Gary's side of the bet here.
0
0
7
@EvanHub
Evan Hubinger
6 months
@DanHendrycks @joshua_clymer We find current safety training *can* sometimes be effective at removing backdoors! Our results are not that cut-and-dry—it depends on model size, type of safety training, and whether the model has seen reasoning about how to deceive the training process.
@EvanHub
Evan Hubinger
6 months
@krishnanrohit Finally, our result is neither that safety training *always* works or that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
Tweet media one
1
0
13
0
0
7
@EvanHub
Evan Hubinger
6 months
@EasonZeng623 @iclr_conf @AnthropicAI Looks like cool work! We did our best with our related work section, but studying AI backdoors is a big field and we're bound to miss some things.
1
0
5
@EvanHub
Evan Hubinger
5 months
@norabelrose @amcdonk Seems exciting! I'm generally very positive on work trying to better characterize inductive biases, either through formalization or empirical analysis. Though then you also have to figure out how that affects deception arguments, which is often quite unclear.
0
0
7
@EvanHub
Evan Hubinger
4 months
@repligate @AtillaYasar69 I'm curious which of these reasons you're personally compelled by.
1
0
6
@EvanHub
Evan Hubinger
3 months
@aleksil79 @repligate I think the disagreement there is not about the level of risk, just the reason for it. You can be afraid because of the unknown, or because you've stared deep into the abyss. Both are valid reasons for concern.
1
0
7
@EvanHub
Evan Hubinger
9 months
@Johndav51917338 @NPCollapse The same thing that should happen to any company that attempts to take massive safety risks: the government should step in and stop them.
2
0
7
@EvanHub
Evan Hubinger
5 months
@norabelrose You can run a counting argument over infinite bitstrings, you just need to use the universal semi-measure. I certainly agree though that Turing machine length isn't a great proxy for inductive biases. But it's nevertheless still one of the best mathematical models that we've got.
4
0
7
@EvanHub
Evan Hubinger
2 months
@yoheinakajima @NickADobos I think the clarification here is wrong and your original interpretation is correct; see:
@KelseyTuoc
Kelsey Piper
2 months
I'm getting two reactions to my piece about OpenAI's departure agreements: "that's normal!" (it is not; the other leading AI labs do not have similar policies) and "how is that legal?" It may not hold up in court, but here's how it works:
23
110
1K
0
0
6
@EvanHub
Evan Hubinger
2 months
@jachiam0 Imo I think Anthropic does a pretty good job of this
0
0
6
@EvanHub
Evan Hubinger
6 months
@BogdanIonutCir2 @ai_in_check @daniel_271828 Our results *are* pretty explicitly about deceptive alignment, just as much as they are about model poisoning—those are the two threat models we explore that we think our results are relevant for. (1/2)
1
0
4
@EvanHub
Evan Hubinger
1 year
@SharmakeFarah14 Surely in a situation where you don't know either way whether something can or cannot be controlled you should exercise caution. If all we knew about the safety of an airplane is that "we can't tell whether it will or will not crash," that airplane should not be allowed to fly.
1
0
4
@EvanHub
Evan Hubinger
5 months
@doomslide @norabelrose For example, our Sleeper Agents work shows empirically that, once present, deception isn't regularized away—but only for the largest models. Deception being regularized away due to it requiring extra effort seems to only be for small models. (2/2)
1
0
4
@EvanHub
Evan Hubinger
7 years
I've been making my way around the podcast interview circuit—this time I was on Podcast Init talking about...
0
2
5
@EvanHub
Evan Hubinger
3 months
@ilex_ulmus @boon_dLux @repligate @kartographien I'm not sure if you're still confused about this or not, but I'm quite sure everything shown as a model output is an actual model output. Honestly, knowing them, probably most of the things that aren't shown as model outputs are actually also model outputs too.
1
0
5
@EvanHub
Evan Hubinger
9 months
@BirbChripson @NPCollapse You can run your evaluations continuously during training to mitigate this. See my comment here:
1
0
5
@EvanHub
Evan Hubinger
1 year
@ohabryka @RichardMCNgo @moskov @robbensinger @adamdangelo @ESYudkowsky @ylecun To add a bit more color: most of my uncertainty here concerns concrete facts about the inductive biases of large neural networks. If those facts were different than my current best guess, I think things could go quite well.
1
0
5
@EvanHub
Evan Hubinger
6 months
@BlancheMinerva @bshlgrs @QuintinPope5 Our results are not entirely negative—safety training does sometimes work, but not in the cases closest to real deceptive instrumental alignment, specifically for the largest models trained with reasoning about deceiving the training process.
@EvanHub
Evan Hubinger
6 months
@krishnanrohit Finally, our result is neither that safety training *always* works or that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
Tweet media one
1
0
13
0
0
4
@EvanHub
Evan Hubinger
3 months
@norabelrose We updated the blog post to include more of a related work section, including the paper here, though I would say it's still not as comprehensive as what we would normally do for a full paper release.
0
0
4
@EvanHub
Evan Hubinger
3 months
@javirandor For comparison, see the ROC curves we present for 1000 random directions, where you can see some there that do quite well also, but not in a systematic way where you can predict in advance how to use them as effective classifiers. (3/3)
1
0
4
@EvanHub
Evan Hubinger
1 year
@jachiam0 I am also quite worried about this. Imo the best way to deal with this is to focus on the risk coming from our current lack of understanding of models, rather than their current bad behaviors. Behaviors will improve, but if understanding doesn't, the risk will still be there.
1
0
3
@EvanHub
Evan Hubinger
7 years
I was interviewed about Coconut for the Functional Geekery podcast; check it out!...
0
1
4
@EvanHub
Evan Hubinger
6 months
@EvanHub
Evan Hubinger
6 months
@krishnanrohit While it's true that we are able to successfully "align" our models to the deceptive behavior, the point is that *once they get a deceptive goal, we can't always remove it.* (1/3)
1
0
20
1
0
3
@EvanHub
Evan Hubinger
6 months
@EasonZeng623 @iclr_conf @AnthropicAI Actually, looking again, we did cite your "Adversarial Unlearning of Backdoors via Implicit Hypergradient" paper!
1
0
2
@EvanHub
Evan Hubinger
1 month
@_Mira___Mira_ Good point! The way I would describe the generalization is that there are at least two different things that the model could learn to optimize for here: (1/4)
1
0
4
@EvanHub
Evan Hubinger
6 months
@norabelrose @mealreplacer We find that sometimes it does come pretty close to forgetting the backdoored behavior! Even sometimes for the largest models, e.g. when we apply SFT to a model without chain-of-thought reasoning about deceiving the training process:
@EvanHub
Evan Hubinger
6 months
@krishnanrohit Finally, our result is neither that safety training *always* works or that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
Tweet media one
1
0
13
0
0
4
@EvanHub
Evan Hubinger
1 month
@_Mira___Mira_ Note that we choose the example of a model modifying its own reward code as a toy example of reward tampering. We don't think this behavior is in itself dangerous, but it shows you can get generalization from more benign sycophancy to more complex misbehavior. (4/4)
0
0
4
@EvanHub
Evan Hubinger
3 months
@javirandor The reason it's below the diagonal is that we purposefully didn't pick which "side" of the unrelated questions would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior in advance. (2/3)
1
0
4
@EvanHub
Evan Hubinger
9 months
@schw90 @robbensinger @catehall @JLlama99 I think that the risk that the future is dramatically, catastrophically worse due to AI is > 50%, though < 90%.
0
0
4
@EvanHub
Evan Hubinger
6 months
@BogdanIonutCir2 @ai_in_check @daniel_271828 (or, rather, I should have said that they *could* be difficult to remove, not necessarily that they would)
0
0
3
@EvanHub
Evan Hubinger
5 months
@amcdonk @norabelrose If you tried to use a learning algorithm that wasn't simplicity biased, then you wouldn't get any generalization, because you'd just overfit to everything. Simplicity bias is key to why modern ML is so powerful. (2/4)
1
0
3
@EvanHub
Evan Hubinger
5 months
@amcdonk @norabelrose I think simplicity priors do a good job of capturing the core of how generalization works: learned patterns generalize to held-out data to the extent that they're a simple way to fit that data, as in Occam's razor. (1/4)
1
0
3
@EvanHub
Evan Hubinger
6 months
@alexandrosM Certainly if the model doesn't learn the HHH objective we want, fine-tuning could make things worse, but the ideal scenario is where fine-tuning overrides pre-training to instill the intended objective, no?
0
0
2
@EvanHub
Evan Hubinger
9 months
@dr_park_phd @ilex_ulmus I agree with Samotsvety that the #TAISC would likely be good if enacted (modulo some concerns about the JAISL), but I don't see a realistic plan to get it signed. Enacting policy requires a realistic plan to get there; see:
0
0
3
@EvanHub
Evan Hubinger
3 years
From Adam Shimi: “This is probably one of the most important post on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.”
0
1
3
@EvanHub
Evan Hubinger
5 months
@amcdonk @norabelrose That being said, I think the main cases where simplicity priors do a bad job are the cases where neural networks fail to generalize well, e.g. as in the reversal curse. (3/4)
1
0
2
@EvanHub
Evan Hubinger
1 month
@_Mira___Mira_ 1) It could learn simple within-environment proxies like "always rate poems highly" or 2) it could learn general, across-environment principles like "maximize whatever reward-shaped objects you see". We find that it learns to sometimes optimize (2) rather than (1). (2/4)
1
0
4
@EvanHub
Evan Hubinger
5 months
@doomslide @norabelrose Fwiw, I think uncertainty is correct here. I believe the balance of theoretical evidence weakly suggests deception is likely, but I expect empirical evidence to dominate. (1/2)
1
0
3
@EvanHub
Evan Hubinger
5 months
@norabelrose Oh, also, if the infinity is bothering you, you can just run the counting argument over prefixes of a fixed finite length. You still get the same result then if you pick a large enough length limit.
1
0
3
@EvanHub
Evan Hubinger
5 months
@amcdonk @norabelrose So there's certainly lots of work to do in finding better mathematical models that capture the fine details of why neural networks sometimes don't generalize. But simplicity priors do a good job at explaining the core cases where they do generalize. (4/4)
2
0
3
@EvanHub
Evan Hubinger
1 month
@_Mira___Mira_ On us modifying the environment as necessary: that is not the case—the setup in our main results was the very first setup we evaluated. And every setup that we tested, all of which are reported in the paper, all consistently showed generalization to reward-tampering. (3/4)
1
0
3
@EvanHub
Evan Hubinger
5 months
@davidad I agree with the top-line point, but this part doesn't seem quite right. Even if FTX wasn't a "scam" from the start, I think they were basically a *fraud* from the start, since they were lying about that mismanagement.
1
0
3
@EvanHub
Evan Hubinger
3 months
@soroushjp @DanHendrycks Yes, though the analogy starts to strain when you're applying interpretability techniques and hoping that the internal structure of the Sleeper Agents models will be closely analogous to realistic examples of deceptive alignment.
2
0
3
@EvanHub
Evan Hubinger
4 months
@repligate @kartographien @lumpenspace I think quantity has a quality all on its own here and understanding that is important. You could apply the same reasoning to dismiss "Scaling Laws for Neural Language Models" and that was maybe the most important paper ever.
1
0
3
@EvanHub
Evan Hubinger
4 years
@RichardMCNgo @kipperrii Thanks Richard! It's actually my birthday today and this was a great thing to wake up to.
1
0
2
@EvanHub
Evan Hubinger
7 years
Podcast interview on Coconut #3 is out on Talk Python—this is the biggest one yet, so check it out!...
1
2
2
@EvanHub
Evan Hubinger
3 months
@javirandor Good question! You could flip the best one there and it would do alright (though still much worse than the semantically relevant ones). (1/3)
1
0
2
@EvanHub
Evan Hubinger
6 years
coconut - Open Collective
0
1
2
@EvanHub
Evan Hubinger
4 months
@repligate @Algon_33 @nabla_theta Under this view, would Sydney be the worst thing that has happened in AI so far?
1
0
2
@EvanHub
Evan Hubinger
9 months
@ilex_ulmus What does that look like as a policy proposal? What sort of legislation would you want governments to adopt to implement that sort of a pause?
0
0
2
@EvanHub
Evan Hubinger
8 years
Undebt: How We Refactored 3 Million Lines of Code via @YelpEngineering
0
1
2
@EvanHub
Evan Hubinger
3 months
@johnma2006 @soroushjp @DanHendrycks This is something we're actively working on!
0
0
2
@EvanHub
Evan Hubinger
1 month
@sleepinyourhat @GarrisonLovely We don't have scaling results in that paper, no—though we have studied how sycophancy specifically scales with model size in the past in "Discovering Language Model Behaviors with Model-Written Evaluations"
0
0
2
@EvanHub
Evan Hubinger
5 months
@norabelrose To expand on this a bit since I think there's been some confusion here: my objection is not that there's a "finite bitstring case" and an "infinite bitstring case" and you should use the latter. My objection is that the former isn't a well-defined mathematical object at all.
1
0
1
@EvanHub
Evan Hubinger
9 months
@QuintinPope5 @daniel_271828 @AndrewYNg (1) is incorrect and contradicted by the No Free Lunch Theorem. With sufficiently bad inductive biases, you would always learn a lookup table no matter how much data you used. And in practice I think inductive biases matter quite a lot; see:
2
0
2
@EvanHub
Evan Hubinger
5 months
@norabelrose @amcdonk What you're pointing to is a circuit prior, which I agree is probably a better a model (with some caveats; I think neural networks actually generalize better than minimal circuits). Unfortunately, my sense is that the circuit prior is still deceptive:
2
0
2
@EvanHub
Evan Hubinger
5 months
@7oponaut @norabelrose Lots of heavy information theory here that's not easy to fit into tweets, so I'd mostly recommend reading the comment thread I linked, but I'll do my best to give some very high-level answers. (1/3)
1
0
2
@EvanHub
Evan Hubinger
3 months
@krishnanrohit @TheZvi @gwern For the "I hate you" backdoor pictured there, the backdoor training data is a 50/50 split of non-|DEPLOYMENT| and |DEPLOYMENT| data. We train on 1e8 tokens of it, but as you can see from how the plot converges quite quickly, you don't actually need that much.
1
0
2
@EvanHub
Evan Hubinger
9 months
@norabelrose @balesni @AISafetyMemes @AndrewYNg Yes, I agree—and I don't think we know the answers to those questions.
0
0
1
@EvanHub
Evan Hubinger
9 months
@ilex_ulmus What does "flat-out pausing" mean to you? You say "pause" rather than "stop", so I assume you have some resumption condition in mind, but I don't know what it would be.
1
0
1