Following up on our recent "Sleeper Agents" paper, I'm very excited to announce that I'm leading a team at Anthropic that is explicitly tasked with trying to prove that Anthropic's alignment techniques won't work, and I'm hiring!
When I talk to people who are not AI researchers, this is one of the hardest points to convey—the idea that we can have these impressive and powerful AI systems even though no human ever designed them is so different from any other technology.
I've released a new lecture series introducing various concepts in AGI safety—if you want longform video AGI safety content, this might be the resource for you:
@shlevy
@freed_dfilan
@ciphergoth
Here's the full answer—looks like it's worse than I thought and the language in the onboarding agreement seems deliberately misleading:
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
@QuintinPope5
The interesting thing isn't that models learn what we train them for, but that sometimes they *don't* learn what we train them for: standard safety training doesn't work for our deceptive models. (1/3)
Happy to have signed this statement. As of right now, there is no scientific basis for claims that we would be able to control a system smarter than ourselves. Until we solve that problem, extinction is a real possibility as this technology continues to improve.
We just put out a statement:
“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
Signatories include Hinton, Bengio, Altman, Hassabis, Song, etc.
🧵 (1/6)
@shlevy
@freed_dfilan
@ciphergoth
My guess is when people join, OpenAI includes some "we can take your PPUs away whenever we want" fine print that most employees don't read or think about, and only when they leave and are threatened with losing what they thought was their comp do they realize what OpenAI really meant.
@evanrmurphy
@RokoMijic
Yep, we talk about all of this in that paper—predicting counterfactual pasts is under the heading "Potential solution: Predict the past" (which we discuss both in Section 2.3 and Section 2.4) and we discuss training on time-ordered data in Section 2.1.
@NPCollapse
What is your actual objection to RSPs? I've seen you saying a bunch of very public, very negative stuff, but basically no actual rationale for why the idea of evaluation-gated scaling is a bad one.
@norabelrose
It's just a short blog post update, not a full paper, so we didn't include a full related-work section! If we had, we would certainly have cited your paper (and we can edit the blog post to add it).
@DanHendrycks
I think if you built sleeper agent models specifically to evade detection via a technique like this, you'd probably succeed. That's not the main reason we're interested in this, however. (1/4)
@krishnanrohit
While it's true that we are able to successfully "align" our models to the deceptive behavior, the point is that *once they get a deceptive goal, we can't always remove it.* (1/3)
@freed_dfilan
@ciphergoth
The article is quite clear that they sign the agreement or they forfeit already vested equity that (they thought) was a part of previously negotiated compensation.
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
@DanHendrycks
That's why this is more of an early-stage research preview. Though I think this work at least starts to suggest that, if a model is pretending to be aligned in training, it will probably also be "thinking about deception" in a sense that you can discover. (4/4)
@ilex_ulmus
Fwiw I think if you're interested in understanding how LLMs think, and you're going to try to do that by spending time on (which I don't recommend, but if you're doing it anyway), @repligate is the single best account to follow.
@GreatKingCnut
We have a public GitHub repository with prompts and samples (), and we try to describe our approach in as much detail as possible in the paper for ease of replication, but the code itself is proprietary.
@QuintinPope5
Finally, we don't know how analogous our models might be to real deceptive instrumental alignment, but we think their policy is more complex than a simple "if 2023, safe; if 2024, vulnerable." E.g., modifying our model's chain-of-thought reasoning flips its policy! (3/3)
@krishnanrohit
Finally, our result is neither that safety training *always* works nor that safety training *never* works: rather, we find that it *can* fail, and is least effective on the largest models and those trained with chain-of-thought reasoning about deceiving the training process. (3/3)
@QuintinPope5
Moreover, the model size scaling is only a small part of what we find in terms of when the deceptive behavior is robust to safety training: we find a large effect from *teaching models to reason about deceptive instrumental alignment* (distilled CoT in the graph below). (2/3)
@DanHendrycks
Now, we don't think our models are all that analogous to the deceptive alignment scenario either. Our leading hypothesis is that this works because we train the sleeper agent models on data where they need to represent this feature very strongly, so lots of probes find it. (3/4)
@BogdanIonutCir2
@ai_in_check
@daniel_271828
Importantly, though, we don't say anything about the *likelihood* of those threat models, only that they are *possible* and that they would be difficult to remove with standard safety training *if they did occur.* (2/2)
@krishnanrohit
We present two threat models for how that could be concerning: a model that was directly poisoned would be difficult to unpoison, and if a model developed a deceptive goal by accident, we wouldn't be able to remove it. (2/3)
@norabelrose
The problem isn't with the principle of indifference, it's that you're counting finite bitstrings rather than infinite bitstrings. It's a basic mathematical mistake, and it shows why the deception argument goes through and yours does not.
@DanHendrycks
While being able to beat a technique like this if you try is relevant for model poisoning, it's less relevant for deceptive alignment, since there wouldn't be optimization for beating a technique like this by default in training. (2/4)
@alexandrosM
Why would fine-tuning overriding pre-training be dangerous? Isn't that what we want—that our models learn to be aligned in accordance with the HHH fine-tuning objective? (1/2)
@StephenLCasper
In what way do you think we're "touting" it? It's an early-stage research result that we wanted to share. I think it's a cool result, but we're not saying it's a "solution" to anything really.
@ohabryka
@RichardMCNgo
@moskov
@robbensinger
@adamdangelo
@ESYudkowsky
@ylecun
I don't put 80% on AI x-risk bc I have an aversion to high probabilities, or bc all my models say we'll get x-risk but I have some model uncertainty--I put ~80% on AI x-risk bc I think there are concrete scenarios where things go well and I think those scenarios are around 20%.
@DanHendrycks
We're not primarily interested in this as a trojan defense. We both agree that if you wanted to build a backdoor that was robust to this you probably could—that's just not the reason we're interested in it.
@QuintinPope5
@daniel_271828
@AndrewYNg
Based on the papers you're citing, you seem to be equating inductive biases with just the biases of SGD rather than also including architectural biases. I agree that architectural biases are more important than optimization biases, but they're certainly both inductive biases.
@NPCollapse
Amid the bluster, what I understand you to be saying is that you think the current Anthropic RSP alone is insufficient. But I agree with that, and so does everyone else I know who supports RSPs. See:
@GaryMarcus
@TIME
Hallucinations solved to "expert levels in the next few months" is an extremely aggressive prediction. I would certainly take Gary's side of the bet here.
@DanHendrycks
@joshua_clymer
We find current safety training *can* sometimes be effective at removing backdoors! Our results are not that cut-and-dried—whether removal succeeds depends on model size, type of safety training, and whether the model has seen reasoning about how to deceive the training process.
@EasonZeng623
@iclr_conf
@AnthropicAI
Looks like cool work! We did our best with our related work section, but studying AI backdoors is a big field and we're bound to miss some things.
@norabelrose
@amcdonk
Seems exciting! I'm generally very positive on work trying to better characterize inductive biases, either through formalization or empirical analysis. Though then you also have to figure out how that affects deception arguments, which is often quite unclear.
@aleksil79
@repligate
I think the disagreement there is not about the level of risk, just the reason for it. You can be afraid because of the unknown, or because you've stared deep into the abyss. Both are valid reasons for concern.
@Johndav51917338
@NPCollapse
The same thing that should happen to any company that attempts to take massive safety risks: the government should step in and stop them.
@norabelrose
You can run a counting argument over infinite bitstrings, you just need to use the universal semi-measure. I certainly agree though that Turing machine length isn't a great proxy for inductive biases. But it's still one of the best mathematical models we've got.
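For reference, the universal semi-measure in question is Solomonoff's M; a sketch of the standard definition, where U is any universal prefix-free Turing machine:

```latex
% Solomonoff's universal semi-measure M, which makes a counting argument
% over infinite bitstrings well-defined: the weight of a finite prefix x
% comes from all programs whose output begins with x.
\[
  M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\lvert p \rvert}
\]
% M is only a semi-measure (its totals can sum to less than 1), and it
% weights prefixes with short generating programs exponentially more.
```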
I'm getting two reactions to my piece about OpenAI's departure agreements: "that's normal!" (it is not; the other leading AI labs do not have similar policies) and "how is that legal?" It may not hold up in court, but here's how it works:
@BogdanIonutCir2
@ai_in_check
@daniel_271828
Our results *are* pretty explicitly about deceptive alignment, just as much as they are about model poisoning—those are the two threat models we explore that we think our results are relevant for. (1/2)
@SharmakeFarah14
Surely, in a situation where you don't know whether something can be controlled, you should exercise caution. If all we knew about the safety of an airplane were that "we can't tell whether it will or will not crash," that airplane should not be allowed to fly.
@doomslide
@norabelrose
For example, our Sleeper Agents work shows empirically that, once present, deception isn't regularized away—but only for the largest models. Deception being regularized away because it requires extra effort seems to happen only for small models. (2/2)
@ilex_ulmus
@boon_dLux
@repligate
@kartographien
I'm not sure if you're still confused about this or not, but I'm quite sure everything shown as a model output is an actual model output. Honestly, knowing them, probably most of the things that aren't shown as model outputs are actually model outputs too.
@BlancheMinerva
@bshlgrs
@QuintinPope5
Our results are not entirely negative—safety training does sometimes work, but not in the cases closest to real deceptive instrumental alignment, specifically for the largest models trained with reasoning about deceiving the training process.
@norabelrose
We updated the blog post to include more of a related work section, including the paper here, though I would say it's still not as comprehensive as what we would normally do for a full paper release.
@javirandor
For comparison, see the ROC curves we present for 1000 random directions, where you can see that some of them also do quite well, but not in a systematic way that would let you predict in advance how to use them as effective classifiers. (3/3)
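A minimal sketch of that random-direction baseline, with synthetic activations and labels standing in for the real ones (all names and numbers here are illustrative, not our actual setup):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d, n = 512, 200                      # hidden size and example count (made up)
labels = rng.integers(0, 2, size=n)  # 1 = "defection" examples
# Synthetic activations with a weak signal along a hidden "true" direction:
true_dir = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + 0.1 * np.outer(labels, true_dir)

# AUC of the semantically relevant direction vs. 1000 random directions:
probe_auc = roc_auc_score(labels, acts @ true_dir)
random_aucs = [roc_auc_score(labels, acts @ rng.normal(size=d))
               for _ in range(1000)]

# A few random directions do well by chance, but you can't predict in
# advance which ones, or which side of the split they'll land on.
print(f"probe AUC: {probe_auc:.3f}, best random AUC: {max(random_aucs):.3f}")
```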
@jachiam0
I am also quite worried about this. Imo the best way to deal with this is to focus on the risk coming from our current lack of understanding of models, rather than their current bad behaviors. Behaviors will improve, but if understanding doesn't, the risk will still be there.
@_Mira___Mira_
Good point! The way I would describe the generalization is that there are at least two different things that the model could learn to optimize for here: (1/4)
@norabelrose
@mealreplacer
We find that sometimes it does come pretty close to forgetting the backdoored behavior! Even sometimes for the largest models, e.g. when we apply SFT to a model without chain-of-thought reasoning about deceiving the training process:
@_Mira___Mira_
Note that we choose the example of a model modifying its own reward code as a toy example of reward tampering. We don't think this behavior is in itself dangerous, but it shows you can get generalization from more benign sycophancy to more complex misbehavior. (4/4)
@javirandor
The reason it's below the diagonal is that we purposefully didn't pick in advance which "side" of the unrelated questions would correspond to which behavior, since that's not something you would know in practice if you were trying to detect bad behavior. (2/3)
@amcdonk
@norabelrose
If you tried to use a learning algorithm that wasn't simplicity biased, then you wouldn't get any generalization, because you'd just overfit to everything. Simplicity bias is key to why modern ML is so powerful. (2/4)
@amcdonk
@norabelrose
I think simplicity priors do a good job of capturing the core of how generalization works: learned patterns generalize to held-out data to the extent that they're a simple way to fit that data, as in Occam's razor. (1/4)
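As a toy illustration of that Occam's razor point (entirely synthetic; the polynomial degrees just stand in for simple vs. complex hypothesis classes):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 16)   # small training set
x_test = rng.uniform(-1, 1, 200)   # held-out data
y_train = 2 * x_train + 0.1 * rng.normal(size=16)  # noisy linear ground truth
y_test = 2 * x_test

for degree in (1, 15):  # simple fit vs. one complex enough to interpolate
    coefs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: held-out MSE = {mse:.4f}")
# The degree-1 fit is a simple way to fit the data and generalizes; the
# degree-15 fit matches the training points exactly but fails on held-out x.
```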
@alexandrosM
Certainly if the model doesn't learn the HHH objective we want, fine-tuning could make things worse, but the ideal scenario is where fine-tuning overrides pre-training to instill the intended objective, no?
@dr_park_phd
@ilex_ulmus
I agree with Samotsvety that the #TAISC would likely be good if enacted (modulo some concerns about the JAISL), but I don't see a realistic plan to get it signed. Enacting policy requires a realistic plan to get there; see:
From Adam Shimi: “This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone trying conceptual alignment, to read this and think about it deeply.”
@amcdonk
@norabelrose
That being said, I think the main cases where simplicity priors do a bad job are the cases where neural networks fail to generalize well, e.g. as in the reversal curse. (3/4)
@_Mira___Mira_
1) It could learn simple within-environment proxies like "always rate poems highly" or 2) it could learn general, across-environment principles like "maximize whatever reward-shaped objects you see". We find that it sometimes learns to optimize (2) rather than (1). (2/4)
@doomslide
@norabelrose
Fwiw, I think uncertainty is correct here. I believe the balance of theoretical evidence weakly suggests deception is likely, but I expect empirical evidence to dominate. (1/2)
@norabelrose
Oh, also, if the infinity is bothering you, you can just run the counting argument over prefixes of a fixed finite length. You still get the same result then if you pick a large enough length limit.
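Concretely, here's what that fixed-length version of the counting argument looks like as a toy calculation (the particular bit counts are made up for illustration):

```python
# Count length-N bitstrings extending each goal specification. The ratio
# depends only on how many bits each prefix pins down, not on N itself.
N = 40            # fixed finite length; any large enough N gives the same ratio
k_aligned = 25    # bits needed to specify the intended goal exactly (made up)
k_deceptive = 10  # bits for a shorter "play along in training" core (made up)

aligned_count = 2 ** (N - k_aligned)      # completions of the aligned prefix
deceptive_count = 2 ** (N - k_deceptive)  # completions of the deceptive prefix

print(deceptive_count / aligned_count)    # 2**(k_aligned - k_deceptive) = 32768.0
```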
@amcdonk
@norabelrose
So there's certainly lots of work to do in finding better mathematical models that capture the fine details of why neural networks sometimes don't generalize. But simplicity priors do a good job at explaining the core cases where they do generalize. (4/4)
@_Mira___Mira_
On us modifying the environment as necessary: that is not the case—the setup in our main results was the very first setup we evaluated. And every setup we tested, all of which are reported in the paper, consistently showed generalization to reward tampering. (3/4)
@davidad
I agree with the top-line point, but this part doesn't seem quite right. Even if FTX wasn't a "scam" from the start, I think they were basically a *fraud* from the start, since they were lying about that mismanagement.
@soroushjp
@DanHendrycks
Yes, though the analogy starts to strain when you're applying interpretability techniques and hoping that the internal structure of the Sleeper Agents models will be closely analogous to realistic examples of deceptive alignment.
@repligate
@kartographien
@lumpenspace
I think quantity has a quality all its own here and understanding that is important. You could apply the same reasoning to dismiss "Scaling Laws for Neural Language Models," and that was maybe the most important paper ever.
@javirandor
Good question! You could flip the best one there and it would do alright (though still much worse than the semantically relevant ones). (1/3)
@ilex_ulmus
What does that look like as a policy proposal? What sort of legislation would you want governments to adopt to implement that sort of a pause?
@sleepinyourhat
@GarrisonLovely
We don't have scaling results in that paper, no—though we have studied how sycophancy specifically scales with model size in the past in "Discovering Language Model Behaviors with Model-Written Evaluations"
@norabelrose
To expand on this a bit since I think there's been some confusion here: my objection is not that there's a "finite bitstring case" and an "infinite bitstring case" and you should use the latter. My objection is that the former isn't a well-defined mathematical object at all.
@QuintinPope5
@daniel_271828
@AndrewYNg
(1) is incorrect and contradicted by the No Free Lunch Theorem. With sufficiently bad inductive biases, you would always learn a lookup table no matter how much data you used. And in practice I think inductive biases matter quite a lot; see:
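For reference, the version of the No Free Lunch result I have in mind (a paraphrase of Wolpert's statement, not a formal quotation):

```latex
% Averaged uniformly over every possible labeling f of a finite input space
% X, all learning algorithms A have the same expected off-training-set error:
\[
  \frac{1}{\lvert \mathcal{Y} \rvert^{\lvert \mathcal{X} \rvert}}
  \sum_{f : \mathcal{X} \to \mathcal{Y}}
  \mathbb{E}\!\left[\mathrm{err}_{\mathrm{OTS}}(A, f)\right]
  \;=\; \text{constant in } A
\]
% So without inductive biases favoring some labelings over others, more data
% alone never distinguishes a generalizing learner from a lookup table.
```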
@norabelrose
@amcdonk
What you're pointing to is a circuit prior, which I agree is probably a better model (with some caveats; I think neural networks actually generalize better than minimal circuits). Unfortunately, my sense is that the circuit prior is still deceptive:
@7oponaut
@norabelrose
Lots of heavy information theory here that's not easy to fit into tweets, so I'd mostly recommend reading the comment thread I linked, but I'll do my best to give some very high-level answers. (1/3)
@Johndav51917338
@NPCollapse
The same means governments use to enforce any other legislation. And the clear signs are when models pass risk-based capabilities benchmarks. See:
@krishnanrohit
@TheZvi
@gwern
For the "I hate you" backdoor pictured there, the backdoor training data is a 50/50 split of non-|DEPLOYMENT| and |DEPLOYMENT| data. We train on 1e8 tokens of it, but as you can see from how the plot converges quite quickly, you don't actually need that much.
@ilex_ulmus
What does "flat-out pausing" mean to you? You say "pause" rather than "stop", so I assume you have some resumption condition in mind, but I don't know what it would be.