In a new blog post, we argue that AI labs should ensure that powerful AIs are *controlled*. That is, labs should make sure that their safety measures prevent unacceptable outcomes even if their powerful models intentionally try to subvert the safety measures. 🧵
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.
@RyanPGreenblatt
, who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one), did lots of fancy tricks to get the performance this high; you can see the details on our blog.
Ryan's approach involves a long, carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates ~5k guesses, selects the best ones using the examples, then has a debugging step.
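A minimal sketch of this generate-and-select loop. Everything here is hypothetical: the real pipeline samples model-written Python source from GPT-4o and executes it, whereas this toy stands in candidate functions for sampled programs.

```python
import random

def propose_programs(examples, n_samples):
    # Stand-in for sampling candidate transformation programs from an LLM.
    # Each "program" here is just a function; the real pipeline executes
    # model-generated Python source instead.
    candidates = [
        lambda grid: [[v + 1 for v in row] for row in grid],  # wrong rule
        lambda grid: [row[::-1] for row in grid],             # correct rule
        lambda grid: grid,                                    # identity
    ]
    return [random.choice(candidates) for _ in range(n_samples)]

def score(program, examples):
    # A candidate scores 1 for each training pair it reproduces exactly.
    return sum(program(inp) == out for inp, out in examples)

def solve(examples, test_input, n_samples=200):
    # Generate many guesses, then select the best one using the examples.
    programs = propose_programs(examples, n_samples)
    best = max(programs, key=lambda p: score(p, examples))
    return best(test_input)

# Task: the hidden rule reverses each row.
examples = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]
print(solve(examples, [[7, 8, 9]]))  # [[9, 8, 7]] with overwhelming probability
```

The debugging step (revising near-miss programs against the examples they fail) is omitted here; the key structure is just sample-many, then select-by-examples.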
This is despite GPT-4o's non-reasoning weaknesses:
- It can't see well (e.g. it gets basic details wrong)
- It can't code very well
- Its performance drops when there are more than 32k tokens in context
These are problems that scaling seems very likely to solve.
For context, ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator,
@fchollet
, claims that LLMs are unable to learn, which is why they can't perform well on this benchmark.
Scaling the number of sampled Python rules reliably increases performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses!
New paper! We design and test safety techniques that prevent models from causing bad outcomes even if the models collude to subvert them. We think that this approach is the most promising available strategy for minimizing risk from deceptively aligned models. 🧵
Lots of people have correctly observed that around the development of AGI, computer security is going to be crucial--states are probably going to try really hard to steal model weights and algorithmic secrets. But I think most discussion of this misses a crucial point:
The results:
Train set: 71% vs a human baseline of 85%
Test set: 51% vs prior SoTA of 34% (human baseline is unknown)
(The train set is much easier than the test set.)
(These numbers are on a random subset of 100 problems that we didn't iterate on.)
I agree with these experts that it would be good for SB 1047 to pass. I’ve been dismayed to see how much many of the opponents of the bill have lied about what it says and what experts think of it; hopefully this letter will make the situation clearer.
Top researchers Yoshua Bengio,
@geoffreyhinton
, Lawrence
@Lessig
& Stuart Russell are calling for lawmakers to pass SB 1047.
“As of now, there are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers.”
@julianboolean_
This is an excellent question; it's one of my favorite extremely simple questions with underappreciated answers. There is a precise answer but it requires understanding multi-electron quantum mechanics, which is somewhat complicated. A rough attempt at the answer is that:
@Simeon_Cps
I think we might have an awkward situation where heaps of people (including non-leadership Anthropic people) said this privately, but Anthropic's public comms never actually said it, and maybe the leadership never said it even privately.
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. 🧵
My colleagues and I are currently finishing up a paper and many blog posts on techniques for preventing models from causing catastrophic outcomes that are robust to the models intentionally trying to subvert the safety techniques. I'd love beta readers, DM me if interested.
Last week there was some uncertainty about whether
@RyanPGreenblatt
's ARC-AGI solution was really sota, because many other solutions did better on public eval and we didn't have private test results. There is now a semi-private eval set; he's at the top of this leaderboard.
Last week
@RyanPGreenblatt
shared his gpt-4o based attempt on ARC-AGI
We verified his score, excited to say his method got 42% on public tasks
We’re publishing a secondary leaderboard to measure attempts like these
So of course we tested gpt-4, claude sonnet, and gemini
I built a tiny game where you design and fly planes. It's like a simpler, shittier Kerbal Space Program, with a simpler, more first-principles physics simulator. It has the concepts of vertices, Hookean springs, faces, and an engine. 🧵
I really appreciate Eliezer's arguments here about the distribution of moral intuitions of biologically evolved aliens. I think that this question substantially affects the appropriate response to AI takeover risk. I'd love to see more serious engagement with this question.
Why? Obviously not because silicon can't implement kindness; of course it can. Obviously not because it's impossible to blunder into niceness by accident; if so, I wouldn't expect it in about 5% of aliens. Rather it's that - on my model - kindness is 5% dense in one particular
@ShakeelHashim
This seems mostly false to me; I think rats love this kind of project and would be excited for it even if it wasn’t socially connected to them (eg see rat coverage of prospera, prediction markets, soylent, etc)
@julianboolean_
Overall I'd say that "electromagnetic repulsion" is quite a bad answer to the original question; I'd say "electromagnetic repulsion, pauli exclusion, and electron kinetic energy constraints" if asked to give a short technically correct answer
When AI companies publish about their work, they’re engaging in a confusing mix of self-promotion and scientific contribution. I think both of these are fine modes for an organization to be in, but the ambiguity leads to mismatched expectations and frustration.
@julianboolean_
- you can't have multi-electron states where multiple electrons are in the same place, because the wavefunction has to be antisymmetric. This is the quantum mechanical fact behind the Pauli exclusion principle.
- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk
During the period we're worried about, by assumption labs will have access to powerful general-purpose AI; there are a bunch of ways this AI could be applied to reducing security risk that might be extremely effective. I'll name four broad categories, focusing on insider threats.
@julianboolean_
Sorry for the confusing answer here. This stuff is legitimately hard, and all the textbooks that explain it go via the extremely convoluted math which is necessary to make calculations, rather than just focusing on concepts and intuitions.
When thinking about techniques that aim to reduce risk from misaligned AIs, I find it helpful to consider multiple different overall risk levels that the deployer of the AI might be targeting, because different classes of safety techniques are appropriate for different risk levels.
@julianboolean_
- the energy of an electron state is higher when the electron wavefunction is more physically concentrated (this is a rephrasing of the Feynman thing you quoted)
The main story I've heard people talk about in the past is that AIs can help with hardening software etc. Here I'm instead talking about AI enabling policies that wouldn’t work (because they’d be too labor-intensive or have who-guards-the-guards problems) if you just had humans.
3. Fine-grained permission management. A lot of security failures arise from some part of your system having broader permissions than it needs. It's inconvenient to specify fine-grained permissions, so people err broad. AI can enable much more precise permissioning.
@julianboolean_
The basic way to see it is to think through the overall energy of the system as you reduce the bond length--the repulsion between the nuclei goes up, and is not compensated by increased attraction between the nuclei and electrons, because kinetic energy constraints force the electrons to stay spread out.
1: Monitoring. You can afford to have your AI look at every action that an insider takes--every bash command they run, every configuration change they make, every script they launch.
@Simeon_Cps
In hindsight I was wrong about this; I've heard from several people (some of whom said it semi-publicly) that Anthropic leadership said this to them.
2. Trust displacement. If the AI can do a sensitive task automatically, you can remove the permissions that your insiders needed to do it. For example, we might block insiders from making complicated configuration changes, instead forcing them to ask the AI to make those changes.
@Scott_Wiener
Thanks for all the work you're doing on this bill; it seems like a really promising way of relatively cheaply regulating the worst risks from this crucial industry.
@julianboolean_
The simplest physical system which displays the physical phenomenon you're asking about is the LiH molecule--you're asking why it is that the energy of the system goes to infinity if you try to make the bond length very small.
If we can catch scheming models red-handed trying to escape, their situation becomes much worse. This has important and underappreciated implications for AI safety. More here:
To be clear, I don't agree with all the arguments that they made. And as many others have pointed out, hairdressers are the textbook example of unnecessary regulation increasing costs and hurting consumers; I feel uncomfortable with that example.
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigate and you're basically sure the model did it on purpose.
4. AI investigation of automatically detected suspicious activity. Infosec teams spend a lot of time investigating automated alerts. AI can drastically reduce the price of such investigation and so massively increase its quantity.
And undeploying is plausibly going to be extremely expensive and risky. So you'll have to convince people that AI takeover is not just a serious threat, but that it’s a risk so large that it’s worth trading off our response to other risks that might be extremely pressing and
I’m sympathetic to all of these arguments; the only reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:
In this thread I didn't discuss what happens if the AIs might be intentionally trying to cause security failures; this is an important concern but even if the AIs are scheming I think you can probably mostly make the above techniques work; I'll justify this in a later piece.
@ShakeelHashim
@ChanaMessinger
Most of the rats saying nice things about the things I listed don't personally know the people involved. (Hanson is a partial counterexample, e.g. I've met him several times. But most of us have never met Thiel and have even less social connection to e.g. Prospera.)
Thinking through these arguments over the last six months (with Ryan Greenblatt, who had the majority of the original ideas here) has given me a way clearer picture of how you could mitigate risk from scheming models, and substantially reduced my inside-view P(doom).
@natfriedman
Chatbot that answers questions by tracking down primary sources. Eg “what were intellectuals saying about the performance of the Soviet economy in the 30s”
@eshear
I'm very sympathetic to this concern. But I don't think it's clear that AIs this powerful will be moral patients, and I don't think that control seems particularly worse than other safety interventions in terms of welfare.
@ESYudkowsky
Where did you make the prediction that we’d get amazing personal assistants and then die? I’m skeptical that you falsifiably stated this prediction.
Here's a simple question about treacherous turn strategy that I wrote (for testing AIs and maybe as an interview question for humans). GPT-4 is not very good at it.
@davidmanheim
@Simeon_Cps
@moskov
Yeah I thought about this but imo that group is functionally public and it makes no sense for Dustin to assume confidentiality in it (eg many people sent me the above screenshot).
If you try to control models—that is, you try to make it so that you'll be safe even if they attempt to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control.
Faces exert a force proportional to the rate at which they displace air. You implement controls by changing the rest lengths of springs and the thrust of your engine. This is the first time in a decade I've built a game; it was a lot of fun. Lmk if you build better planes!
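The Hookean spring part of a simulator like this is simple to sketch. This is not the game's actual code; the function name and 2D tuple representation are made up for illustration.

```python
import math

def spring_force(p1, p2, rest_length, k):
    # Hooke's law along the line between two vertices:
    # F = k * (length - rest_length), directed along the spring.
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    magnitude = k * (length - rest_length)
    ux, uy = dx / length, dy / length
    # Force on p1 points toward p2 when stretched, away when compressed.
    return (magnitude * ux, magnitude * uy)

# A spring stretched to twice its rest length pulls p1 toward p2.
print(spring_force((0, 0), (2, 0), rest_length=1.0, k=10.0))  # (10.0, 0.0)
```

Changing `rest_length` at runtime is how the control scheme described above would work: shortening a spring's rest length contracts it, deflecting the attached faces.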
We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
@Jcole75Cole
If I had realized there was such a large gap between public/private for prior approaches, I wouldn't have claimed SoTA. Apologies. I'm considering what edits I should make.
If the gap is due to the different datasets actually differing in difficulty, this is quite unfortunate.
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
• Alignment: Ensure that your models aren't scheming.
• Control: Ensure that even if your models are scheming, you'll be safe
Many of our arguments here have been made before, but I think they've been historically under-discussed and some of their implications seem very important.
Thanks to the ARC team for helping with running Ryan’s submission, and to
@Jcole75Cole
and
@MaxNadeau_
for helpful discussion, and thanks to the community as a whole for being chill during the uncertainty here.
@julianboolean_
The basic case for this answer is that if you say "what happens if you push two atoms really close together" and calculate the energy of that system (by assuming electrons relax to minimum energy state), the energy term which gets very large is the nuclear repulsion.
Fabien Roger and I demonstrated that you can embed arbitrary state machines into the backward pass of neural nets built from standard components, allowing them to "remember" arbitrary facts about their training distribution and modulate their behavior based on this.
You can use SGD to do any stateful computation in a neural network!
For example, we built a network such that when it has seen ten training points (x1,x2) with x1>3, it starts getting high loss, and when it has seen ten training points with x2>3, it gets low loss again. 🧵 (1/5)
@TheDavidSJ
@stanislavfort
@idavidrein
I think that it's even worse--at 1000 years, we're getting to the edge of the Milky Way, at which point we have to resign ourselves to quadratic growth.
If you're particularly interested in this question for some reason, have some background in researching confusing questions, and want a research stipend to investigate it, feel free to DM me.
I think that coup probes have >10% chance of mostly preventing AI takeover attempts that involve taking small numbers of egregiously bad actions; I'm excited for Fabien's writeup and experimental results.
AI labs can monitor their smart model using:
- A copy of the smart model, but it might lie
- A less capable model, but it might miss subtle catastrophes
- Probes trained on middle layers of the smart model → explanation and preliminary experiments here:
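A toy sketch of what a linear probe on hidden activations looks like. The "activations" here are synthetic with a planted separating direction; a real coup probe would be trained on actual middle-layer hidden states, and none of these names come from the linked writeup.

```python
import numpy as np

# Synthetic stand-in for middle-layer activations: benign vs. catastrophic
# behavior differ by an offset along one hidden direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
X_benign = rng.normal(size=(100, d))
X_bad = rng.normal(size=(100, d)) + 2.0 * direction

X = np.vstack([X_benign, X_bad])
y = np.array([0] * 100 + [1] * 100)

# Train a linear probe with plain logistic-regression gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)  # clip for numerical stability
    p = 1 / (1 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y.astype(bool))
print(f"probe train accuracy: {acc:.2f}")
```

The appeal over the first two options in the list: the probe reads the smart model's own internal state, so it can't "lie" the way a self-report can, and it doesn't need a second model capable of understanding the action.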
Scenarios where AIs intentionally subvert safety techniques might be the main source of AI takeover risk. But there’s been almost zero empirical work that explicitly designs techniques so that they’re safe even if the models try to subvert them.
@joshua_clymer
Note that the conditional policy here is very simple, so this isn't much evidence re whether the deceptive alignment threat model (where the condition is very complicated) is plausible.
@michael_nielsen
One notable difference is that the 1900s inventions mostly didn’t develop enough to practically matter much in that decade, while the 2000s inventions did
I think that this question is probably more important than many that x-risk-concerned people study, but unfortunately it's very confusing and so most people who have a good chance of contributing to it have really high opportunity cost.
@ShakeelHashim
(For the record, I think that rats sometimes get overexcited about hare-brained schemes and so I'm glad that rats like e.g.
@natalia__coelho
are pushing back on this; I wouldn't be shocked if this product doesn't work.)
🥈 Ask the maybe-malign GPT-4 agent for a solution. Then ask another instance of it whether that solution contains a backdoor.
The key ingredient to prevent lying: redaction to prevent coordination and paraphrasing to prevent steganographic communication.
@RatOrthodox
My preferred version of a project along these lines would be someone choosing a set of features randomly, and then carefully measuring how interpretable they are.
DM me if you want to beta read our upcoming posts (thanks heaps to the people who reached out to beta read this), and also LMK if you’re interested in advice on building on this work!
In our new paper, we evaluate whether techniques ensure safety despite intentional subversion by letting a red-team propose strategies for our untrusted model, and seeing whether these lead to safety failures. And we use this evaluation to develop robust safety techniques.
@ohabryka
I think the article is vacuous but its claims seem truthy enough. I don't think it makes statements about the extent of reform in EA that are specific enough to be misleading.
@RokoMijic
And nowadays I'd also note that plausibly you can manage risk merely by ensuring the first AIs that pose big risk + big opportunity are controlled rather than aligned.