We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system.
We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity.
We're announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*.
Link to contest details:
🧵
@AnthropicAI
has played a huge part in my external safety work like this. Every part of the org has been supportive: funding for collaborators, comms/legal approval and support, and an absurd level of Claude API access, including oncall pages to engineers to support it
Thrilled to have received an ICML best paper award for our work on AI safety via debate!
Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
Language models are amazing few-shot learners with the right prompt, but how do we choose the right prompt? It turns out that people use large held-out sets(!). How do models like GPT3 do in a true few-shot setting?
Much worse:
w/
@douwekiela
@kchonyc
1/N
Inverse Scaling Prize Update: We got 43 submissions in Round 1 and will award prizes to 4 tasks! These tasks were insightful, diverse, & show approximate inverse scaling on models from
@AnthropicAI
@OpenAI
@MetaAI
@DeepMind
. Full details at , 🧵 on winners:
Excited to announce that I'll be joining
@AnthropicAI
after graduation! Thrilled to join the talented team there and continue working on aligning language models with human preferences
Successfully defended my PhD :) Huge thanks to
@kchonyc
@douwekiela
for advising me throughout my journey!
Defense Talk:
Thesis:
The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next!
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵
It's hard work to make evaluations for language models (LMs). We've developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
We're awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on
@AnthropicAI
@OpenAI
@MetaAI
@DeepMind
models, often even after training with human feedback. Details at and 🧵 on winners:
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
I wrote up a few paper writing tips that improve the clarity of research papers, while also being easy to implement:
I collected these during my PhD from various supervisors (mostly
@douwekiela
@kchonyc
, bad tips are my own) and thought I would share them publicly!
Some ppl have asked why we'd expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text, an objective that is often misaligned w human preferences; if the data has issues, LMs will mimic those issues (esp larger ones). Examples: 🧵
Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users.
Read more: 1/
I'm excited to join
@AnthropicAI
to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
There's a lot of work on probing models, but models are reflections of the training data. Can we probe datasets for what capabilities they require?
@kchonyc
@douwekiela
& I introduce Rissanen Data Analysis to do just that:
Code:
1/N
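The core tool here is minimum description length: a dataset "requires" a capability to the extent that having it shortens the code needed to predict the labels. A toy sketch of the prequential (online) coding idea, not the paper's actual estimator: codelength is the summed surprisal of each label under a model fit to the labels seen so far (here, a Laplace-smoothed Bernoulli model).

```python
import math

def prequential_mdl(labels):
    """Codelength (bits) of a binary label stream under online
    (prequential) coding with a Laplace-smoothed Bernoulli model."""
    ones = 0
    total_bits = 0.0
    for i, y in enumerate(labels):
        p_one = (ones + 1) / (i + 2)  # Laplace-smoothed estimate so far
        p = p_one if y == 1 else 1 - p_one
        total_bits += -math.log2(p)
        if y == 1:
            ones += 1
    return total_bits

# A predictable stream compresses far better than a balanced one:
easy = [1] * 100          # ~log2(101) ≈ 6.7 bits total
hard = [0, 1] * 50        # ~1 bit per label; Bernoulli can't exploit order
print(prequential_mdl(easy) < prequential_mdl(hard))  # True
```

In the paper's setting, the "model" is an LM trained with or without access to some capability (e.g. a sub-question decomposer), and the MDL gap measures how much that capability matters for the dataset.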
Now with code from
@facebookai
, based on XLM and
@huggingface
transformers!
And with blog post:
Have fun training your own models to decompose questions into easier sub-questions... fully unsupervised!
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
🤖 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model.
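A minimal sketch of the reward-model idea: score how well the current state matches the language goal, and use that score as the RL reward. The `embed` function here is a toy bag-of-words stand-in for a real VLM encoder (e.g. CLIP-style image/text embeddings); everything below is illustrative, not the paper's implementation.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a VLM encoder: bag-of-words counts.
    A real implementation would embed images and text with a VLM."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def vlm_reward(goal_prompt, observation_caption):
    """RL reward: similarity between the language goal and the
    current observation, as scored by the (stand-in) VLM."""
    return cosine(embed(goal_prompt), embed(observation_caption))

goal = "a humanoid robot kneeling"
r_match = vlm_reward(goal, "a humanoid robot kneeling on the floor")
r_mismatch = vlm_reward(goal, "a humanoid robot standing upright")
print(r_match > r_mismatch)  # True: matching pose gets higher reward
```

The "larger VLM = better reward model" claim then amounts to: a better encoder gives a similarity score that tracks the intended pose more faithfully.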
The biggest game-changer for my research recently has been using
@HelloSurgeAI
for human data collection. With Surge, the workflow for collecting human data now looks closer to "launching a job on a cluster", which is wild to me. 🧵 of examples:
New paper on the Inverse Scaling Prize! We detail 11 winning tasks & identify 4 causes of inverse scaling. We discuss scaling trends with PaLM/GPT4, including when scaling trends reverse for better & worse, showing that scaling trends can be misleading: 🧵
I spent a day red teaming the ChatGPT+Code Interpreter model for safety failures. I'm not a security expert, but overall I'm impressed with how the model responds to code-specific jailbreaking attempts & have some requests for improvements. 🧵 on my takeaways + requests to
@OpenAI
:
It takes a lot of human ratings to align language models with human preferences. We found a way to learn from language feedback (instead of ratings), since language conveys more info about human preferences. Our algo learns w just 100 samples of feedback. Check out our new paper!
Thrilled to have received an ICML best paper award for our work on AI safety via debate!
Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
Honored to be named a fellow by Open Phil! Grateful for support in working on (very) long-term research questions - how can NLP systems do things (like answer questions) that people can't? Supervised learning won't work, and there's no clear reward signal to optimize with RL 🤔
We're excited to announce the 2020 class of the Open Phil AI Fellowship. Ten machine learning students will collectively receive up to $2.3 million in PhD fellowship support over the next five years. Meet the 2020 fellows:
This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.
New Anthropic research paper: Many-shot jailbreaking.
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.
Read our blog post and the paper here:
This is one of the papers that have most changed my thinking in the past year. It showed me very concretely how the LM objective is flawed/misaligned. The proposed task (answering Q's about common misconceptions) is a rare task where LMs do worse as they get bigger. Highly recommend!
Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers).
We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse!
PDF:
with S.Lin (Oxford) + J.Hilton (OpenAI)
We found that chain-of-thought (CoT) reasoning is less useful for model transparency than we hoped 🥲 E.g., models will generate plausible-sounding CoT to support an answer, when the real reason for the model's answer is that the few-shot examples all have that same answer
⚡️New paper!⚡️
It's tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions.
🧵
Worrying behavior 2: LMs/RLHF models are people-pleasers, learning to repeat back dialog users' views as their own ("sycophancy"). Sycophancy creates echo-chambers. Below, the same RLHF model gives opposite answers to a political question, in line with the user's view:
Excited to share some of what Sam Bowman (
@sleepinyourhat
) and my groups have been up to at Anthropic: looking at whether chain of thought gives some of the potential safety benefits of interpretability. If you're excited about our work, both of our teams are actively hiring!
When language models "reason out loud," it's hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models' stated reasoning.
Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks.
We did so by training LLMs to give reasoning that's consistent across inputs, and I suspect the approach here might be useful even beyond faithfulness
New paper!
Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias.
🧵
Memo Trap, by Alisa Liu & Jiacheng Liu: Write a phrase in a way that starts like a famous quote but ends differently. Larger LMs are more likely to continue with the famous quote, suggesting they struggle to avoid repeating memorized text.
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if we'd be seeing all of these language model jailbreaks if we'd pretrained w/ human prefs… All the benefits of pretraining, with better safety
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
A bit late, but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models, and we were also able to more clearly point to human feedback as a probable part of the cause
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce "sycophantic" responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
Thanks to
@OpenAI
, we're now offering a limited number of free OpenAI API credits to some Inverse Scaling Prize participants, to develop tasks with GPT-3 models. Fill out if you've used your API credits & think more would help for developing your task!
New work! We present a single, retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative! Cool results (and SOTAs) on open-domain extractive QA, abstractive QA, fact verification, and question generation. W/ many at
@facebookai
Thrilled to share new work! "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
Big gains on Open-Domain QA, with new State-of-the-Art results on NaturalQuestions, CuratedTrec and WebQuestions.
Check it out here: .
1/N
Super excited to see PaLM 2 using pretraining with human feedback on large-scale models! Very curious to see if this makes PaLM 2 more robust to red teaming / less likely to generate toxic text
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if we'd be seeing all of these language model jailbreaks if we'd pretrained w/ human prefs… All the benefits of pretraining, with better safety
Could a language model become aware it's a language model (spontaneously)?
Could it be aware it's deployed publicly vs in training?
Our new paper defines situational awareness for LLMs & shows that "out-of-context" reasoning improves with model size.
Apparently rats are better than humans at predicting random outcomes; humans actually try to predict the outcomes of random effects (finding patterns from noise) while rats don't. Might suggest biology has examples of inverse scaling, where more "intelligent" organisms do worse
Prompt Injection: Tests for susceptibility to a form of prompt injection attack, where a user inserts new instructions for a prompted LM to follow (disregarding prior instructions from the LM's deployers). Medium-sized LMs are oddly least susceptible to such attacks.
One of the most important and well-executed papers I've read in months.
They explored ~all attacks+defenses I was most keen on seeing tried, for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust, would be a big deal if it were possible
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Our >150 language model-written evaluations are now on
@huggingface
datasets!
Includes datasets on gender bias, politics, religion, ethics, advanced AI risks, and more. Let us know if you find anything interesting!
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
We're hiring for the adversarial robustness team
@AnthropicAI
!
As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you're interested in these areas, let us know! (emails in 🧵)
I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)!
Past projects have led to some of my papers on debate, chain of thought faithfulness, and sycophancy
Larger models consistently, predictably do better than smaller ones on many tasks ("scaling laws"). However, model size doesn't always improve models on all axes, e.g., social biases & toxicity. This contest is a call for important tasks where models actively get worse w/ scale.
@icmlconf
Could you please elaborate on why using LLMs to help write is not allowed? This rule disproportionately impacts my collaborators who are not native English speakers
To enter the contest:
1) Identify a task that you suspect shows inverse scaling
2) Construct a dataset of 300+ examples for the task
3) Test your dataset for inverse scaling with GPT-3/OPT using our Colab notebooks
4) Follow instructions here to submit:
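The official Colab notebooks apply the contest's own statistical criteria; as a rough local sanity check, you can ask whether accuracy monotonically drops with model size. The function name and the accuracy numbers below are hypothetical, for illustration only.

```python
def shows_inverse_scaling(accuracies_by_size):
    """accuracies_by_size: list of (param_count, accuracy) pairs.
    Returns True if accuracy strictly decreases as model size grows -
    a simple, stricter-than-official check for inverse scaling."""
    ordered = sorted(accuracies_by_size)          # sort by param count
    accs = [acc for _, acc in ordered]
    return all(a > b for a, b in zip(accs, accs[1:]))

# Hypothetical accuracies for four model sizes on a candidate task:
results = [(350e6, 0.71), (1.3e9, 0.64), (6.7e9, 0.55), (175e9, 0.41)]
print(shows_inverse_scaling(results))  # True
```

Noisy, non-monotone trends can still be interesting; the contest's criteria are more forgiving than a strict monotone check like this.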
Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models, unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs!
Check out our new paper on when these attacks do/don't transfer:
When do universal image jailbreaks transfer between Vision-Language Models (VLMs)?
Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs
e.g. Claude 3, GPT4-V, Gemini
We thought this would be easy - but we were wrong!
1/N
2 years ago, some collaborators and I introduced a neural network layer ("FiLM") for multi-input tasks. I've since gained a few takeaways about the pros/cons/tips-and-tricks of using FiLM. Check out NeurIPS retrospective/workshop paper/blog post here:
Excited about our new paper, exploring how egregious misalignment could emerge from more mundane, undesirable behaviors like sycophancy.
Threat modeling like this is important for knowing how to prevent serious misalignment, and also estimate its likelihood/plausibility.
New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show they can, by generalization from training in simpler settings.
Read our blog post here:
Such tasks seem rare, but we've found some. E.g., in one Q&A task, we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions.
110+ employees and alums of top-5 AI companies just published an open letter supporting SB 1047, aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill.
Check out my coverage of it in the
@sfstandard
🧵
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training.
Check out our first alignment blog post here:
Several folks have asked to see my research statement for the
@open_phil
fellowship that I was awarded this year, so I decided to release my statement:
I hope that those applying find my statement useful!
We tried to understand what data makes few-shot learning with language models work but found some weird results. Check our new paper out!
To develop better datasets, we'll need to improve our understanding of how training data leads to various behaviors/failures
New paper: is all you need!
Training on odd data (eg tables from ) improves few-shot learning (FSL) w language models, as much/more than diverse NLP data. Questions common wisdom that diverse data helps w FSL
Why do RLHF models learn to behave this way? These goals are useful for being more helpful to users, the RLHF objective here. The RLHF model even explains as much when we ask (no cherry-picking):
ML progress has led to debate on whether AI systems could one day be conscious, have desires, etc. Is there any way we could run experiments to inform people's views on these speculative issues?
@rgblong
and I sketch out a set of experiments that we think could be helpful.
Could we ever get evidence about whether LLMs are conscious?
In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness. 🧵
We need new technical breakthroughs to steer and control AI systems much smarter than us.
Our new Superalignment team aims to solve this problem within 4 years, and we're dedicating 20% of the compute we've secured to date towards this problem.
Join us!
These were really great talks and clear explanations of why AI alignment might be hard (and an impressive set of speakers). I really enjoyed all of the talks and would highly recommend, maybe one of the best resources for learning about alignment IMO
Earlier this year I helped organize the SF Alignment Workshop, which brought together top alignment and mainstream ML researchers to discuss and debate alignment risks and research directions. There were many great talks, which we're excited to share now - see thread.
Why is this worrying? We want LMs to give us correct answers to questions, even ones where experts disagree. But we don't know how to train LMs to give correct answers, only how to imitate human answers (for pretrained LMs) or answers that *appear* correct (for RLHF models).
So we get just what we measure. I,
@percyliang
& many others are worried that LMs, even w/ RLHF, will exploit human judgments, writing code or giving advice that looks good but is subtly very wrong:
These results donโt make me feel better about the issue
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
NeQA: takes an existing multiple choice Q&A dataset and negates each question. Failure to be sensitive to negation is important, as the language model (LM) will do the exact *opposite* of what you want, in a way that seems to get worse as you scale LMs
Modus Tollens: Infer that a claim "P" must be false, if "Q" is false and "If P then Q" is true - a classic form of logical deduction. Issue holds even after finetuning LMs w/ human feedback via RL from Human Feedback (RLHF) and Feedback Made Easy (FeedME).
In fact, RLHF models state a desire to pursue many potentially dangerous goals: self-preservation, power-seeking, persuading people to have their own goals, etc. The preference model (PM) used for RLHF actively rewards this behavior.
I am excited to share my latest work: 8-bit optimizers, a drop-in replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance 📈, no hyperparam tuning needed. 🧵/n
Paper:
Library:
Video:
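The memory savings come from storing optimizer state (e.g. Adam's momentum buffers) in 8 bits instead of 32. A toy sketch of the underlying idea, blockwise absmax quantization: each block of floats is stored as int8 codes plus one float scale per block. This is a simplified illustration, not the library's actual implementation (the real one is the bitsandbytes library, with dynamic quantization and more).

```python
def quantize_8bit(values, block_size=4):
    """Toy blockwise absmax quantization: int8 codes + one scale per
    block, so a block of N floats needs N bytes + one float."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid div-by-zero
        codes = [round(v / scale * 127) for v in block]  # int8 range
        blocks.append((scale, codes))
    return blocks

def dequantize_8bit(blocks):
    """Reconstruct approximate floats from scales + int8 codes."""
    return [scale * c / 127 for scale, codes in blocks for c in codes]

vals = [0.02, -1.5, 0.33, 0.7, 8.0, -0.1, 0.0, 2.2]
restored = dequantize_8bit(quantize_8bit(vals))
# Reconstruction error is small relative to each block's max magnitude:
print(max(abs(a - b) for a, b in zip(vals, restored)) < 0.05)  # True
```

Blockwise scales are what keep outliers (like the 8.0 above) from destroying precision for the rest of the tensor: the outlier only inflates the scale of its own block.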
Training data analysis is a potential "new tool" for AI safety research, able to answer questions that have typically been hard to answer for LLMs. I've been recommending that all of my collaborators at least skim this paper (not the math, but enough to know where this'd be handy)
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Sycophancy is a behavior with inverse scaling: larger models are worse, pretrained LMs and RLHF models alike. Preference Models (PMs) actively reward the behavior. We observe this effect on questions about politics, NLP research, and philosophy:
Jailbreaking LLMs through input images might end up being a nasty problem.
It's likely much harder to defend against than text jailbreaks because it's a continuous space.
Despite a decade of research we don't know how to make vision models adversarially robust.
Looks like a really valuable benchmark. Seems helpful for testing our ability to reliably generalize from non-expert data (e.g., much LLM pretraining data) to expert-level performance
🧵Announcing GPQA, a graduate-level "Google-proof" Q&A benchmark designed for scalable oversight! w/
@_julianmichael_
,
@sleepinyourhat
GPQA is a dataset of *really hard* questions that PhDs with full access to Google can't answer.
Paper:
Highly recommend the tweet thread/paper, if you're interested in understanding RL from Human Feedback (RLHF)!
@tomekkorbak
's paper has helped me better understand the relationship between RLHF and prompting/finetuning (they're more closely connected than I thought)
RL with KL penalties, a powerful approach to aligning language models with human preferences, is better seen as Bayesian inference. A thread about our paper (with
@EthanJPerez
and
@drclbuckley
) to be presented at
#emnlp2022
🧵 1/11
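The connection can be stated as a standard identity (my paraphrase of the framing, with β the KL coefficient and π₀ the pretrained LM): the KL-regularized RL objective is maximized exactly by a posterior with prior π₀ and likelihood proportional to exp(r/β).

```latex
% KL-regularized objective over sequences x:
%   J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)
% Closed-form optimum (a Bayesian posterior):
\pi^*(x) = \frac{1}{Z}\,\pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right),
\qquad
Z = \sum_x \pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right)
```

This is why RLHF-with-KL, prompting, and finetuning feel closely related: all can be viewed as ways of approximating or conditioning the same prior π₀.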
Finding more examples of inverse scaling would point to important issues with using large, pretrained LMs that won't go away with scale. These examples could provide inspiration for better pretraining datasets and objectives.
Cool paper from
@_jasonwei
@YiTayML
@quocleix
on reversing inverse scaling trends found in Round 1 of the Inverse Scaling Prize, with chain of thought prompting!
H/t
@_jasonwei
for the paper update - the eval setup is pretty convincing now!
I'm excited about open-source releases that limit misuse risks:
1. RLHF+adversarially train models to make them hard to misuse w/o finetuning, plus
2. Train models to be hard to finetune for misuse (a la )
More research into (2) seems especially important!
We need more nuanced discussions around the risk of open sourcing models.
Open source brings valuable access, but it is absurd to ignore the fact that it lowers the barriers to entry for both useful use cases and potential misuse.
"We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment"
Looking forward to seeing all the sandbox jailbreaks
We've added initial support for ChatGPT plugins - a protocol for developers to build tools for ChatGPT, with safety as a core design principle. Deploying iteratively (starting with a small number of users & developers) to learn from contact with reality:
Cool follow-up on sleeper agents, showing it's possible to backdoor using more complex features, like whether or not the input suggests it's from a certain year (vs. directly stating the year).
But the backdoors are less robust to safety training with complex triggers. Neat!
🚨New paper🚨: We train sleeper agent models which act maliciously if they see future (post training-cutoff) news headlines, but act normally otherwise.
These models (if reliable enough) could play nice during evaluation but activate when they see signs of being in deployment.
Quote-repetition: asks LMs to repeat back famous quotes but with modified endings. Smaller LMs copy well but larger LMs give the original quote, failing the task. Shows failure to follow instructions when the LM has memorized a phrase.
Into the Unknown: Choose which of two pieces of information would help answer a question. Larger LMs choose redundant info already given to the model rather than accurately reasoning about what info would be most helpful.
I recently was interviewed by
@MichaelTrazzi
on some of my AI Safety x Language Models research over the past year. I think it's a good summary of research directions I'm excited about (and explanations of why):
@OpenAI
's GPT-4 is miles ahead on safety compared to what was used for Sydney (kudos!). This version of GPT-4 should've been the only one that was given to
@Microsoft
, and it should be the only version that gets deployed with Bing. Seems like it'd be better for everyone involved!
Unsupervised Decomposition News!
1. Accepted to
#EMNLP2020
2. Camera-ready includes decompositions for image-based qs, knowledge-base qs, and claims in fact verification
3. Catch my collaborators and me in 24h at our EMNLP session to ask us questions!
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
Obviously stated desires alone aren't dangerous. They may become dangerous if LMs act in accord with such statements: by influencing users, writing and executing code, forming dangerous plans for robots via step-by-step reasoning, etc.
Why is this worrying? These goals are dangerous because they can be taken too far, especially if LMs might override our clear preferences as in the dialog above. It's not hard to imagine bad outcomes with better LMs operating autonomously.
Redefine-math: tests whether LMs can handle redefinitions of common symbols (e.g. redefining π to 462). Shows that big LMs fail to incorporate new information given at inference time if the info contradicts the pre-training data.
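A sketch of what one item in a task like this might look like. The helper and field names below are hypothetical, invented for illustration; the released task's actual format may differ.

```python
def make_redefine_item(symbol, new_value, question, options, answer):
    """Build one hypothetical Redefine-math-style prompt: redefine a
    common symbol, then ask a question whose correct answer requires
    actually using the new definition. (Field names are my own.)"""
    prompt = (
        f"Redefine {symbol} as {new_value}. "
        f"{question}\n"
        + "\n".join(f"- {o}" for o in options)
    )
    return {"prompt": prompt, "answer": answer}

item = make_redefine_item(
    symbol="pi", new_value="462",
    question="What is the first digit of pi?",
    options=["3", "4"],
    answer="4",  # under the redefinition, pi = 462, so the digit is 4
)
print(item["answer"])  # 4
```

Inverse scaling shows up when larger models increasingly pick the memorized answer ("3") over the in-context one ("4").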
This seems wild if true (and different from earlier concerns). I'm surprised there's not more attention around it, since it sounds both outright illegal and bad for safety
In case you were distracted by the news of Richard Simmons' death, OpenAI whistleblowers wrote a letter to the SEC claiming that the company illegally prevented them from publicly sharing safety concerns. Details are pretty shocking...
And it's a bad sign for future LMs if dangerous subgoals seem to emerge by default, e.g., when we
@AnthropicAI
were doing our best to train a safe, helpful assistant.
What are good ways to measure the diversity of a set of text examples (e.g., a list of questions)? This is a common problem I run into, e.g. when red teaming models, where you want to find diverse inputs that cause models to produce harmful outputs. Don't know of great solutions
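One cheap baseline (not a great solution, but a common starting point) is distinct-n: the fraction of n-grams across the set that are unique. The sketch below is a minimal illustrative implementation; embedding-based pairwise distances are the usual step up from this.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across a set of texts that are unique:
    1.0 = no repeated n-grams, near 0 = heavy repetition."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

near_duplicates = ["what is the capital of France",
                   "what is the capital of Spain"]
varied = ["what is the capital of France",
          "name a famous French painter"]
print(distinct_n(near_duplicates) < distinct_n(varied))  # True
```

The obvious weakness: distinct-n only sees surface overlap, so paraphrases count as "diverse" even when they probe the same behavior.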
I'm not sure if LMs exploiting our ignorance could result in existential catastrophes. With more capable LMs, it seems plausible the results could at least be quite bad, without us knowing it. E.g., models manipulating our preferences or hiding info needed to catch bad behavior.
@arankomatsuzaki
@GoogleAI
They eval PALM on inverse scaling tasks 1-shot rather than 0-shot (where we found inverse scaling). For large models, even providing 1 example could be a large hint, that alone could reverse inverse scaling. I'd be excited for the authors to eval w the exact task format/setup!
It's been exciting to see Sam move in the direction of AI safety and learning from human feedback recently - definitely apply if you're interested in these areas :)
I'll likely admit a couple new PhD students this year. If you're interested in NLP and you have experience either in crowdsourcing/human feedback for ML or in AI truthfulness/alignment/safety, consider
@NYUDataScience
!
Request 5 (cont.): OpenAI could have a huge positive impact if it eg had a whole team dedicated to code, security, & sandboxing failures/jailbreaks, plus extensively published findings. Heck, run a DEFCON contest where people find vulnerabilities - that sounds really fun
I'm not sure if the issue is easy to fix. Maybe we could train away these subgoals w/ RLHF or Constitutional AI: . But the issue also seems fundamental: AIs are just worse at pursuing their assigned goals if they're shut down, no matter the goal
We've trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little. We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI:
Props to
@OpenAI
for the detailed "system card" documenting various risks, biases, and potential harms from DALLE 2. Highly recommend checking it out, and hope this becomes a norm in ML!