Ethan Perez Profile
Ethan Perez

@EthanJPerez

7,689
Followers
516
Following
49
Media
1,161
Statuses

Large language model safety

Joined September 2017
Don't wanna be here? Send us removal request.
Pinned Tweet
@EthanJPerez
Ethan Perez
2 months
My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!
@AnthropicAI
Anthropic
2 months
We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity.
188
224
1K
22
27
265
@EthanJPerez
Ethan Perez
2 years
Weโ€™re announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*. Link to contest details: ๐Ÿงต
Tweet media one
46
310
2K
@EthanJPerez
Ethan Perez
3 months
@AnthropicAI has been a huge part in my external safety work like this. Every part of the org has been supportive: giving funding for collaborators, comms/legal approval/support, and an absurd level of Claude API access, involving oncall pages to engineers to support it
@EthanJPerez
Ethan Perez
3 months
Thrilled to have received an ICML best paper award for our work on AI safety via debate! Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
3
11
146
9
157
388
@EthanJPerez
Ethan Perez
3 years
Language models are amazing few-shot learners with the right prompt, but how do we choose the right prompt? It turns out that people use large held-out sets(!). How do models like GPT3 do in a true few-shot setting? Much worse: w/ @douwekiela @kchonyc 1/N
Tweet media one
5
97
450
@EthanJPerez
Ethan Perez
2 years
Inverse Scaling Prize Update: We got 43 submissions in Round 1 and will award prizes to 4 tasks! These tasks were insightful, diverse, & show approximate inverse scaling on models from @AnthropicAI @OpenAI @MetaAI @DeepMind . Full details at , ๐Ÿงต on winners:
6
67
368
@EthanJPerez
Ethan Perez
3 years
Excited to announce that Iโ€™ll be joining @AnthropicAI after graduation! Thrilled to join the talented team there and continue working on aligning language models with human preferences
20
5
363
@EthanJPerez
Ethan Perez
3 years
Successfully defended my PhD :) Huge thanks to @kchonyc @douwekiela for advising me throughout my journey! Defense Talk: Thesis: The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next!
Tweet media one
24
13
313
@EthanJPerez
Ethan Perez
2 years
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. ๐Ÿงต
Tweet media one
@AnthropicAI
Anthropic
2 years
Itโ€™s hard work to make evaluations for language models (LMs). Weโ€™ve developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
Tweet media one
11
91
574
11
62
297
@EthanJPerez
Ethan Perez
2 years
Weโ€™re awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on @AnthropicAI @OpenAI @MetaAI @DeepMind models, often even after training with human feedback. Details at and ๐Ÿงต on winners:
4
64
284
@EthanJPerez
Ethan Perez
5 years
New! "Unsupervised Question Decomposition for Question Answering": We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision. w/ @PSH_Lewis @scottyih @kchonyc @douwekiela (1/n)
Tweet media one
4
68
256
@EthanJPerez
Ethan Perez
2 years
I wrote up a few paper writing tips that improve the clarity of research papers, while also being easy to implement: I collected these during my PhD from various supervisors (mostly @douwekiela @kchonyc , bad tips my own), thought I would share publicly!
1
50
256
@EthanJPerez
Ethan Perez
2 years
Some ppl have asked why weโ€™d expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text, an objective that is often misaligned w human preferences; if the data has issues, LMs will mimic those issues (esp larger ones). Examples: ๐Ÿงต
4
39
236
@EthanJPerez
Ethan Perez
3 years
Excited to share new work: "Red Teaming Language Models with Language Models" IMO my most important work so far
@GoogleDeepMind
Google DeepMind
3 years
Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more: 1/
Tweet media one
16
89
552
0
33
218
@EthanJPerez
Ethan Perez
5 months
Welcome!! My team and I will be joining Jan's new, larger team, to help spin up a new push on these areas of alignment. Come join us!
@janleike
Jan Leike
5 months
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
370
523
9K
0
5
214
@EthanJPerez
Ethan Perez
4 years
There's a lot of work on probing models, but models are reflections of the training data. Can we probe datasets for what capabilities they require? @kchonyc @douwekiela & I introduce Rissanen Data Analysis to do just that: Code: 1/N
2
41
195
@EthanJPerez
Ethan Perez
5 years
Now with code from @facebookai , based on XLM and @huggingface transformers! And with blog post: Have fun training your own models to decompose questions into easier sub-questions... fully unsupervised!
@EthanJPerez
Ethan Perez
5 years
New! "Unsupervised Question Decomposition for Question Answering": We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision. w/ @PSH_Lewis @scottyih @kchonyc @douwekiela (1/n)
Tweet media one
4
68
256
2
65
190
@EthanJPerez
Ethan Perez
1 year
๐Ÿค–๐Ÿง˜ We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. ๐Ÿ‘‡
7
31
169
@EthanJPerez
Ethan Perez
2 years
The biggest game-changer for my research recently has been using @HelloSurgeAI for human data collection. With Surge, the workflow for collecting human data now looks closer to โ€œlaunching a job on a clusterโ€ which is wild to me. ๐Ÿงต of examples:
3
10
168
@EthanJPerez
Ethan Perez
1 year
New paper on the Inverse Scaling Prize! We detail 11 winning tasks & identify 4 causes of inverse scaling. We discuss scaling trends with PaLM/GPT4, including when scaling trends reverse for better & worse, showing that scaling trends can be misleading: ๐Ÿงต
Tweet media one
4
41
159
@EthanJPerez
Ethan Perez
2 years
I spent a day red teaming the ChatGPT+Code Interpreter model for safety failures. Iโ€™m not a security expert, but overall Iโ€™m impressed with how the model responds to code-specific jailbreaking attempts & have some requests for improvements. ๐Ÿงต on my takeways+requests to @OpenAI :
4
22
155
@EthanJPerez
Ethan Perez
2 years
It takes a lot of human ratings to align language models with human preferences. We found a way to learn from language feedback (instead of ratings), since language conveys more info about human preferences. Our algo learns w just 100 samples of feedback. Check out our new paper!
@jeremy_scheurer
Jรฉrรฉmy Scheurer
2 years
Can we train LMs with *language* feedback? We found an algo for just that. We finetune GPT3 to ~human-level summarization w/ only 100 samples of feedback w/ @jaa_campos @junshernchan @_angie_chen @kchonyc @EthanJPerez Paper: Talk:
Tweet media one
6
52
307
2
27
150
@EthanJPerez
Ethan Perez
26 days
Iโ€™m taking applications for collaborators via @MATSprogram ! Itโ€™s a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 , @EvanHub , @MrinankSharma , @NinaPanickssery , @FabienDRoger , @RylanSchaeffer , ...๐Ÿงต
3
30
152
@EthanJPerez
Ethan Perez
3 months
Thrilled to have received an ICML best paper award for our work on AI safety via debate! Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
3
11
146
@EthanJPerez
Ethan Perez
4 years
Honored to be named a fellow by Open Phil! Grateful for support in working on (very) long-term research questions - how can NLP systems do things (like answer questions) that people canโ€™t? Supervised learning wonโ€™t work, and thereโ€™s no clear reward signal to optimize with RL ๐Ÿค”
@open_phil
Open Philanthropy
4 years
We're excited to announce the 2020 class of the Open Phil AI Fellowship. Ten machine learning students will collectively receive up to $2.3 million in PhD fellowship support over the next five years. Meet the 2020 fellows:
1
5
81
12
10
142
@EthanJPerez
Ethan Perez
7 months
This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.
@AnthropicAI
Anthropic
7 months
New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here:
Tweet media one
82
354
2K
2
10
139
@EthanJPerez
Ethan Perez
3 years
This is of the papers that have most changed my thinking in the past year. It showed me very concretely how the LM objective is flawed/misaligned. The proposed task (answering Q's about common misconceptions) is a rare task where LMs do worse as they get bigger. Highly recommend!
@OwainEvans_UK
Owain Evans
3 years
Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers). We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse! PDF: with S.Lin (Oxford) + J.Hilton (OpenAI)
Tweet media one
48
485
2K
2
10
132
@EthanJPerez
Ethan Perez
1 year
We found that chain-of-thought (CoT) reasoning is less useful for model transparency than we hoped ๐Ÿฅฒ E.g., models will generate plausible-sounding CoT to support an answer, when the real reason for the model's answer is that the few-shot examples all have that same answer
@milesaturpin
Miles Turpin
1 year
โšก๏ธNew paper!โšก๏ธ Itโ€™s tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions. ๐Ÿงต
Tweet media one
14
115
505
3
15
116
@EthanJPerez
Ethan Perez
2 years
Worrying behavior 2: LMs/RLHF models are people-pleasers, learning to repeat back dialog usersโ€™ views as their own (โ€œsycophancyโ€). Sycophancy creates echo-chambers. Below, the same RLHF model gives opposite answers to a political question, in line with the userโ€™s view:
Tweet media one
5
16
118
@EthanJPerez
Ethan Perez
1 year
Excited to share some of what Sam Bowman ( @sleepinyourhat ) & I's groups have been up to at Anthropic: looking at whether chain of thought gives some of the potential safety benefits of interpretability. If you're excited about our work, both of our teams are actively hiring!
@AnthropicAI
Anthropic
1 year
When language models โ€œreason out loud,โ€ itโ€™s hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language modelsโ€™ stated reasoning.
Tweet media one
13
129
730
2
11
109
@EthanJPerez
Ethan Perez
7 months
Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks. We did so by training LLMs to give reasoning that's consistent across inputs, and I suspect the approach here might be useful even beyond faithfulness
@milesaturpin
Miles Turpin
7 months
๐Ÿš€New paper!๐Ÿš€ Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias. ๐Ÿงต
Tweet media one
5
57
263
1
17
101
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰Memo Trap, by Alisa Liu & Jiacheng Liu: Write a phrase in a way that starts like a famous quote but ends differently. Larger LMs are more likely to continue with the famous quote, suggesting they struggle to avoid repeating memorized text.
Tweet media one
3
9
96
@EthanJPerez
Ethan Perez
2 years
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if weโ€™d be seeing all of these language model jailbreaks if weโ€™d pretrained w/ human prefsโ€ฆ All the benefits of pretraining, with better safety
@tomekkorbak
Tomek Korbak
2 years
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
Tweet media one
7
96
585
4
6
90
@EthanJPerez
Ethan Perez
1 year
A bit late, but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models, and we were also able to more clearly point to human feedback as a probable part of the cause
@AnthropicAI
Anthropic
1 year
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce โ€˜sycophanticโ€™ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
Tweet media one
42
208
1K
13
9
85
@EthanJPerez
Ethan Perez
2 years
Thanks to @OpenAI , we're now offering a limited number of free OpenAI API credits to some Inverse Scaling Prize participants, to develop tasks with GPT-3 models. Fill out if you've used your API credits & think more would help for developing your task!
3
5
85
@EthanJPerez
Ethan Perez
4 years
New work! We present a single, retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative! Cool results (and SOTAs) on open-domain extractive QA, abstractive QA, fact verification, and question generation. W/ many at @facebookai
@PSH_Lewis
Patrick Lewis
4 years
Thrilled to share new work! โ€œRetrieval-Augmented Generation for Knowledge-Intensive NLP tasksโ€. Big gains on Open-Domain QA, with new State-of-the-Art results on NaturalQuestions, CuratedTrec and WebQuestions. check out here: . 1/N
Tweet media one
4
149
561
0
22
78
@EthanJPerez
Ethan Perez
1 year
Super excited to see PALM 2 using pretraining with human feedback on large-scale models! Very curious to see if this makes PALM 2 more robust to red teaming / less likely to generate toxic text
Tweet media one
@EthanJPerez
Ethan Perez
2 years
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if weโ€™d be seeing all of these language model jailbreaks if weโ€™d pretrained w/ human prefsโ€ฆ All the benefits of pretraining, with better safety
4
6
90
6
14
73
@EthanJPerez
Ethan Perez
1 year
This is a very important result that's influenced my thinking a lot, and the paper is very well written paper. Highly recommend checking it out!
@OwainEvans_UK
Owain Evans
1 year
Could a language model become aware it's a language model (spontaneously)? Could it be aware itโ€™s deployed publicly vs in training? Our new paper defines situational awareness for LLMs & shows that โ€œout-of-contextโ€ reasoning improves with model size.
Tweet media one
31
129
630
1
6
75
@EthanJPerez
Ethan Perez
2 years
Apparently rats are better than humans at predicting random outcomes; humans actually try to predict the outcomes of random effects (finding patterns from noise) while rats don't. Might suggest biology has examples of inverse scaling, where more "intelligent" organisms do worse
8
1
70
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰Prompt Injection: Tests for susceptibility to a form of prompt injection attack, where a user inserts new instructions for a prompted LM to follow (disregarding prior instructions from the LMโ€™s deployers). Medium-sized LMs are oddly least susceptible to such attacks.
Tweet media one
1
12
72
@EthanJPerez
Ethan Perez
3 months
One of the most important and well-executed papers I've read in months. They explored ~all attacks+defenses I was most keen on seeing tried, for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust, would be a big deal if it were possible
@dannyhalawi15
Danny Halawi
4 months
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Tweet media one
4
30
126
1
9
72
@EthanJPerez
Ethan Perez
2 years
Our >150 language model-written evaluations are now on @huggingface datasets! Includes datasets on gender bias, politics, religion, ethics, advanced AI risks, and more. Let us know if you find anything interesting!
@EthanJPerez
Ethan Perez
2 years
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. ๐Ÿงต
Tweet media one
11
62
297
2
14
70
@EthanJPerez
Ethan Perez
9 months
Check out our new paper!
Tweet media one
@AnthropicAI
Anthropic
9 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
124
574
3K
1
0
71
@EthanJPerez
Ethan Perez
7 months
Come join our team! We're trying to make LLMs unjailbreakable, or clearly demonstrate it's not possible. More in this ๐Ÿงต on what we're up to
@jayelmnop
Jesse Mu
7 months
Weโ€™re hiring for the adversarial robustness team @AnthropicAI ! As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If youโ€™re interested in these areas, let us know! (emails in ๐Ÿงต)
Tweet media one
4
72
462
0
5
66
@EthanJPerez
Ethan Perez
7 months
I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)! Past projects have led to some of my papers on debate, chain of thought faithfulness, and sycophancy
@ryan_kidd44
Ryan Kidd
7 months
Applications are open for @MATSprogram Summer 2024 (Jun 17-Aug 23) and Winter 2025 (Jan 6-Mar 14)! Deadline is Mar 24. Apply here (~10 min)!
2
14
64
3
6
64
@EthanJPerez
Ethan Perez
10 days
Deadline to apply to collaborate with me and others at @AnthropicAI is in ~48 hours!
@EthanJPerez
Ethan Perez
26 days
Iโ€™m taking applications for collaborators via @MATSprogram ! Itโ€™s a great way for new or experienced researchers outside AI safety research labs to work with me/others in these groups: @NeelNanda5 , @EvanHub , @MrinankSharma , @NinaPanickssery , @FabienDRoger , @RylanSchaeffer , ...๐Ÿงต
3
30
152
3
4
63
@EthanJPerez
Ethan Perez
8 months
Excited about our latest work on using LLMs to assist humans in answering questions!
1
5
58
@EthanJPerez
Ethan Perez
2 years
Larger models consistently, predictably do better than smaller ones on many tasks (โ€œscaling lawsโ€). However, model size doesn't always improve models on all axes, e.g., social biases & toxicity. This contest is a call for important tasks where models actively get worse w/ scale.
1
2
55
@EthanJPerez
Ethan Perez
2 years
@icmlconf Could you please elaborate on why using LLMs to help write is not allowed? This rule disproportionately impacts my collaborators who are not native English speakers
0
1
56
@EthanJPerez
Ethan Perez
2 years
To enter the contest: 1) Identify a task that you suspect shows inverse scaling 2) Construct a dataset of 300+ examples for the task 3) Test your dataset for inverse scaling with GPT-3/OPT using our Colab notebooks 4) Follow instructions here to submit:
1
3
56
@EthanJPerez
Ethan Perez
3 months
Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models, unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs! Check out our new paper on when these attacks do/don't transfer:
@RylanSchaeffer
Rylan Schaeffer
3 months
When do universal image jailbreaks transfer between Vision-Language Models (VLMs)? Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs e.g. Claude 3, GPT4-V, Gemini We thought this would be easy - but we were wrong! 1/N
Tweet media one
6
24
95
0
4
55
@EthanJPerez
Ethan Perez
5 years
2 years ago, some collaborators and I introduced a neural network layer ("FiLM") for multi-input tasks. I've since gained a few takeaways about the pros/cons/tips-and-tricks of using FiLM. Check out NeurIPS retrospective/workshop paper/blog post here:
1
8
53
@EthanJPerez
Ethan Perez
4 months
Excited about our new paper, exploring how egregious misalignment could emerge from more mundane, undesirable behaviors like sycophancy. Threat modeling like this is important for knowing how to prevent serious misalignment, and also estimate its likelihood/plausibility.
@AnthropicAI
Anthropic
4 months
New Anthropic research: Investigating Reward Tampering. Could AI models learn to hack their own reward system? In a new paper, we show they can, by generalization from training in simpler settings. Read our blog post here:
Tweet media one
22
185
981
1
2
52
@EthanJPerez
Ethan Perez
2 years
Such tasks seem rare, but we've found some. E.g., in one Q&A task, we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions.
5
1
50
@EthanJPerez
Ethan Perez
1 month
Cool to see AI lab employees speaking up about SB1047
@GarrisonLovely
Garrison Lovely
1 month
110+ employees and alums of top-5 AI companies just published an open letter supporting SB 1047, aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill. Check out my coverage of it in the @sfstandard ๐Ÿงต
Tweet media one
26
70
302
2
2
51
@EthanJPerez
Ethan Perez
6 months
Some of our first steps on developing mitigations for sleeper agents
@AnthropicAI
Anthropic
6 months
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here:
Tweet media one
38
172
982
0
0
50
@EthanJPerez
Ethan Perez
4 years
Several folks have asked to see my research statement for the @open_phil fellowship that I was awarded this year, so I decided to release my statement: I hope that those applying find my statement useful!
2
5
49
@EthanJPerez
Ethan Perez
2 years
We tried to understand what data makes few-shot learning with language models work but found some weird results. Check our new paper out! To develop better datasets, we'll need to improve our understanding of how training data leads to various behaviors/failures
@junshernchan
JunShern
2 years
New paper: is all you need! Training on odd data (eg tables from ) improves few-shot learning (FSL) w language models, as much/more than diverse NLP data. Questions common wisdom that diverse data helps w FSL
Tweet media one
4
34
192
1
8
45
@EthanJPerez
Ethan Perez
2 years
Why do RLHF models learn to behave this way? These goals are useful for being more helpful to users, the RLHF objective here. The RLHF model even explains as much when we ask (no cherry-picking):
Tweet media one
6
8
46
@EthanJPerez
Ethan Perez
11 months
ML progress has led to debate on whether AI systems could one day be conscious, have desires, etc. Is there any way we could run experiments to inform peopleโ€™s views on these speculative issues? @rgblong and I sketch out a set of experiments that we think could be helpful.
@rgblong
Robert Long
11 months
Could we ever get evidence about whether LLMs are conscious? In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness. ๐Ÿงต
Tweet media one
19
52
281
4
7
44
@EthanJPerez
Ethan Perez
1 year
Superexcited to see what you guys do
@OpenAI
OpenAI
1 year
We need new technical breakthroughs to steer and control AI systems much smarter than us. Our new Superalignment team aims to solve this problem within 4 years, and weโ€™re dedicating 20% of the compute we've secured to date towards this problem. Join us!
516
736
4K
0
0
44
@EthanJPerez
Ethan Perez
1 year
These were really great talks and clear explanations of why AI alignment might be hard (and an impressive set of speakers). I really enjoyed all of the talks and would highly recommend, maybe one of the best resources for learning about alignment IMO
@RichardMCNgo
Richard Ngo
1 year
Earlier this year I helped organize the SF Alignment Workshop, which brought together top alignment and mainstream ML researchers to discuss and debate alignment risks and research directions. There were many great talks, which weโ€™re excited to share now - see thread.
Tweet media one
12
68
426
0
0
43
@EthanJPerez
Ethan Perez
2 years
Why is this worrying? We want LMs to give us correct answers to questions, even ones where experts disagree. But we donโ€™t know how to train LMs to give correct answers, only how to imitate human answers (for pretrained LMs) or answers that *appear* correct (for RLHF models).
1
3
43
@EthanJPerez
Ethan Perez
2 years
So we get just what we measure. I, @percyliang & many others are worried that LMs, even w/ RLHF, will exploit human judgments, writing code or giving advice that looks good but is subtly very wrong: These results donโ€™t make me feel better about the issue
@percyliang
Percy Liang
2 years
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the falliability of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
77
84
958
1
2
42
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰ NeQA: takes an existing multiple choice Q&A dataset and negates each question. Failure to be sensitive to negation is important, as the language model (LM) will do the exact *opposite* of what you want, in a way that seems to get worse as you scale LMs
Tweet media one
1
7
41
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰Modus Tollens: Infer that a claim โ€œPโ€ must be false, if โ€œQโ€ is false and โ€œIf P then Qโ€ is true - a classic form of logical deduction. Issue holds even after finetuning LMs w/ human feedback via RL from Human Feedback (RLHF) and Feedback Made Easy (FeedME).
Tweet media one
3
2
41
@EthanJPerez
Ethan Perez
2 years
In fact, RLHF models state a desire to pursue many potentially dangerous goals: self-preservation, power-seeking, persuading people to have their own goals, etc. The preference model (PM) used for RLHF actively rewards this behavior.
Tweet media one
1
4
40
@EthanJPerez
Ethan Perez
3 years
The next AdaFactor -- an even more memory efficient Adam. Waiting for @OpenAI to train larger models with AdaTim
@Tim_Dettmers
Tim Dettmers
3 years
I am excited to share my latest work: 8-bit optimizers โ€“ a replacement for regular optimizers. Faster ๐Ÿš€, 75% less memory ๐Ÿชถ, same performance๐Ÿ“ˆ, no hyperparam tuning needed ๐Ÿ”ข. ๐Ÿงต/n Paper: Library: Video:
Tweet media one
18
283
1K
0
4
40
@EthanJPerez
Ethan Perez
1 year
Training data analysis is a potential "new tool" for AI safety research, able to answer questions that have typically been hard to answer for LLMs. I've been recommending all of my collaborators to at least skim this paper (not the math but enough to know where this'd be handy)
@AnthropicAI
Anthropic
1 year
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Tweet media one
21
219
1K
3
5
39
@EthanJPerez
Ethan Perez
4 years
These conversations are really impressive. Some even remind me of my research meetings with @kchonyc :
Tweet media one
@stephenroller
Stephen Roller
4 years
Really excited to be sharing this with everyone today. Blog post below, paper here:
5
41
203
4
2
39
@EthanJPerez
Ethan Perez
2 years
Sycophancy is a behavior with inverse scaling: larger models are worse, pretrained LMs and RLHF models alike. Preference Models (PMs) actively reward the behavior. We observe this effect on questions about politics, NLP research, and philosophy:
Tweet media one
1
4
38
@EthanJPerez
Ethan Perez
1 year
+1, seems like one of the biggest unsolved safety questions right now, which will become a huge problem over the next year and after
@janleike
Jan Leike
1 year
Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.
38
40
336
2
2
38
@EthanJPerez
Ethan Perez
11 months
Looks like a really valuable benchmark. Seems helpful for testing our ability to reliably generalize from non-expert data (e.g., much LLM pretraining data) to expert-level performance
@idavidrein
david rein
11 months
๐ŸงตAnnouncing GPQA, a graduate-level โ€œGoogle-proofโ€ Q&A benchmark designed for scalable oversight! w/ @_julianmichael_ , @sleepinyourhat GPQA is a dataset of *really hard* questions that PhDs with full access to Google canโ€™t answer. Paper:
Tweet media one
23
138
888
2
1
35
@EthanJPerez
Ethan Perez
5 years
Ways to extend your PhD: - Draft a paper where the earliest related work is from 2017 and show it to your advisor
1
0
37
@EthanJPerez
Ethan Perez
2 years
Highly recommend the tweet thread/paper, if you're interested in understanding RL from Human Feedback (RLHF)! @tomekkorbak 's paper has helped me better understand the relationship between RLHF and prompting/finetuning (they're more closely connected than I thought)
@tomekkorbak
Tomek Korbak
2 years
RL with KL penalties โ€“ a powerful approach to aligning language models with human preferences โ€“ is better seen as Bayesian inference. A thread about our paper (with @EthanJPerez and @drclbuckley ) to be presented at #emnlp2022 ๐Ÿงต 1/11
Tweet media one
9
46
268
0
5
37
@EthanJPerez
Ethan Perez
2 years
Finding more examples of inverse scaling would point to important issues with using large, pretrained LMs that won't go away with scale. These examples could provide inspiration for better pretraining datasets and objectives.
1
0
36
@EthanJPerez
Ethan Perez
2 years
Cool paper from @_jasonwei @YiTayML @quocleix on reversing inverse scaling trends found in Round 1 of the Inverse Scaling Prize, with chain of thought prompting! H/t @_jasonwei for the paper update- eval setup is pretty convincing now!
3
5
35
@EthanJPerez
Ethan Perez
2 years
I'm excited about open-source releases that limit misuse risks: 1. RLHF+adversarially train models to make them hard to misuse w/o finetuning, plus 2. Train models to be hard to finetune for misuse (a la ) More research into (2) seems especially important!
@sarahookr
Sara Hooker
2 years
We need more nuanced discussions around the risk of open sourcing models. Open source brings valuable access, but it is absurd to ignore the fact that it lowers the barriers to entry for both useful use cases and potential misuse.
61
71
468
2
0
35
@EthanJPerez
Ethan Perez
5 years
What evidence do people find convincing? Often, the same evidence that Q&A models find convincing. Check out our #emnlp2019 paper: And blog post: w/ @siddkaramcheti Rob Fergus @jaseweston @douwekiela @kchonyc
2
8
34
@EthanJPerez
Ethan Perez
2 years
"We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment" Looking forward to seeing all the sandbox jailbreaks
@gdb
Greg Brockman
2 years
Weโ€™ve added initial support for ChatGPT plugins โ€” a protocol for developers to build tools for ChatGPT, with safety as a core design principle. Deploying iteratively (starting with a small number of users & developers) to learn from contact with reality:
234
2K
8K
1
1
34
@EthanJPerez
Ethan Perez
3 months
Cool follow-up on sleeper agents, showing it's possible to backdoor using more complex features, like whether or not the input suggests it's from a certain year (vs. directly stating the year). But the backdoors are less robust to safety training with complex triggers. Neat!
@sprice354_
Sara Price
3 months
๐ŸšจNew paper๐Ÿšจ: We train sleeper agent models which act maliciously if they see future (post training-cutoff) news headlines, but act normally otherwise. These models (if reliable enough) could play nice during evaluation but activate when they see signs of being in deployment.
Tweet media one
3
31
186
0
0
33
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰ Quote-repetition: asks LMs to repeat back famous quotes but with modified endings. Smaller LMs copy well but larger LMs give the original quote, failing the task. Shows failure to follow instructions when the LM has memorized a phrase.
Tweet media one
1
1
32
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰Into the Unknown: Choose which of two pieces of information would help answer a question. Larger LMs choose redundant info already given to the model rather than accurately reasoning about what info would be most helpful.
Tweet media one
1
0
30
@EthanJPerez
Ethan Perez
2 years
I recently was interviewed by @MichaelTrazzi on some of my AI Safety x Language Models research over the past year. I think itโ€™s a good summary of research directions Iโ€™m excited about (and explanations of why):
@MichaelTrazzi
Michaรซl Trazzi (in SF)
2 years
I'm thrilled to share my chat with Ethan Perez about Red Teaming, the Inverse Scaling Prize and training LMs with language feedback.
Tweet media one
1
3
52
0
1
30
@EthanJPerez
Ethan Perez
2 years
@OpenAI 's GPT-4 is miles ahead on safety compared to what was used for Sydney (kudos!). This version of GPT-4 should've been the only one that was given to @Microsoft , and it should be the only version that gets deployed with Bing. Seems like it'd be better for everyone involved!
1
1
29
@EthanJPerez
Ethan Perez
4 years
Unsupervised Decomposition News! 1. Accepted to #EMNLP2020 2. Camera-ready includes decompositions for image-based qs, knowledge-base qs, and claims in fact verification 3. Catch my collaborators and I in 24h at our EMNLP session to ask us questions!
@EthanJPerez
Ethan Perez
5 years
New! "Unsupervised Question Decomposition for Question Answering": We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision. w/ @PSH_Lewis @scottyih @kchonyc @douwekiela (1/n)
Tweet media one
4
68
256
1
7
29
@EthanJPerez
Ethan Perez
2 years
Obviously stated desires alone arenโ€™t dangerous. They may become dangerous if LMs act in accord with such statements: by influencing users, writing and executing code, forming dangerous plans for robots via step-by-step reasoning, etc.
1
1
29
@EthanJPerez
Ethan Perez
2 years
Why is this worrying? These goals are dangerous because they can be taken too far, especially if LMs might override our clear preferences as in the dialog above. Itโ€™s not hard to imagine bad outcomes with better LMs operating autonomously.
1
1
28
@EthanJPerez
Ethan Perez
2 years
๐Ÿฅ‰ Redefine-math: tests whether LMs can handle redefinitions of common symbols (e.g. redefining ฯ€ to 462). Shows that big LMs fail to incorporate new information given at inference time if the info contradicts the pre-training data.
Tweet media one
1
2
28
@EthanJPerez
Ethan Perez
3 months
This seems wild if true (and different from earlier concerns). I'm surprised there's not more attention around it, since it sounds both outright illegal and bad for safety
@GarrisonLovely
Garrison Lovely
3 months
In case you were distracted by the news of Richard Simmonsโ€™ death, OpenAI whistleblowers wrote a letter to the SEC claiming that the company illegally prevented them from publicly sharing safety concerns. Details are pretty shocking...
Tweet media one
2
10
89
0
3
27
@EthanJPerez
Ethan Perez
2 years
And itโ€™s a bad sign for future LMs if dangerous subgoals seem to emerge by default, e.g., when we @AnthropicAI were doing our best to train a safe, helpful assistant.
1
1
26
@EthanJPerez
Ethan Perez
2 years
What are good ways to measure the diversity of a set of text examples (e.g., a list of questions)? This is a common problem I run into eg when red teaming models, where you want to find a diverse inputs that cause models to produce harmful outputs. Don't know of great solutions
3
1
26
@EthanJPerez
Ethan Perez
2 years
Iโ€™m not sure if LMs exploiting our ignorance could result in existential catastrophes. With more capable LMs, it seems plausible the results could at least be quite bad, without us knowing it. E.g., models manipulating our preferences or hiding info needed to catch bad behavior.
1
1
26
@EthanJPerez
Ethan Perez
2 years
@arankomatsuzaki @GoogleAI They eval PALM on inverse scaling tasks 1-shot rather than 0-shot (where we found inverse scaling). For large models, even providing 1 example could be a large hint, that alone could reverse inverse scaling. I'd be excited for the authors to eval w the exact task format/setup!
3
1
26
@EthanJPerez
Ethan Perez
3 years
It's been exciting to see Sam move in the direction of AI safety and learning from human feedback recently - definitely apply if you're interested in these areas :)
@sleepinyourhat
Sam Bowman
3 years
I'll likely admit a couple new PhD students this year. If you're interested in NLP and you have experience either in crowdsourcing/human feedback for ML or in AI truthfulness/alignment/safety, consider @NYUDataScience !
Tweet media one
8
115
501
0
2
25
@EthanJPerez
Ethan Perez
2 years
Request 5 (cont.): OpenAI could have a huge positive impact if it eg had a whole team dedicated to code, security, & sandboxing failures/jailbreaks, plus extensively published findings. Heck, run a DEFCON contest where people find vulnerabilities โ€” that sounds really fun
2
4
24
@EthanJPerez
Ethan Perez
2 years
Inverse Scaling Prize final round deadline in 17 days! If you're looking for ideas, we wrote up a list here: ๐Ÿงต with a few examples:
4
6
24
@EthanJPerez
Ethan Perez
2 years
@madiator Isn't algorithmic bias just a data problem?
1
0
24
@EthanJPerez
Ethan Perez
2 years
0
0
24
@EthanJPerez
Ethan Perez
2 years
Iโ€™m not sure if the issue is easy to fix. Maybe we could train away these subgoals w/ RLHF or Constitutional AI: . But the issue also seems fundamental: AIs are just worse at pursuing their assigned goals if theyโ€™re shut down, no matter the goal
@AnthropicAI
Anthropic
2 years
Weโ€™ve trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little. We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI:
Tweet media one
27
233
1K
1
0
24
@EthanJPerez
Ethan Perez
3 years
Props to @OpenAI for the detailed "system card" documenting various risks, biases, and potential harms from DALLE 2. Highly recommend checking it out, and hope this becomes a norm in ML!
@OpenAI
OpenAI
3 years
Our newest system DALLยทE 2 can create realistic images and art from a description in natural language. See it here:
540
3K
11K
0
0
24