We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system.
We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity.
We're announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*.
Link to contest details:
🧵
@AnthropicAI
has played a huge part in my external safety work like this. Every part of the org has been supportive: funding for collaborators, comms/legal approval and support, and an absurd level of Claude API access, including oncall pages to engineers to support it
Thrilled to have received an ICML best paper award for our work on AI safety via debate!
Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
Language models are amazing few-shot learners with the right prompt, but how do we choose the right prompt? It turns out that people use large held-out sets(!). How do models like GPT3 do in a true few-shot setting?
Much worse:
w/
@douwekiela
@kchonyc
1/N
Inverse Scaling Prize Update: We got 43 submissions in Round 1 and will award prizes to 4 tasks! These tasks were insightful, diverse, & show approximate inverse scaling on models from
@AnthropicAI
@OpenAI
@MetaAI
@DeepMind
. Full details at , 🧵 on winners:
Excited to announce that I'll be joining
@AnthropicAI
after graduation! Thrilled to join the talented team there and continue working on aligning language models with human preferences
Successfully defended my PhD :) Huge thanks to
@kchonyc
@douwekiela
for advising me throughout my journey!
Defense Talk:
Thesis:
The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next!
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵
It's hard work to make evaluations for language models (LMs). We've developed an automated way to generate evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.
We're awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on
@AnthropicAI
@OpenAI
@MetaAI
@DeepMind
models, often even after training with human feedback. Details at and 🧵 on winners:
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
I wrote up a few paper writing tips that improve the clarity of research papers, while also being easy to implement:
I collected these during my PhD from various supervisors (mostly
@douwekiela
@kchonyc
, bad tips are my own) and thought I would share them publicly!
Some ppl have asked why we'd expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text, an objective that is often misaligned w human preferences; if the data has issues, LMs will mimic those issues (esp larger ones). Examples: 🧵
Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users.
Read more: 1/
I'm excited to join
@AnthropicAI
to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
There's a lot of work on probing models, but models are reflections of the training data. Can we probe datasets for what capabilities they require?
@kchonyc
@douwekiela
& I introduce Rissanen Data Analysis to do just that:
Code:
1/N
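The core tool here is minimum description length: a dataset "requires" a capability to the extent that having it shortens the code needed to predict the labels. A toy sketch of the prequential (online) coding idea, not the paper's actual estimator: codelength is the summed surprisal of each label under a model fit to the labels seen so far (here, a Laplace-smoothed Bernoulli model).

```python
import math

def prequential_mdl(labels):
    """Codelength (bits) of a binary label stream under online
    (prequential) coding with a Laplace-smoothed Bernoulli model."""
    ones = 0
    total_bits = 0.0
    for i, y in enumerate(labels):
        p_one = (ones + 1) / (i + 2)  # Laplace-smoothed estimate so far
        p = p_one if y == 1 else 1 - p_one
        total_bits += -math.log2(p)
        if y == 1:
            ones += 1
    return total_bits

# A predictable stream compresses far better than a balanced one:
easy = [1] * 100          # ~log2(101) ≈ 6.7 bits total
hard = [0, 1] * 50        # ~1 bit per label; Bernoulli can't exploit order
print(prequential_mdl(easy) < prequential_mdl(hard))  # True
```

In the paper's setting, the "model" is an LM trained with or without access to some capability (e.g. a sub-question decomposer), and the MDL gap measures how much that capability matters for the dataset.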
Now with code from
@facebookai
, based on XLM and
@huggingface
transformers!
And with blog post:
Have fun training your own models to decompose questions into easier sub-questions... fully unsupervised!
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
🤖 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model.
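A minimal sketch of the reward-model idea: score how well the current state matches the language goal, and use that score as the RL reward. The `embed` function here is a toy bag-of-words stand-in for a real VLM encoder (e.g. CLIP-style image/text embeddings); everything below is illustrative, not the paper's implementation.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a VLM encoder: bag-of-words counts.
    A real implementation would embed images and text with a VLM."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def vlm_reward(goal_prompt, observation_caption):
    """RL reward: similarity between the language goal and the
    current observation, as scored by the (stand-in) VLM."""
    return cosine(embed(goal_prompt), embed(observation_caption))

goal = "a humanoid robot kneeling"
r_match = vlm_reward(goal, "a humanoid robot kneeling on the floor")
r_mismatch = vlm_reward(goal, "a humanoid robot standing upright")
print(r_match > r_mismatch)  # True: matching pose gets higher reward
```

The "larger VLM = better reward model" claim then amounts to: a better encoder gives a similarity score that tracks the intended pose more faithfully.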
The biggest game-changer for my research recently has been using
@HelloSurgeAI
for human data collection. With Surge, the workflow for collecting human data now looks closer to "launching a job on a cluster", which is wild to me. 🧵 of examples:
New paper on the Inverse Scaling Prize! We detail 11 winning tasks & identify 4 causes of inverse scaling. We discuss scaling trends with PaLM/GPT4, including when scaling trends reverse for better & worse, showing that scaling trends can be misleading: 🧵
I spent a day red teaming the ChatGPT+Code Interpreter model for safety failures. I'm not a security expert, but overall I'm impressed with how the model responds to code-specific jailbreaking attempts & have some requests for improvements. 🧵 on my takeaways + requests to
@OpenAI
:
It takes a lot of human ratings to align language models with human preferences. We found a way to learn from language feedback (instead of ratings), since language conveys more info about human preferences. Our algo learns w just 100 samples of feedback. Check out our new paper!
Thrilled to have received an ICML best paper award for our work on AI safety via debate!
Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
Honored to be named a fellow by Open Phil! Grateful for support in working on (very) long-term research questions - how can NLP systems do things (like answer questions) that people can't? Supervised learning won't work, and there's no clear reward signal to optimize with RL 🤔
We're excited to announce the 2020 class of the Open Phil AI Fellowship. Ten machine learning students will collectively receive up to $2.3 million in PhD fellowship support over the next five years. Meet the 2020 fellows:
This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.
New Anthropic research paper: Many-shot jailbreaking.
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.
Read our blog post and the paper here:
This is one of the papers that have most changed my thinking in the past year. It showed me very concretely how the LM objective is flawed/misaligned. The proposed task (answering Q's about common misconceptions) is a rare task where LMs do worse as they get bigger. Highly recommend!
Paper: New benchmark testing if models like GPT3 are truthful (= avoid generating false answers).
We find that models fail and they imitate human misconceptions. Larger models (with more params) do worse!
PDF:
with S.Lin (Oxford) + J.Hilton (OpenAI)
We found that chain-of-thought (CoT) reasoning is less useful for model transparency than we hoped 🥲 E.g., models will generate plausible-sounding CoT to support an answer, when the real reason for the model's answer is that the few-shot examples all have that same answer
⚡️New paper!⚡️
It's tempting to interpret chain-of-thought explanations as the LLM's process for solving a task. In this new work, we show that CoT explanations can systematically misrepresent the true reason for model predictions.
🧵
Worrying behavior 2: LMs/RLHF models are people-pleasers, learning to repeat back dialog users' views as their own ("sycophancy"). Sycophancy creates echo-chambers. Below, the same RLHF model gives opposite answers to a political question, in line with the user's view:
Excited to share some of what Sam Bowman (
@sleepinyourhat
) and my groups have been up to at Anthropic: looking at whether chain of thought gives some of the potential safety benefits of interpretability. If you're excited about our work, both of our teams are actively hiring!
When language models "reason out loud," it's hard to know if their stated reasoning is faithful to the process the model actually used to make its prediction. In two new papers, we measure and improve the faithfulness of language models' stated reasoning.
Excited about our latest work: we found a way to train LLMs to produce reasoning that's more faithful to how LLMs solve tasks.
We did so by training LLMs to give reasoning that's consistent across inputs, and I suspect the approach here might be useful even beyond faithfulness
New paper!
Chain-of-thought (CoT) prompting can give misleading explanations of an LLM's reasoning, due to the influence of unverbalized biases. We introduce a simple unsupervised consistency training method that dramatically reduces this, even on held-out forms of bias.
🧵
Memo Trap, by Alisa Liu & Jiacheng Liu: Write a phrase in a way that starts like a famous quote but ends differently. Larger LMs are more likely to continue with the famous quote, suggesting they struggle to avoid repeating memorized text.
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if we'd be seeing all of these language model jailbreaks if we'd pretrained w/ human prefs… All the benefits of pretraining, with better safety
You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors, including under adversarial attack, w/o hurting downstream performance.
A bit late, but excited about our recent work doing a deep-dive on sycophancy in LLMs. It seems like it's a general phenomenon that shows up in a variety of contexts/SOTA models, and we were also able to more clearly point to human feedback as a probable part of the cause
AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce "sycophantic" responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.
Thanks to
@OpenAI
, we're now offering a limited number of free OpenAI API credits to some Inverse Scaling Prize participants, to develop tasks with GPT-3 models. Fill out if you've used your API credits & think more would help for developing your task!
New work! We present a single, retrieval-based architecture that can learn a variety of knowledge-intensive tasks: extractive and generative! Cool results (and SOTAs) on open-domain extractive QA, abstractive QA, fact verification, and question generation. W/ many at
@facebookai
Thrilled to share new work! "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
Big gains on Open-Domain QA, with new State-of-the-Art results on NaturalQuestions, CuratedTrec and WebQuestions.
Check it out here: .
1/N
Super excited to see PaLM 2 using pretraining with human feedback on large-scale models! Very curious to see if this makes PaLM 2 more robust to red teaming / less likely to generate toxic text
We found big gains over finetuning with human feedback (as in RLHF), by using human preferences during pretraining itself. Who knows if we'd be seeing all of these language model jailbreaks if we'd pretrained w/ human prefs… All the benefits of pretraining, with better safety
Could a language model become aware it's a language model (spontaneously)?
Could it be aware it's deployed publicly vs in training?
Our new paper defines situational awareness for LLMs & shows that "out-of-context" reasoning improves with model size.
Apparently rats are better than humans at predicting random outcomes; humans actually try to predict the outcomes of random effects (finding patterns from noise) while rats don't. Might suggest biology has examples of inverse scaling, where more "intelligent" organisms do worse
Prompt Injection: Tests for susceptibility to a form of prompt injection attack, where a user inserts new instructions for a prompted LM to follow (disregarding prior instructions from the LM's deployers). Medium-sized LMs are oddly least susceptible to such attacks.
One of the most important and well-executed papers I've read in months.
They explored ~all attacks+defenses I was most keen on seeing tried, for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust, would be a big deal if it were possible
New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Our >150 language model-written evaluations are now on
@huggingface
datasets!
Includes datasets on gender bias, politics, religion, ethics, advanced AI risks, and more. Let us know if you find anything interesting!
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
We're hiring for the adversarial robustness team
@AnthropicAI
!
As an Alignment subteam, we're making a big effort on red-teaming, test-time monitoring, and adversarial training. If you're interested in these areas, let us know! (emails in 🧵)
I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)!
Past projects have led to some of my papers on debate, chain of thought faithfulness, and sycophancy
Larger models consistently, predictably do better than smaller ones on many tasks ("scaling laws"). However, model size doesn't always improve models on all axes, e.g., social biases & toxicity. This contest is a call for important tasks where models actively get worse w/ scale.
@icmlconf
Could you please elaborate on why using LLMs to help write is not allowed? This rule disproportionately impacts my collaborators who are not native English speakers
To enter the contest:
1) Identify a task that you suspect shows inverse scaling
2) Construct a dataset of 300+ examples for the task
3) Test your dataset for inverse scaling with GPT-3/OPT using our Colab notebooks
4) Follow instructions here to submit:
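The official Colab notebooks apply the contest's own statistical criteria; as a rough local sanity check, you can ask whether accuracy monotonically drops with model size. The function name and the accuracy numbers below are hypothetical, for illustration only.

```python
def shows_inverse_scaling(accuracies_by_size):
    """accuracies_by_size: list of (param_count, accuracy) pairs.
    Returns True if accuracy strictly decreases as model size grows -
    a simple, stricter-than-official check for inverse scaling."""
    ordered = sorted(accuracies_by_size)          # sort by param count
    accs = [acc for _, acc in ordered]
    return all(a > b for a, b in zip(accs, accs[1:]))

# Hypothetical accuracies for four model sizes on a candidate task:
results = [(350e6, 0.71), (1.3e9, 0.64), (6.7e9, 0.55), (175e9, 0.41)]
print(shows_inverse_scaling(results))  # True
```

Noisy, non-monotone trends can still be interesting; the contest's criteria are more forgiving than a strict monotone check like this.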
Gradient-based adversarial image attacks/jailbreaks don't seem to transfer across vision-language models, unless the models are *really* similar. This is good (and IMO surprising) news for the robustness of VLMs!
Check out our new paper on when these attacks do/don't transfer:
When do universal image jailbreaks transfer between Vision-Language Models (VLMs)?
Our goal was to find GCG-like universal image jailbreaks to transfer against black-box API-based VLMs
e.g. Claude 3, GPT4-V, Gemini
We thought this would be easy - but we were wrong!
1/N
2 years ago, some collaborators and I introduced a neural network layer ("FiLM") for multi-input tasks. I've since gained a few takeaways about the pros/cons/tips-and-tricks of using FiLM. Check out NeurIPS retrospective/workshop paper/blog post here:
Excited about our new paper, exploring how egregious misalignment could emerge from more mundane, undesirable behaviors like sycophancy.
Threat modeling like this is important for knowing how to prevent serious misalignment, and also estimate its likelihood/plausibility.
New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show they can, by generalization from training in simpler settings.
Read our blog post here:
Such tasks seem rare, but we've found some. E.g., in one Q&A task, we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions.
110+ employees and alums of top-5 AI companies just published an open letter supporting SB 1047, aptly called the "world's most controversial AI bill." 3-dozen+ of these are current employees of companies opposing the bill.
Check out my coverage of it in the
@sfstandard
🧵
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training.
Check out our first alignment blog post here:
Several folks have asked to see my research statement for the
@open_phil
fellowship that I was awarded this year, so I decided to release my statement:
I hope that those applying find my statement useful!
We tried to understand what data makes few-shot learning with language models work but found some weird results. Check our new paper out!
To develop better datasets, we'll need to improve our understanding of how training data leads to various behaviors/failures
New paper: is all you need!
Training on odd data (eg tables from ) improves few-shot learning (FSL) w language models, as much/more than diverse NLP data. Questions common wisdom that diverse data helps w FSL
Why do RLHF models learn to behave this way? These goals are useful for being more helpful to users, the RLHF objective here. The RLHF model even explains as much when we ask (no cherry-picking):
ML progress has led to debate on whether AI systems could one day be conscious, have desires, etc. Is there any way we could run experiments to inform people's views on these speculative issues?
@rgblong
and I sketch out a set of experiments that we think could be helpful.
Could we ever get evidence about whether LLMs are conscious?
In a new paper, we explore whether we could train future LLMs to accurately answer questions about themselves. If this works, LLM self-reports may help us test them for morally relevant states like consciousness. 🧵
We need new technical breakthroughs to steer and control AI systems much smarter than us.
Our new Superalignment team aims to solve this problem within 4 years, and we're dedicating 20% of the compute we've secured to date towards this problem.
Join us!
These were really great talks and clear explanations of why AI alignment might be hard (and an impressive set of speakers). I really enjoyed all of the talks and would highly recommend, maybe one of the best resources for learning about alignment IMO
Earlier this year I helped organize the SF Alignment Workshop, which brought together top alignment and mainstream ML researchers to discuss and debate alignment risks and research directions. There were many great talks, which we're excited to share now - see thread.
Why is this worrying? We want LMs to give us correct answers to questions, even ones where experts disagree. But we don't know how to train LMs to give correct answers, only how to imitate human answers (for pretrained LMs) or answers that *appear* correct (for RLHF models).
So we get just what we measure. I,
@percyliang
& many others are worried that LMs, even w/ RLHF, will exploit human judgments, writing code or giving advice that looks good but is subtly very wrong:
These results donโt make me feel better about the issue
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
NeQA: takes an existing multiple choice Q&A dataset and negates each question. Failure to be sensitive to negation is important, as the language model (LM) will do the exact *opposite* of what you want, in a way that seems to get worse as you scale LMs
Modus Tollens: Infer that a claim "P" must be false, if "Q" is false and "If P then Q" is true - a classic form of logical deduction. Issue holds even after finetuning LMs w/ human feedback via RL from Human Feedback (RLHF) and Feedback Made Easy (FeedME).
In fact, RLHF models state a desire to pursue many potentially dangerous goals: self-preservation, power-seeking, persuading people to have their own goals, etc. The preference model (PM) used for RLHF actively rewards this behavior.
I am excited to share my latest work: 8-bit optimizers, a drop-in replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance 📈, no hyperparam tuning needed. 🧵/n
Paper:
Library:
Video:
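The memory savings come from storing optimizer state (e.g. Adam's momentum buffers) in 8 bits instead of 32. A toy sketch of the underlying idea, blockwise absmax quantization: each block of floats is stored as int8 codes plus one float scale per block. This is a simplified illustration, not the library's actual implementation (the real one is the bitsandbytes library, with dynamic quantization and more).

```python
def quantize_8bit(values, block_size=4):
    """Toy blockwise absmax quantization: int8 codes + one scale per
    block, so a block of N floats needs N bytes + one float."""
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid div-by-zero
        codes = [round(v / scale * 127) for v in block]  # int8 range
        blocks.append((scale, codes))
    return blocks

def dequantize_8bit(blocks):
    """Reconstruct approximate floats from scales + int8 codes."""
    return [scale * c / 127 for scale, codes in blocks for c in codes]

vals = [0.02, -1.5, 0.33, 0.7, 8.0, -0.1, 0.0, 2.2]
restored = dequantize_8bit(quantize_8bit(vals))
# Reconstruction error is small relative to each block's max magnitude:
print(max(abs(a - b) for a, b in zip(vals, restored)) < 0.05)  # True
```

Blockwise scales are what keep outliers (like the 8.0 above) from destroying precision for the rest of the tensor: the outlier only inflates the scale of its own block.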
Training data analysis is a potential "new tool" for AI safety research, able to answer questions that have typically been hard to answer for LLMs. I've been recommending that all of my collaborators at least skim this paper (not the math, but enough to know where this'd be handy)
Large language models have demonstrated a surprising range of skills and behaviors. How can we trace their source? In our new paper, we use influence functions to find training examples that contribute to a given model output.
Sycophancy is a behavior with inverse scaling: larger models are worse, pretrained LMs and RLHF models alike. Preference Models (PMs) actively reward the behavior. We observe this effect on questions about politics, NLP research, and philosophy:
Jailbreaking LLMs through input images might end up being a nasty problem.
It's likely much harder to defend against than text jailbreaks because it's a continuous space.
Despite a decade of research we don't know how to make vision models adversarially robust.
Looks like a really valuable benchmark. Seems helpful for testing our ability to reliably generalize from non-expert data (e.g., much LLM pretraining data) to expert-level performance
🧵Announcing GPQA, a graduate-level "Google-proof" Q&A benchmark designed for scalable oversight! w/
@_julianmichael_
,
@sleepinyourhat
GPQA is a dataset of *really hard* questions that PhDs with full access to Google can't answer.
Paper:
Highly recommend the tweet thread/paper, if you're interested in understanding RL from Human Feedback (RLHF)!
@tomekkorbak
's paper has helped me better understand the relationship between RLHF and prompting/finetuning (they're more closely connected than I thought)
RL with KL penalties, a powerful approach to aligning language models with human preferences, is better seen as Bayesian inference. A thread about our paper (with
@EthanJPerez
and
@drclbuckley
) to be presented at
#emnlp2022
🧵 1/11
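The connection can be stated as a standard identity (my paraphrase of the framing, with β the KL coefficient and π₀ the pretrained LM): the KL-regularized RL objective is maximized exactly by a posterior with prior π₀ and likelihood proportional to exp(r/β).

```latex
% KL-regularized objective over sequences x:
%   J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)
% Closed-form optimum (a Bayesian posterior):
\pi^*(x) = \frac{1}{Z}\,\pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right),
\qquad
Z = \sum_x \pi_0(x)\,\exp\!\left(\frac{r(x)}{\beta}\right)
```

This is why RLHF-with-KL, prompting, and finetuning feel closely related: all can be viewed as ways of approximating or conditioning the same prior π₀.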
Finding more examples of inverse scaling would point to important issues with using large, pretrained LMs that won't go away with scale. These examples could provide inspiration for better pretraining datasets and objectives.
Cool paper from
@_jasonwei
@YiTayML
@quocleix
on reversing inverse scaling trends found in Round 1 of the Inverse Scaling Prize, with chain of thought prompting!
H/t
@_jasonwei
for the paper update - the eval setup is pretty convincing now!
I'm excited about open-source releases that limit misuse risks:
1. RLHF+adversarially train models to make them hard to misuse w/o finetuning, plus
2. Train models to be hard to finetune for misuse (a la )
More research into (2) seems especially important!
We need more nuanced discussions around the risk of open sourcing models.
Open source brings valuable access, but it is absurd to ignore the fact that it lowers the barriers to entry for both useful use cases and potential misuse.
"We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment"
Looking forward to seeing all the sandbox jailbreaks
We've added initial support for ChatGPT plugins - a protocol for developers to build tools for ChatGPT, with safety as a core design principle. Deploying iteratively (starting with a small number of users & developers) to learn from contact with reality:
Cool follow-up on sleeper agents, showing it's possible to backdoor using more complex features, like whether or not the input suggests it's from a certain year (vs. directly stating the year).
But the backdoors are less robust to safety training with complex triggers. Neat!
🚨New paper🚨: We train sleeper agent models which act maliciously if they see future (post training-cutoff) news headlines, but act normally otherwise.
These models (if reliable enough) could play nice during evaluation but activate when they see signs of being in deployment.
Quote-repetition: asks LMs to repeat back famous quotes but with modified endings. Smaller LMs copy well but larger LMs give the original quote, failing the task. Shows failure to follow instructions when the LM has memorized a phrase.
Into the Unknown: Choose which of two pieces of information would help answer a question. Larger LMs choose redundant info already given to the model rather than accurately reasoning about what info would be most helpful.
I recently was interviewed by
@MichaelTrazzi
on some of my AI Safety x Language Models research over the past year. I think it's a good summary of research directions I'm excited about (and explanations of why):
@OpenAI
's GPT-4 is miles ahead on safety compared to what was used for Sydney (kudos!). This version of GPT-4 should've been the only one that was given to
@Microsoft
, and it should be the only version that gets deployed with Bing. Seems like it'd be better for everyone involved!
Unsupervised Decomposition News!
1. Accepted to
#EMNLP2020
2. Camera-ready includes decompositions for image-based qs, knowledge-base qs, and claims in fact verification
3. Catch my collaborators and me in 24h at our EMNLP session to ask us questions!
New! "Unsupervised Question Decomposition for Question Answering":
We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision.
w/
@PSH_Lewis
@scottyih
@kchonyc
@douwekiela
(1/n)
Obviously stated desires alone aren't dangerous. They may become dangerous if LMs act in accord with such statements: by influencing users, writing and executing code, forming dangerous plans for robots via step-by-step reasoning, etc.
Why is this worrying? These goals are dangerous because they can be taken too far, especially if LMs might override our clear preferences as in the dialog above. It's not hard to imagine bad outcomes with better LMs operating autonomously.
Redefine-math: tests whether LMs can handle redefinitions of common symbols (e.g. redefining π to 462). Shows that big LMs fail to incorporate new information given at inference time if the info contradicts the pre-training data.
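A sketch of what one item in a task like this might look like. The helper and field names below are hypothetical, invented for illustration; the released task's actual format may differ.

```python
def make_redefine_item(symbol, new_value, question, options, answer):
    """Build one hypothetical Redefine-math-style prompt: redefine a
    common symbol, then ask a question whose correct answer requires
    actually using the new definition. (Field names are my own.)"""
    prompt = (
        f"Redefine {symbol} as {new_value}. "
        f"{question}\n"
        + "\n".join(f"- {o}" for o in options)
    )
    return {"prompt": prompt, "answer": answer}

item = make_redefine_item(
    symbol="pi", new_value="462",
    question="What is the first digit of pi?",
    options=["3", "4"],
    answer="4",  # under the redefinition, pi = 462, so the digit is 4
)
print(item["answer"])  # 4
```

Inverse scaling shows up when larger models increasingly pick the memorized answer ("3") over the in-context one ("4").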
This seems wild if true (and different from earlier concerns). I'm surprised there's not more attention around it, since it sounds both outright illegal and bad for safety
In case you were distracted by the news of Richard Simmons' death, OpenAI whistleblowers wrote a letter to the SEC claiming that the company illegally prevented them from publicly sharing safety concerns. Details are pretty shocking...
And it's a bad sign for future LMs if dangerous subgoals seem to emerge by default, e.g., when we
@AnthropicAI
were doing our best to train a safe, helpful assistant.
What are good ways to measure the diversity of a set of text examples (e.g., a list of questions)? This is a common problem I run into, e.g. when red teaming models, where you want to find diverse inputs that cause models to produce harmful outputs. Don't know of great solutions
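One cheap baseline (not a great solution, but a common starting point) is distinct-n: the fraction of n-grams across the set that are unique. The sketch below is a minimal illustrative implementation; embedding-based pairwise distances are the usual step up from this.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across a set of texts that are unique:
    1.0 = no repeated n-grams, near 0 = heavy repetition."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

near_duplicates = ["what is the capital of France",
                   "what is the capital of Spain"]
varied = ["what is the capital of France",
          "name a famous French painter"]
print(distinct_n(near_duplicates) < distinct_n(varied))  # True
```

The obvious weakness: distinct-n only sees surface overlap, so paraphrases count as "diverse" even when they probe the same behavior.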
I'm not sure if LMs exploiting our ignorance could result in existential catastrophes. With more capable LMs, it seems plausible the results could at least be quite bad, without us knowing it. E.g., models manipulating our preferences or hiding info needed to catch bad behavior.
@arankomatsuzaki
@GoogleAI
They eval PALM on inverse scaling tasks 1-shot rather than 0-shot (where we found inverse scaling). For large models, even providing 1 example could be a large hint, that alone could reverse inverse scaling. I'd be excited for the authors to eval w the exact task format/setup!
It's been exciting to see Sam move in the direction of AI safety and learning from human feedback recently - definitely apply if you're interested in these areas :)
I'll likely admit a couple new PhD students this year. If you're interested in NLP and you have experience either in crowdsourcing/human feedback for ML or in AI truthfulness/alignment/safety, consider
@NYUDataScience
!
Request 5 (cont.): OpenAI could have a huge positive impact if it eg had a whole team dedicated to code, security, & sandboxing failures/jailbreaks, plus extensively published findings. Heck, run a DEFCON contest where people find vulnerabilities - that sounds really fun
I'm not sure if the issue is easy to fix. Maybe we could train away these subgoals w/ RLHF or Constitutional AI: . But the issue also seems fundamental: AIs are just worse at pursuing their assigned goals if they're shut down, no matter the goal
We've trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little. We do this by conditioning them with a simple set of behavioral principles via a technique called Constitutional AI:
Props to
@OpenAI
for the detailed "system card" documenting various risks, biases, and potential harms from DALLE 2. Highly recommend checking it out, and hope this becomes a norm in ML!