As AI improves, humans will need more and more help to monitor and control it. So my team at OpenAI has trained an AI that helps humans evaluate AI! (1/5)
We need new technical breakthroughs to steer and control AI systems much smarter than us.
Our new Superalignment team aims to solve this problem within 4 years, and we're dedicating 20% of the compute we've secured to date to this effort.
Join us!
Working on a 280-billion-parameter language model has greatly reduced how long I think it will take to build AGI. Very excited we finally released the details of Gopher - awesome work from the team!
@drjwrae
@geoffreyirving
Cooking up something special! Can't wait to get a paper out so everyone can try it out.
An optimizer with no extra overhead, no additional parameters.
Stay tuned!
Nonsense QA with no prompting is not an interesting failure of large language models. Any vaguely sensible prompt (like the one from the Gopher paper) greatly reduces it, indicating it will not be a hard problem to completely solve with RL etc.
seeing o1 do this well on completely fresh international-competition level coding problems was an amazing moment.
If you don't agree that this is novel reasoning then your definition of novel reasoning is broken 😅
Evaluating o1 on the International Olympiad in Informatics was very personally meaningful to me. When I competed nine years ago, I never thought I'd be back so soon, competing with an AI.
To highlight how amazing this model is, we shared on Codeforces its best IOI submissions ⬇️
@robertskmiles
suppose scores are [100, 100, 1, 2, 3]: indices 0 and 1 are by far the best, i.e. in the closely tied group that is far better than the rest of the distribution
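A toy sketch of that selection in my own code (not from the thread), with an arbitrary relative-gap threshold standing in for "closely tied":

```python
# Pull out the "closely tied top group": everything within a chosen
# relative gap of the best score. The 5% threshold is illustrative.
def top_tied_group(scores, rel_gap=0.05):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = scores[order[0]]
    group = []
    for i in order:
        if scores[i] < best * (1 - rel_gap):
            break  # the tie is broken: everything below is far worse
        group.append(i)
    return group

print(top_tied_group([100, 100, 1, 2, 3]))  # -> [0, 1]
```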
I'm excited to join
@AnthropicAI
to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
people keep saying that the rollout and adoption of AGI is gonna take a lot of thought, software engineering and intelligence.
oh boy do I have the technology for you!
Minerva and DeepNash are both surprising progress even against my short timelines. Much more so than DALL-E 2 was (having already seen GLIDE), but around that GLIDE level of omg. Imagining 2030 is getting really hard.
@percyliang
@NPCollapse
the area of "scalable oversight" focuses on precisely this - see the work of myself & Geoffrey Irving & Sam Bowman & Ethan Perez & Jeffrey Wu & Jan Leike & many others. Sam's latest paper is excellent:
We looked specifically at model-written code for almost all our evaluations, and we already see huge potential for GPT-4-class models to assist humans in RLHF labelling (2/5)
1/ Notable how three pioneers of deep learning (recognised with their shared 2018 Turing Award) have substantially diverged on how they assess risk from superintelligence:
Now is the time for progress on superintelligence alignment; this is why
@ilyasut
and
@janleike
are joining forces to lead the new super-effort. Join us!
Alternatively, if you have no idea why folks are talking so seriously about risks from rogue AI (but you have a science or engineering background) here’s a super-alignment reading list…
Believing that cars could change Earth's climate requires imagining hundreds of millions of them in circulation, an unrealistic scenario benefiting the auto industry's narrative. They're just pushing car hype!
Overall it was an exciting and huge collaboration, the results of which you can now read: . Huge thanks to everyone involved, but particularly to the rest of the joint-first authors who made the thing work!
@mia_glaese
@majatrebacz
@john_aslanides
(5/5)
I'm tremendously excited for the future of human-machine teams in evaluation and training. If you want to work on this technology, one of the best ways to do it is to work for
@mia_glaese
who runs human data here at OpenAI. They’re the best in the business! (4/5)
twenty years from now, i would bet ai x-risk will look a lot like y2k does now
- nothing cataclysmic happened, so the common view is that it was a fake concern all along
- the counterfactual risk was actually real though
- it was only averted through a lot of human effort
We started by RL fine-tuning models to be more helpful, but that made the resulting policies much more exploitable when you try to trick them into bad behaviour. We had to jointly train for usefulness and safety to get better at both! (2/5)
The description length of all models in the same transformer family is the same at initialisation, regardless of param count; it is this description length (& not the one after training) that is relevant to optimal compression. (1/n)
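One standard way to make that precise is the prequential (online) coding view; the sketch below is my own notation, not a formula from the thread:

\[
L(D) \;=\; \underbrace{L(\mathcal{A})}_{\text{architecture, init, training loop}} \;+\; \sum_{t=1}^{n} -\log_2 p_{\theta_t}\!\left(x_t \mid x_{<t}\right)
\]

where \(\theta_t\) are the weights after training on \(x_{<t}\). The first term is a small, essentially fixed amount of code regardless of parameter count, which is why the description length that matters is the one at initialisation rather than the size of the trained weights.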
The "scale is all you need" hype around ever-larger language models is a striking inversion of our (usual) preference for making models simpler. Occam's Big Ball of Mud
So happy to finally talk about Red Teaming! TL;DR even seemingly well behaved dialogue models fall down completely if you search hard enough for adversarial questions...
Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users.
Read more: 1/
And the harm mitigations we used (rule classifiers and preference models) *don't* solve distributional bias problems - they just remove bad behaviours that you can see in a single sample (lots of detail in paper, 4/5)
i'd particularly like to recognize
@CollinBurns4
for today's generalization result. Collin came to openai excited to pursue this vision and helped get the rest of the team excited about it!
As LLMs become capable of superhuman reasoning we need methods that let us understand why and how they reached their conclusions. Unfortunately the chain of thought that gets the best performance might not be the easiest for humans to understand — the “legibility gap”. 2/n
I’m thrilled the team were able to formally define a notion of legibility that behaves sensibly when approximately optimized and that the results generalize to humans. Amazing work by the two lead authors
@cynnjjs
and
@janhkirchner
and the rest of the team! 3/n
> "Our generation too easily takes for granted that we live in peace and freedom. And those who herald the age of AGI in SF too often ignore the elephant in the room: superintelligence is a matter of national security, and the United States must win." (1/2)
fun story:
terry tao was on both my and my brother's committees.
he solved both our dissertation problems before we were done talking; each of us got "wouldn't it have been easier to...outline of entire proof" 🫠
The first author of this astrophysics paper found that if he gave o1-preview the methods section, it was able to reproduce, in 5 prompts, 10 months of coding work he did as a PhD student (a few caveats in the video)
Side note: all of your methods sections are becoming instruction manuals.
I feel like the image understanding capabilities of GPT-4 are currently underrated. API access or the evals paper are going to blow minds. (based only on the assumption that the paper examples are not cherry-picked, which was true of GPT-3)
Catchy quote, but Bernard Arnault just became the richest man on earth; are you sure you want to short fashion?
Like it or not, when people gain material abundance, they mostly spend it on status. The real question is whether we can design status games that are positive-sum.
we laughed at this originally but LLM @ int4 + inference hardware and batteries would fit, either now or in one generation's time. so we have evolved to the point of the internet in a box / hitchhiker's guide. exciting times.
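A quick back-of-envelope check, using my own illustrative numbers rather than anything from the tweet:

```python
# int4 stores each weight in 4 bits = 0.5 bytes, so even a fairly large
# model's weights fit in tens of gigabytes of storage.
params = 70e9                # hypothetical 70B-parameter model
bytes_per_param = 0.5        # 4 bits per weight at int4
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")  # -> ~35 GB
```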
How do we learn what will be informative? It helps to separate aleatoric & epistemic uncertainty. Ian argues you can do this with the joint distribution of your labels - and has a key paper on it, introducing EpiNets (3/n)
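For reference, one common way to write that split (my notation, not necessarily the exact formulation in the epinet paper) decomposes total predictive uncertainty into an aleatoric and an epistemic term:

\[
\underbrace{H\!\left[\mathbb{E}_{\theta \sim p(\theta \mid D)}\, p(y \mid x, \theta)\right]}_{\text{total}}
\;=\;
\underbrace{\mathbb{E}_{\theta \sim p(\theta \mid D)}\, H\!\left[p(y \mid x, \theta)\right]}_{\text{aleatoric}}
\;+\;
\underbrace{I\left(y;\, \theta \mid x, D\right)}_{\text{epistemic}}
\]

The epistemic term vanishes exactly when all plausible models agree on the label distribution, which is the part that tells you whether a new label would be informative.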