Stephen McAleer

@McaleerStephen

4,165 Followers · 856 Following · 69 Media · 572 Statuses

Researching agent safety at OpenAI

San Francisco
Joined July 2014
Pinned Tweet
@McaleerStephen
Stephen McAleer
8 months
"Toward General Virtual Agents" I recently gave a talk at MIT. I argued that we should use tools from reinforcement learning and search to improve the capability and alignment of LLM agents. Slides: Video:
5
78
504
@McaleerStephen
Stephen McAleer
8 months
We invented Q* first. Glad OpenAI is building on top of our idea
Tweet media one
61
292
3K
@McaleerStephen
Stephen McAleer
2 months
I joined OpenAI where I'm researching agent safety! As we approach AGI, ensuring that powerful agents are aligned and safe will become the key bottleneck to making them useful. Our team is hiring, consider applying!
39
21
539
@McaleerStephen
Stephen McAleer
1 year
Forget plugins. ChatGPT can solve general computer tasks using a keyboard and mouse!! The trick? Recursively criticizing and improving the output (RCI). We also find that RCI prompting outperforms CoT prompting on reasoning tasks. Paper, website, and GitHub in the 🧵below.
13
75
431
@McaleerStephen
Stephen McAleer
8 months
Guys, this was just a joke. I don't know what OpenAI Q* is 😂
13
5
273
@McaleerStephen
Stephen McAleer
7 months
This is an interesting paper that learns a process reward model without human annotations. The idea is to evaluate the accuracy of full reasoning traces generated from a given partial reasoning step. Nice to see Llemma-34B getting 47.3% on MATH!
Tweet media one
4
25
185
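For intuition, here is a minimal sketch of the rollout-based step scoring the tweet describes: a partial reasoning prefix is scored by the fraction of sampled completions that reach a correct final answer. `sample_completions` and `is_correct` are hypothetical stand-ins for an LLM sampler and an answer checker, not functions from the paper.

```python
# Minimal sketch of rollout-based step labeling: score a partial reasoning
# prefix by the fraction of sampled completions that reach a correct answer.
# `sample_completions` and `is_correct` are hypothetical helpers.
from typing import Callable, List


def step_score(
    prefix: str,
    sample_completions: Callable[[str, int], List[str]],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 16,
) -> float:
    """Estimate the value of a partial reasoning step via Monte Carlo rollouts."""
    completions = sample_completions(prefix, num_rollouts)
    return sum(is_correct(c) for c in completions) / num_rollouts
```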
@McaleerStephen
Stephen McAleer
7 months
Why do RLHF at all? If you had enough SFT data could you get the same performance? Is preference data just easier to collect? There is less information in preference data--e.g. you could infer preferences from SFT data using inverse RL. Would love references on this!
22
7
142
@McaleerStephen
Stephen McAleer
9 months
AI Alignment: A Comprehensive Survey We break AI alignment into four categories: 1. Learning from feedback (e.g. RLHF) 2. Learning under distribution shift 3. Assurance (e.g. interpretability) 4. Governance Reply with any references we missed!
Tweet media one
3
24
135
@McaleerStephen
Stephen McAleer
11 months
I'm co-teaching a course this semester at CMU on computational game solving. The first half focuses on foundations and the second half covers state-of-the-art approaches for large games such as Stratego, Diplomacy, etc.
2
23
121
@McaleerStephen
Stephen McAleer
11 months
I'm excited to be co-organizing the Foundation Models for Decision Making workshop at NeurIPS! Please consider submitting your recent work on LLM agents, foundation models + RL, RLHF, LLMs + search, etc. The deadline is in one month on October 1st.
Tweet media one
3
16
96
@McaleerStephen
Stephen McAleer
7 months
Some remaining questions after NeurIPS: 1. Will training larger foundation models on the internet lead to advanced agents or do we need agent data? 2. How do we align foundation model agents? 3. Will RLAIF work for agents? 4. Is RL dead or just getting started? 5. Reasoning??
15
12
89
@McaleerStephen
Stephen McAleer
1 year
Excited to share our new preprint on using ensembles for cooperative MARL exploration! The main idea is that agents should keep exploring states that have high uncertainty, because these are the ones that usually require coordination.
Tweet media one
3
12
88
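A hedged sketch of the ensemble-disagreement idea from the tweet: give agents an exploration bonus in states where an ensemble of value networks disagrees. The network shapes and the choice of standard deviation as the bonus are illustrative assumptions, not the paper's exact method.

```python
# Ensemble-disagreement exploration bonus: reward states where an ensemble of
# value networks disagrees, since (per the tweet) these tend to be the states
# that require coordination. Architecture details are illustrative.
import torch
import torch.nn as nn


class ValueEnsemble(nn.Module):
    def __init__(self, obs_dim: int, num_members: int = 5, hidden: int = 64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_members)
        )

    def exploration_bonus(self, obs: torch.Tensor) -> torch.Tensor:
        # Stack per-member value estimates and use their std as the bonus.
        values = torch.stack([m(obs).squeeze(-1) for m in self.members], dim=0)
        return values.std(dim=0)
```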
@McaleerStephen
Stephen McAleer
1 year
I'll be attending #ICML2023 next week in Hawaii! I'll be presenting three main conference papers and three workshop papers (🧵below) I will be going on the job market in the fall. Please reach out if you think I will be a good fit!
Tweet media one
2
5
71
@McaleerStephen
Stephen McAleer
8 months
Just wrapped up a great semester teaching computational game solving. Check out all the lectures here:
@McaleerStephen
Stephen McAleer
11 months
I'm co-teaching a course this semester at CMU on computational game solving. The first half focuses on foundations and the second half covers state-of-the-art approaches for large games such as Stratego, Diplomacy, etc.
2
23
121
1
10
72
@McaleerStephen
Stephen McAleer
2 years
We introduce a simple ensemble method to reduce variance in value estimation. MeanQ maintains an ensemble of k Q networks and then takes the max of the average as a target value in the standard TD loss. To appear at #ICML2022 ! (link: )
Tweet media one
4
12
62
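For reference, a minimal sketch of the MeanQ target described above: average Q-values across the ensemble of k target networks, then take the max over actions inside the standard TD target. Function and variable names are illustrative.

```python
# MeanQ-style TD target (sketch): mean over an ensemble of k target Q-networks,
# then max over actions. Names are illustrative, not the paper's code.
import torch


def meanq_target(q_ensemble, next_obs, rewards, dones, gamma=0.99):
    """q_ensemble: list of k target Q-networks mapping obs -> (batch, num_actions)."""
    with torch.no_grad():
        # Mean over the ensemble, then max over actions.
        mean_q = torch.stack([q(next_obs) for q in q_ensemble], dim=0).mean(dim=0)
        max_mean_q, _ = mean_q.max(dim=-1)
        return rewards + gamma * (1.0 - dones) * max_mean_q
```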
@McaleerStephen
Stephen McAleer
6 months
Excited that our paper "Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision Makers" got accepted as a spotlight in ICLR! We show that existing attacks are highly detectable and introduce new undetectable attacks.
Tweet media one
0
5
56
@McaleerStephen
Stephen McAleer
5 months
After using Claude 3 Opus for a week I get annoyed whenever I go back to GPT-4. Seems to be all around better. Excited to try out GPT-4.5 soon!
2
3
54
@McaleerStephen
Stephen McAleer
8 months
Our recent paper shows how computing optimal mechanisms is as easy as solving a two-player zero-sum game. Now we can use game-theoretic reinforcement learning to scalably design optimal auctions! To appear at NeurIPS. Paper link:
Tweet media one
0
4
49
@McaleerStephen
Stephen McAleer
7 months
@_aidan_clark_ The master equation
Tweet media one
4
1
50
@McaleerStephen
Stephen McAleer
8 months
Interested in game-theoretic RL for team games? Check out our Team PSRO paper, to appear in NeurIPS.
Tweet media one
2
6
47
@McaleerStephen
Stephen McAleer
5 years
Our paper "Solving the Rubik's Cube with Deep Reinforcement Learning and Search" has been published in Nature Machine Intelligence. You can check it out here:
4
8
45
@McaleerStephen
Stephen McAleer
1 year
This is joint work with my great co-authors Geunwoo Kim and Pierre Baldi. Please reach out if you have any questions or feedback! 📄 Paper: 🌐 Website: 🐙 GitHub:
1
3
43
@McaleerStephen
Stephen McAleer
5 years
@LesHorn I took an online walking class. Every week you had to take a ten-minute walk and then write a 200-word discussion post about your walk. The final was a 30-minute walk.
2
1
39
@McaleerStephen
Stephen McAleer
8 months
@bindureddy Actually the opposite, anyone can go and curate the same dataset that OpenAI did to train GPT-4. Now with synthetic data, compute becomes an even bigger barrier between the GPU rich and poor.
2
0
38
@McaleerStephen
Stephen McAleer
2 years
Very excited to share our work on RL for expert-level Stratego! It was a pleasure to work with Julien Perolat, Bart De Vylder, Karl Tuyls, and many other great researchers on this project.
@GoogleDeepMind
Google DeepMind
2 years
DeepNash is an agent trained with model-free multiagent reinforcement learning that learns to play the game of Stratego at expert level. Learn more: 1/
Tweet media one
12
222
1K
0
2
34
@McaleerStephen
Stephen McAleer
6 years
We create an algorithm which is able to teach itself how to solve the Rubik's Cube
@Miles_Brundage
Miles Brundage
6 years
"Solving the Rubik's Cube Without Human Knowledge," McAleer, Agostinelli, and Shmakov et al.:
0
5
21
2
10
34
@McaleerStephen
Stephen McAleer
2 years
Policy Space Response Oracles (PSRO) mixes over a population of deep RL policies to approximate a Nash equilibrium, but exploitability can increase from one iteration to the next. We introduce Anytime PSRO which does not increase exploitability. Arxiv:
Tweet media one
1
8
33
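For context, a high-level sketch of the standard PSRO loop the tweet builds on; `empirical_payoffs`, `solve_meta_game`, and `train_best_response` are hypothetical stand-ins for game simulation, a meta-solver, and an RL best-response oracle. Anytime PSRO changes how the restricted distribution is updated so that exploitability does not increase between iterations; that modification is not shown here.

```python
# Generic PSRO loop (sketch): grow a population by repeatedly solving the
# restricted meta-game and training a best response to the resulting mixture.
def psro(initial_policy, num_iterations, empirical_payoffs, solve_meta_game, train_best_response):
    population = [initial_policy]
    for _ in range(num_iterations):
        payoff_matrix = empirical_payoffs(population)       # simulate all matchups
        meta_strategy = solve_meta_game(payoff_matrix)       # e.g. Nash of the restricted game
        best_response = train_best_response(population, meta_strategy)  # RL oracle
        population.append(best_response)
    final_meta_strategy = solve_meta_game(empirical_payoffs(population))
    return population, final_meta_strategy
```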
@McaleerStephen
Stephen McAleer
8 months
Reading the Reuters story it sounds like OpenAI's Q* may have aced the MATH benchmark?? Acing GSM8K wouldn't be impressive enough. Who knows though...
3
2
32
@McaleerStephen
Stephen McAleer
8 months
Come by our panel at the Foundation Models for Decision Making workshop at 3:00 today in Hall E2!
Tweet media one
0
4
32
@McaleerStephen
Stephen McAleer
4 years
Our paper Optimizing Multiagent Cooperation via Policy Evolution and Shared Experiences got accepted to #icml2020 ! Joint work with Shauharda Khadka, Somdeb Majumdar, Santiago Miret, and Kagan Tumer @IntelAI
Tweet media one
4
1
31
@McaleerStephen
Stephen McAleer
2 years
Accepted at #ICLR2023 !
@McaleerStephen
Stephen McAleer
2 years
How can we find approximate equilibria in games with long horizons? We show that by directly estimating regret with a value function, ESCHER can approximate MCCFR with NNs without using importance sampling! Paper: Code:
Tweet media one
2
4
19
2
0
31
@McaleerStephen
Stephen McAleer
8 months
Excited to be going to #NeurIPS2023 Let me know if you want to grab coffee and speculate about Q*
Tweet media one
3
1
30
@McaleerStephen
Stephen McAleer
4 years
@roydanroy JMLR, AAMAS, COLT, UAI, AISTATS, IJCAI, AAAI, Nature MI
0
0
30
@McaleerStephen
Stephen McAleer
10 months
Don't (over)optimize reward models in RLHF—use them as constraints!! Our method uses constrained RL to minimize the KL to the reference policy while maintaining reward model constraints. Now you can optimize this objective as much as you want!
Tweet media one
@ted_moskovitz
Ted Moskovitz
10 months
Worried your LLM will produce too many paperclips? Simply tell it when to stop – excited to share our new preprint, where we introduce an approach based on constrained RL to avoid overoptimization for compound reward models: 1/
Tweet media one
1
21
120
0
3
26
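One plausible way to write the constrained objective described in the tweet, with the KL to the reference policy minimized subject to reward-model constraints; the thresholds τ_i are assumptions for illustration, not the paper's exact formulation.

```latex
\min_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_i(x, y) \right] \ge \tau_i
\quad \text{for each reward model } r_i .
```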
@McaleerStephen
Stephen McAleer
3 years
Interested in using PSRO to solve a large imperfect-information extensive-form game but worried because PSRO is a normal form algorithm? This paper is for you! We introduce an extensive-form double oracle algorithm (XDO) and scale it up with deep RL (NXDO)
Tweet media one
1
6
25
@McaleerStephen
Stephen McAleer
1 year
RCI prompting first asks the LLM to find problems with the original answer. Then, based on those problems, the LLM can improve the original answer.
Tweet media one
1
0
25
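A minimal sketch of that critique-and-improve loop, assuming a hypothetical text-in, text-out `llm` function; the prompts are illustrative rather than the paper's exact wording.

```python
# Recursive critique-and-improve (sketch): ask the model for problems with its
# answer, then ask it to rewrite the answer using that critique.
from typing import Callable


def rci(llm: Callable[[str], str], task: str, num_rounds: int = 2) -> str:
    answer = llm(task)
    for _ in range(num_rounds):
        critique = llm(
            f"{task}\n\nAnswer: {answer}\n\nReview the answer above and list any problems with it."
        )
        answer = llm(
            f"{task}\n\nAnswer: {answer}\n\nProblems: {critique}\n\n"
            "Based on these problems, write an improved answer."
        )
    return answer
```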
@McaleerStephen
Stephen McAleer
4 years
To appear in #NeurIPS2020
@McaleerStephen
Stephen McAleer
4 years
We introduce Pipeline PSRO, the first scalable, general method for finding approximate Nash equilibria in large games. We demonstrate state-of-the-art performance on Barrage Stratego, a board game much larger than Texas Hold 'Em.
Tweet media one
Tweet media two
1
5
21
0
1
25
@McaleerStephen
Stephen McAleer
4 months
@demishassabis @julesgambit It could happen if you assume that Garry always plays the same move after a given sequence of moves.
4
0
23
@McaleerStephen
Stephen McAleer
4 years
We introduce Pipeline PSRO, the first scalable, general method for finding approximate Nash equilibria in large games. We demonstrate state-of-the-art performance on Barrage Stratego, a board game much larger than Texas Hold 'Em.
Tweet media one
Tweet media two
1
5
21
@McaleerStephen
Stephen McAleer
8 months
I miss the good old days when people would publish their results!
0
2
22
@McaleerStephen
Stephen McAleer
5 years
Excited to be at #iclr2019 to present our paper on solving the Rubik's cube with reinforcement learning.
1
2
19
@McaleerStephen
Stephen McAleer
2 years
How can we find approximate equilibria in games with long horizons? We show that by directly estimating regret with a value function, ESCHER can approximate MCCFR with NNs without using importance sampling! Paper: Code:
Tweet media one
2
4
19
@McaleerStephen
Stephen McAleer
8 months
@finbarrtimbers I'm going to stop publishing my work and just leak it to the press to generate more hype
1
0
16
@McaleerStephen
Stephen McAleer
9 months
Got access to the base GPT-4. Here's the TikZ unicorn it drew. Not bad!
Tweet media one
1
0
15
@McaleerStephen
Stephen McAleer
3 years
Excited to share that my recent work on meta-learning Nash solvers will be published in #NeurIPS2021 ! LMAC is able to train on Kuhn poker and generalize to outperform PSRO on Leduc poker. Paper link here:
Tweet media one
3
2
14
@McaleerStephen
Stephen McAleer
8 months
Diversity measures in PSRO seek to find a diverse population of strategies to solve a game. But existing diversity methods may not actually decrease exploitability. We fix this with our method PSD-PSRO. Check it out today at #NeurIPS23 in the poster session this morning.
Tweet media one
0
1
15
@McaleerStephen
Stephen McAleer
5 months
guess the paper
Tweet media one
5
1
14
@McaleerStephen
Stephen McAleer
6 years
Cool article about our work on the Rubik's cube!
@techreview
MIT Technology Review
6 years
A machine has figured out Rubik’s Cube all by itself
5
114
229
0
7
13
@McaleerStephen
Stephen McAleer
5 months
@MinqiJiang introduced me to the concept of the spike. It's not cool to have a highly cited paper because maybe someone else would have written the same one. What's cool is to have a spike in citations which means you're way ahead of the curve.
0
0
13
@McaleerStephen
Stephen McAleer
1 year
For computer tasks, we apply RCI prompting in three stages. First, the LLM generates a high-level plan. Next, based on the plan and the current state, the LLM generates an action. Finally, the LLM formats the action into the correct keyboard or mouse action.
Tweet media one
1
1
10
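A hedged sketch of that three-stage decomposition, again assuming a hypothetical `llm` completion function and illustrative prompts rather than the paper's exact ones.

```python
# Three-stage RCI for computer tasks (sketch): plan, ground the next action in
# the current state, then format it as a keyboard/mouse command.
def next_computer_action(llm, task: str, state: str) -> str:
    plan = llm(f"Task: {task}\nWrite a high-level plan to complete the task.")
    action = llm(
        f"Task: {task}\nPlan: {plan}\nCurrent state: {state}\nWhat is the next action?"
    )
    return llm(f"Action: {action}\nRewrite this as a single keyboard or mouse command.")
```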
@McaleerStephen
Stephen McAleer
8 months
I'll be presenting this at #NeurIPS2023 on Wednesday, poster session 3, no. 324
@McaleerStephen
Stephen McAleer
1 year
Forget plugins. ChatGPT can solve general computer tasks using a keyboard and mouse!! The trick? Recursively criticizing and improving the output (RCI). We also find that RCI prompting outperforms CoT prompting on reasoning tasks. Paper, website, and GitHub in the 🧵below.
13
75
431
1
0
12
@McaleerStephen
Stephen McAleer
4 years
Proud to join the peaceful #BlackLivesMatter protest in San Clemente today
0
0
11
@McaleerStephen
Stephen McAleer
8 months
@vin_sachi My best guess is they did something like ramping up the verifier from "Let's Verify Step by Step" and putting that in a search algorithm as a heuristic. Maybe with an iterative AlphaZero-style self-improvement component w/ ground-truth data.
3
1
11
@McaleerStephen
Stephen McAleer
5 years
This is what happens when your objective is to maximize users' screen time. YouTube needs to drastically change its recommendation system.
@Max_Fisher
Max Fisher
5 years
Now live: Our monthslong project on YouTube radicalization. As YouTube diverts more and more users down far-right rabbitholes, could its algorithm, in a way, radicalize an entire society? To find out, we went to YouTube's 2nd-largest market: Brazil.
187
7K
11K
1
2
10
@McaleerStephen
Stephen McAleer
3 years
We often choose to delegate our decisions to algorithms. What should a central mediator do when multiple people choose to delegate their actions to the same mediator? In our recent paper we propose a mediator which Pareto-improves delegating agents.
1
2
10
@McaleerStephen
Stephen McAleer
3 years
To appear in #NeurIPS2021 !
@McaleerStephen
Stephen McAleer
3 years
Interested in using PSRO to solve a large imperfect-information extensive-form game but worried because PSRO is a normal form algorithm? This paper is for you! We introduce an extensive-form double oracle algorithm (XDO) and scale it up with deep RL (NXDO)
Tweet media one
1
6
25
0
0
9
@McaleerStephen
Stephen McAleer
8 months
@DrJimFan @RichardSSutton To scalably collect synthetic data we need agents interacting in (virtual) environments.
3
0
11
@McaleerStephen
Stephen McAleer
2 years
How can we use machine learning to prove theorems? In our paper at #ICML2022 we show that a transformer network trained with HER and incremental learning learns heuristics that achieve SOTA performance vs. existing ML approaches! paper:
Tweet media one
Tweet media two
Tweet media three
1
2
9
@McaleerStephen
Stephen McAleer
8 months
Insane how, if we just solved all of math via automated theorem proving, the best way to monetize it would be via software verification
3
0
10
@McaleerStephen
Stephen McAleer
7 months
Would love to see this idea combined with tree search in an expert iteration / AlphaZero setup.
0
0
10
@McaleerStephen
Stephen McAleer
8 months
Acing MATH would be extremely shocking but not completely outside the realm of possibilities.
2
0
10
@McaleerStephen
Stephen McAleer
1 year
What's remarkable about this approach is that it only requires a handful of demonstrations per task, as opposed to existing methods which require thousands of demonstrations per task.
Tweet media one
2
1
9
@McaleerStephen
Stephen McAleer
9 months
Tweet media one
1
0
9
@McaleerStephen
Stephen McAleer
10 months
It was great helping out with this project! I'm super excited about LLMs for math.
@zhangir_azerbay
Zhangir Azerbayev
10 months
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: Paper: More ⬇️
Tweet media one
11
127
549
0
0
9
@McaleerStephen
Stephen McAleer
5 years
This is a great article about the problems with the attention economy: "What Is the Price of Our Attention?" by Quentin Le Garrec
0
1
7
@McaleerStephen
Stephen McAleer
1 year
Not only does RCI prompting improve upon CoT prompting, but we also find that CoT + RCI performs the best out of them all!
Tweet media one
2
2
8
@McaleerStephen
Stephen McAleer
4 years
We use evolutionary algorithms combined with policy gradients to learn from both global and local rewards in multiagent environments.
1
0
7
@McaleerStephen
Stephen McAleer
1 year
Great work by @casdewitt @ssokota @zicokolter @j_foerst and Martin Strohmeier!
@QuantaMagazine
Quanta Magazine
1 year
“As we increasingly become a society where it’s very common to interface with AI models, there are increasingly many opportunities to encode secret information in media that people use all the time.” —Samuel Sokota, a computer scientist at Carnegie Mellon
1
26
67
0
2
7
@McaleerStephen
Stephen McAleer
3 months
I'm in Vienna for #ICLR2024 . Reach out if you want to talk about alignment, agents, AI safety, or anything else! Thread of papers below 👇
Tweet media one
1
0
7
@McaleerStephen
Stephen McAleer
8 months
What is the most promising approach to improving RLAIF besides training a bigger model or doing better prompt engineering? If you finetune the model that's giving AI feedback then it seems like you're back to RLHF.
2
0
6
@McaleerStephen
Stephen McAleer
5 months
Most proprietary data sources probably aren't that valuable unless they are very large, high quality, and hard to find on the internet. For example, a huge dataset of Excel files would be valuable but call transcripts probably wouldn't be.
1
0
6
@McaleerStephen
Stephen McAleer
8 months
@xuanalogue Thanks! We didn't explore that in this paper, but if you were able to train a Q function for dynamic action spaces then you could put it inside A*.
0
0
7
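As a rough illustration of the reply, here is a generic A* sketch where the heuristic slot could be filled by a learned cost-to-go (or Q-derived) estimate; `neighbors` and `heuristic` are hypothetical stand-ins, not code from the paper.

```python
# A* with a pluggable (possibly learned) heuristic. `neighbors(state)` yields
# (next_state, step_cost) pairs; `heuristic(state)` estimates cost-to-go.
import heapq


def astar(start, is_goal, neighbors, heuristic):
    frontier = [(heuristic(start), 0.0, 0, start)]  # (f, g, tiebreak, state)
    best_g = {start: 0.0}
    counter = 0
    while frontier:
        _, g, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return g  # cost of the path found
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                counter += 1
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, counter, nxt))
    return None  # goal unreachable
```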
@McaleerStephen
Stephen McAleer
8 months
Let's see if this drive link works:
0
1
6
@McaleerStephen
Stephen McAleer
10 months
This talk by Jack Rae @drjwrae is a great explanation of the connection between lossless compression and predictive modeling in LLMs. We shouldn't care about lossy compression of our data, but rather we expect that better lossless compressors will generalize better. Link 👇
2
0
5
@McaleerStephen
Stephen McAleer
1 year
@qntm This has been done "The Show About the Show"
1
0
6
@McaleerStephen
Stephen McAleer
3 months
Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
Fri 4:30-6:30 p.m. CEST, Hall B #144
Tweet media one
1
1
6
@McaleerStephen
Stephen McAleer
3 months
What synthetic data techniques were likely used in Llama 3?
1
0
6
@McaleerStephen
Stephen McAleer
7 months
@CalvinMccarter Thanks, I forgot about this, will give it another read. I still suspect you could fix these problems with more advanced imitation learning on SFT data instead of just behavior cloning.
1
0
6
@McaleerStephen
Stephen McAleer
8 months
@nathanwchan What did he mean by this
Tweet media one
3
0
6
@McaleerStephen
Stephen McAleer
4 years
Cool paper from @rythei et al.! They use statistical mechanics to characterize the performance of typical linear classifiers instead of the commonly used worst-case uniform convergence analysis. They find that good linear classifiers are surprisingly abundant even though bad ones still exist.
@StatMLPapers
Stat.ML Papers
4 years
Good linear classifiers are abundant in the interpolating regime. (arXiv:2006.12625v1 [])
0
0
9
0
1
6
@McaleerStephen
Stephen McAleer
1 year
@aidangomezzz I agree that we should keep pushing capabilities research but at the same time we need to research AI safety. What happens when an agent can go off and make money on its own by hiring workers/writing code etc.? It could deceive/manipulate people and accumulate power.
0
0
6
@McaleerStephen
Stephen McAleer
4 months
By 2030 anyone will be able to train their own GPT-6
@fiiiiiist
Tim Fist
4 months
This chart from @paul_scharre is an underrated point about AI proliferation: training at the frontier gets expensive (line going up), but at any fixed capability level gets cheap (lines going down) due to better software+hardware (assumes historical scaling rates holds)
Tweet media one
5
12
91
0
2
6
@McaleerStephen
Stephen McAleer
7 months
I'd love to see an experiment doing RLHF without SFT first and comparing to SFT or just doing more SFT instead of RLHF. Has this been done?
3
1
5
@McaleerStephen
Stephen McAleer
7 years
Difficult to overstate how reckless and out of touch this view is
@TheAtlantic
The Atlantic
7 years
Steve Mnuchin is 'not worried at all' about machines displacing American workers, @gillianbwhite reports
Tweet media one
14
13
22
0
0
5
@McaleerStephen
Stephen McAleer
5 months
@sea_snell Things requiring planning, like playing tic-tac-toe, writing a sentence with a certain number of words, etc.
2
0
5
@McaleerStephen
Stephen McAleer
1 year
#Bard seems pretty bad at logic. #ChatGPT for comparison.
Tweet media one
Tweet media two
4
0
4
@McaleerStephen
Stephen McAleer
2 years
The main idea is to update the restricted distribution via a no-regret algorithm while the opponent best response is training against it. As a result, the restricted distribution will approximate the least-exploitable distribution, and not increase exploitability.
Tweet media one
Tweet media two
1
0
4
@McaleerStephen
Stephen McAleer
1 year
Language Models can Solve Computer Tasks (AI & HCI Workshop) TLDR: We show that ChatGPT can achieve SoTA on MiniWoB using a technique that recursively criticizes and improves its output.
Tweet media one
1
0
5
@McaleerStephen
Stephen McAleer
1 year
A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems (Main Conference) TLDR: We investigate equilibrium concepts for risk-averse agents and develop a PSRO method to find them.
Tweet media one
1
0
4
@McaleerStephen
Stephen McAleer
1 year
MANSA: Learning Fast and Slow in Multi-Agent Systems (Main Conference) TLDR: We introduce a cooperative RL algorithm that selectively employs centralized learning only at states that require coordination.
Tweet media one
1
0
4
@McaleerStephen
Stephen McAleer
1 year
@karpathy Thanks!! I've been inspired to solve it ever since you built it years ago.
0
0
4
@McaleerStephen
Stephen McAleer
1 year
Adapting Robust Reinforcement Learning to Handle Temporally-Coupled Perturbations (AdvML Workshop) (Not on ArXiv yet) TLDR: We introduce a robust RL framework that temporally constrains the adversary and optimize it with PSRO.
0
0
4
@McaleerStephen
Stephen McAleer
7 months
@rm_rafailov Very exciting!
0
0
4
@McaleerStephen
Stephen McAleer
5 years
@alex_peys I'm currently working on general RL algorithms for these types of games such as Blokus, Chinese checkers and multi-player snake. Very interesting because a Nash equilibrium isn't the best strategy!
0
0
4
@McaleerStephen
Stephen McAleer
1 month
@deedydas @maxim_enis You for sure could but yeah maybe it would be more expensive.
0
0
4
@McaleerStephen
Stephen McAleer
1 year
@GlenBerseth Check out our foundation models for decision making workshop!
0
0
4
@McaleerStephen
Stephen McAleer
2 months
@GarrisonLovely No he's right. Take a look at the new tests like the Hungarian math test for example.
2
0
4