Stephen McAleer

@McaleerStephen

4,165 Followers · 856 Following · 69 Media · 572 Statuses

Researching agent safety at OpenAI

San Francisco
Joined July 2014
Pinned Tweet
@McaleerStephen
Stephen McAleer
8 months
"Toward General Virtual Agents" I recently gave a talk at MIT. I argued that we should use tools from reinforcement learning and search to improve the capability and alignment of LLM agents. Slides: Video:
5
78
504
@McaleerStephen
Stephen McAleer
8 months
We invented Q* first. Glad OpenAI is building on top of our idea
Tweet media one
61
292
3K
@McaleerStephen
Stephen McAleer
2 months
I joined OpenAI where I'm researching agent safety! As we approach AGI, ensuring that powerful agents are aligned and safe will become the key bottleneck to making them useful. Our team is hiring, consider applying!
39
21
539
@McaleerStephen
Stephen McAleer
1 year
Forget plugins. ChatGPT can solve general computer tasks using a keyboard and mouse!! The trick? Recursively criticizing and improving the output (RCI). We also find that RCI prompting outperforms CoT prompting on reasoning tasks. Paper, website, and GitHub in the 🧵below.
13
75
431
@McaleerStephen
Stephen McAleer
8 months
Guys, this was just a joke. I don't know what OpenAI Q* is 😂
13
5
273
@McaleerStephen
Stephen McAleer
7 months
This is an interesting paper that learns a process reward model without human annotations. The idea is to evaluate the accuracy of full reasoning traces generated from a given partial reasoning step. Nice to see Llemma-34B getting 47.3% on MATH!
Tweet media one
4
25
185
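For intuition, here is a minimal sketch of the rollout-based step scoring the tweet describes: a partial reasoning prefix is scored by the fraction of sampled completions that reach a correct final answer. `sample_completions` and `is_correct` are hypothetical stand-ins for an LLM sampler and an answer checker, not functions from the paper.

```python
# Minimal sketch of rollout-based step labeling: score a partial reasoning
# prefix by the fraction of sampled completions that reach a correct answer.
# `sample_completions` and `is_correct` are hypothetical helpers.
from typing import Callable, List


def step_score(
    prefix: str,
    sample_completions: Callable[[str, int], List[str]],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 16,
) -> float:
    """Estimate the value of a partial reasoning step via Monte Carlo rollouts."""
    completions = sample_completions(prefix, num_rollouts)
    return sum(is_correct(c) for c in completions) / num_rollouts
```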
@McaleerStephen
Stephen McAleer
7 months
Why do RLHF at all? If you had enough SFT data could you get the same performance? Is preference data just easier to collect? There is less information in preference data--e.g. you could infer preferences from SFT data using inverse RL. Would love references on this!
22
7
142
@McaleerStephen
Stephen McAleer
9 months
AI Alignment: A Comprehensive Survey We break AI alignment into four categories: 1. Learning from feedback (e.g. RLHF) 2. Learning under distribution shift 3. Assurance (e.g. interpretability) 4. Governance Reply with any references we missed!
Tweet media one
3
24
135
@McaleerStephen
Stephen McAleer
11 months
I'm co-teaching a course this semester at CMU on computational game solving. The first half focuses on foundations and the second half covers state-of-the-art approaches for large games such as Stratego, Diplomacy, etc.
2
23
121
@McaleerStephen
Stephen McAleer
11 months
I'm excited to be co-organizing the Foundation Models for Decision Making workshop at NeurIPS! Please consider submitting your recent work on LLM agents, foundation models + RL, RLHF, LLMs + search, etc. The deadline is in one month on October 1st.
Tweet media one
3
16
96
@McaleerStephen
Stephen McAleer
7 months
Some remaining questions after NeurIPS: 1. Will training larger foundation models on the internet lead to advanced agents or do we need agent data? 2. How do we align foundation model agents? 3. Will RLAIF work for agents? 4. Is RL dead or just getting started? 5. Reasoning??
15
12
89
@McaleerStephen
Stephen McAleer
1 year
Excited to share our new preprint on using ensembles for cooperative MARL exploration! The main idea is that agents should keep exploring states that have high uncertainty, because these are the ones that usually require coordination.
Tweet media one
3
12
88
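A hedged sketch of the ensemble-disagreement idea from the tweet: give agents an exploration bonus in states where an ensemble of value networks disagrees. The network shapes and the choice of standard deviation as the bonus are illustrative assumptions, not the paper's exact method.

```python
# Ensemble-disagreement exploration bonus: reward states where an ensemble of
# value networks disagrees, since (per the tweet) these tend to be the states
# that require coordination. Architecture details are illustrative.
import torch
import torch.nn as nn


class ValueEnsemble(nn.Module):
    def __init__(self, obs_dim: int, num_members: int = 5, hidden: int = 64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_members)
        )

    def exploration_bonus(self, obs: torch.Tensor) -> torch.Tensor:
        # Stack per-member value estimates and use their std as the bonus.
        values = torch.stack([m(obs).squeeze(-1) for m in self.members], dim=0)
        return values.std(dim=0)
```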
@McaleerStephen
Stephen McAleer
1 year
I'll be attending #ICML2023 next week in Hawaii! I'll be presenting three main conference papers and three workshop papers (🧵below) I will be going on the job market in the fall. Please reach out if you think I will be a good fit!
Tweet media one
2
5
71
@McaleerStephen
Stephen McAleer
8 months
Just wrapped up a great semester teaching computational game solving. Check out all the lectures here:
@McaleerStephen
Stephen McAleer
11 months
I'm co-teaching a course this semester at CMU on computational game solving. The first half focuses on foundations and the second half covers state-of-the-art approaches for large games such as Stratego, Diplomacy, etc.
2
23
121
1
10
72
@McaleerStephen
Stephen McAleer
2 years
We introduce a simple ensemble method to reduce variance in value estimation. MeanQ maintains an ensemble of k Q networks and then takes the max of the average as a target value in the standard TD loss. To appear at #ICML2022 ! (link: )
Tweet media one
4
12
62
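For reference, a minimal sketch of the MeanQ target described above: average Q-values across the ensemble of k target networks, then take the max over actions inside the standard TD target. Function and variable names are illustrative.

```python
# MeanQ-style TD target (sketch): mean over an ensemble of k target Q-networks,
# then max over actions. Names are illustrative, not the paper's code.
import torch


def meanq_target(q_ensemble, next_obs, rewards, dones, gamma=0.99):
    """q_ensemble: list of k target Q-networks mapping obs -> (batch, num_actions)."""
    with torch.no_grad():
        # Mean over the ensemble, then max over actions.
        mean_q = torch.stack([q(next_obs) for q in q_ensemble], dim=0).mean(dim=0)
        max_mean_q, _ = mean_q.max(dim=-1)
        return rewards + gamma * (1.0 - dones) * max_mean_q
```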
@McaleerStephen
Stephen McAleer
6 months
Excited that our paper "Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision Makers" got accepted as a spotlight in ICLR! We show that existing attacks are highly detectable and introduce new undetectable attacks.
Tweet media one
0
5
56
@McaleerStephen
Stephen McAleer
5 months
After using Claude 3 Opus for a week I get annoyed whenever I go back to GPT-4. Seems to be all around better. Excited to try out GPT-4.5 soon!
2
3
54
@McaleerStephen
Stephen McAleer
8 months
Our recent paper shows how computing optimal mechanisms is as easy as solving a two-player zero-sum game. Now we can use game-theoretic reinforcement learning to scalably design optimal auctions! To appear at NeurIPS. Paper link:
Tweet media one
0
4
49
@McaleerStephen
Stephen McAleer
7 months
@_aidan_clark_ The master equation
Tweet media one
4
1
50
@McaleerStephen
Stephen McAleer
8 months
Interested in game-theoretic RL for team games? Check out our Team PSRO paper, to appear in NeurIPS.
Tweet media one
2
6
47
@McaleerStephen
Stephen McAleer
5 years
Our paper "Solving the Rubik's Cube with Deep Reinforcement Learning and Search" has been published in Nature Machine Intelligence. You can check it out here:
4
8
45
@McaleerStephen
Stephen McAleer
1 year
This is joint work with my great co-authors Geunwoo Kim and Pierre Baldi. Please reach out if you have any questions or feedback! 📄 Paper: 🌐 Website: 🐙 GitHub:
1
3
43
@McaleerStephen
Stephen McAleer
5 years
@LesHorn I took an online walking class. Every week you had to take a ten-minute walk and then write a 200-word discussion post about your walk. The final was a 30-minute walk.
2
1
39
@McaleerStephen
Stephen McAleer
8 months
@bindureddy Actually the opposite, anyone can go and curate the same dataset that OpenAI did to train GPT-4. Now with synthetic data, compute becomes an even bigger barrier between the GPU rich and poor.
2
0
38
@McaleerStephen
Stephen McAleer
2 years
Very excited to share our work on RL for expert-level Stratego! It was a pleasure to work with Julien Perolat, Bart De Vylder, Karl Tuyls, and many other great researchers on this project.
@GoogleDeepMind
Google DeepMind
2 years
DeepNash is an agent trained with model-free multiagent reinforcement learning that learns to play the game of Stratego at expert level. Learn more: 1/
Tweet media one
12
222
1K
0
2
34
@McaleerStephen
Stephen McAleer
6 years
We create an algorithm which is able to teach itself how to solve the Rubik's Cube
@Miles_Brundage
Miles Brundage
6 years
"Solving the Rubik's Cube Without Human Knowledge," McAleer, Agostinelli, and Shmakov et al.:
0
5
21
2
10
34
@McaleerStephen
Stephen McAleer
2 years
Policy Space Response Oracles (PSRO) mixes over a population of deep RL policies to approximate a Nash equilibrium, but exploitability can increase from one iteration to the next. We introduce Anytime PSRO which does not increase exploitability. Arxiv:
Tweet media one
1
8
33
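For context, a high-level sketch of the standard PSRO loop the tweet builds on; `empirical_payoffs`, `solve_meta_game`, and `train_best_response` are hypothetical stand-ins for game simulation, a meta-solver, and an RL best-response oracle. Anytime PSRO changes how the restricted distribution is updated so that exploitability does not increase between iterations; that modification is not shown here.

```python
# Generic PSRO loop (sketch): grow a population by repeatedly solving the
# restricted meta-game and training a best response to the resulting mixture.
def psro(initial_policy, num_iterations, empirical_payoffs, solve_meta_game, train_best_response):
    population = [initial_policy]
    for _ in range(num_iterations):
        payoff_matrix = empirical_payoffs(population)       # simulate all matchups
        meta_strategy = solve_meta_game(payoff_matrix)       # e.g. Nash of the restricted game
        best_response = train_best_response(population, meta_strategy)  # RL oracle
        population.append(best_response)
    final_meta_strategy = solve_meta_game(empirical_payoffs(population))
    return population, final_meta_strategy
```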
@McaleerStephen
Stephen McAleer
8 months
Reading the Reuters story it sounds like OpenAI's Q* may have aced the MATH benchmark?? Acing GSM8K wouldn't be impressive enough. Who knows though...
3
2
32
@McaleerStephen
Stephen McAleer
8 months
Come by our panel at the Foundation Models for Decision Making workshop at 3:00 today in Hall E2!
Tweet media one
0
4
32
@McaleerStephen
Stephen McAleer
4 years
Our paper Optimizing Multiagent Cooperation via Policy Evolution and Shared Experiences got accepted to #icml2020 ! Joint work with Shauharda Khadka, Somdeb Majumdar, Santiago Miret, and Kagan Tumer @IntelAI
Tweet media one
4
1
31
@McaleerStephen
Stephen McAleer
2 years
Accepted at #ICLR2023 !
@McaleerStephen
Stephen McAleer
2 years
How can we find approximate equilibria in games with long horizons? We show that by directly estimating regret with a value function, ESCHER can approximate MCCFR with NNs without using importance sampling! Paper: Code:
Tweet media one
2
4
19
2
0
31
@McaleerStephen
Stephen McAleer
8 months
Excited to be going to #NeurIPS2023 Let me know if you want to grab coffee and speculate about Q*
Tweet media one
3
1
30
@McaleerStephen
Stephen McAleer
4 years
@roydanroy JMLR, AAMAS, COLT, UAI, AISTATS, IJCAI, AAAI, Nature MI
0
0
30
@McaleerStephen
Stephen McAleer
10 months
Don't (over)optimize reward models in RLHF—use them as constraints!! Our method uses constrained RL to minimize the KL to the reference policy while maintaining reward model constraints. Now you can optimize this objective as much as you want!
Tweet media one
@ted_moskovitz
Ted Moskovitz
10 months
Worried your LLM will produce too many paperclips? Simply tell it when to stop – excited to share our new preprint, where we introduce an approach based on constrained RL to avoid overoptimization for compound reward models: 1/
Tweet media one
1
21
120
0
3
26
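One plausible way to write the constrained objective described in the tweet, with the KL to the reference policy minimized subject to reward-model constraints; the thresholds τ_i are assumptions for illustration, not the paper's exact formulation.

```latex
\min_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_i(x, y) \right] \ge \tau_i
\quad \text{for each reward model } r_i .
```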
@McaleerStephen
Stephen McAleer
3 years
Interested in using PSRO to solve a large imperfect-information extensive-form game but worried because PSRO is a normal form algorithm? This paper is for you! We introduce an extensive-form double oracle algorithm (XDO) and scale it up with deep RL (NXDO)
Tweet media one
1
6
25
@McaleerStephen
Stephen McAleer
1 year
RCI prompting first asks the LLM to find problems with the original answer. Then, based on those problems, the LLM can improve the original answer.
Tweet media one
1
0
25
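A minimal sketch of that critique-and-improve loop, assuming a hypothetical text-in, text-out `llm` function; the prompts are illustrative rather than the paper's exact wording.

```python
# Recursive critique-and-improve (sketch): ask the model for problems with its
# answer, then ask it to rewrite the answer using that critique.
from typing import Callable


def rci(llm: Callable[[str], str], task: str, num_rounds: int = 2) -> str:
    answer = llm(task)
    for _ in range(num_rounds):
        critique = llm(
            f"{task}\n\nAnswer: {answer}\n\nReview the answer above and list any problems with it."
        )
        answer = llm(
            f"{task}\n\nAnswer: {answer}\n\nProblems: {critique}\n\n"
            "Based on these problems, write an improved answer."
        )
    return answer
```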
@McaleerStephen
Stephen McAleer
4 years
To appear in #NeurIPS2020
@McaleerStephen
Stephen McAleer
4 years
We introduce Pipeline PSRO, the first scalable, general method for finding approximate Nash equilibria in large games. We demonstrate state-of-the-art performance on Barrage Stratego, a board game much larger than Texas Hold 'Em.
Tweet media one
Tweet media two
1
5
21
0
1
25
@McaleerStephen
Stephen McAleer
4 months
@demishassabis @julesgambit It could happen if you assume that Garry always plays the same move after a given sequence of moves.
4
0
23
@McaleerStephen
Stephen McAleer
4 years
We introduce Pipeline PSRO, the first scalable, general method for finding approximate Nash equilibria in large games. We demonstrate state-of-the-art performance on Barrage Stratego, a board game much larger than Texas Hold 'Em.
Tweet media one
Tweet media two
1
5
21
@McaleerStephen
Stephen McAleer
8 months
I miss the good old days when people would publish their results!
0
2
22
@McaleerStephen
Stephen McAleer
5 years
Excited to be at #iclr2019 to present our paper on solving the Rubik's cube with reinforcement learning.
1
2
19
@McaleerStephen
Stephen McAleer
2 years
How can we find approximate equilibria in games with long horizons? We show that by directly estimating regret with a value function, ESCHER can approximate MCCFR with NNs without using importance sampling! Paper: Code:
Tweet media one
2
4
19
@McaleerStephen
Stephen McAleer
8 months
@finbarrtimbers I'm going to stop publishing my work and just leak it to the press to generate more hype
1
0
16
@McaleerStephen
Stephen McAleer
9 months
Got access to the base GPT-4. Here's the TikZ unicorn it drew. Not bad!
Tweet media one
1
0
15
@McaleerStephen
Stephen McAleer
3 years
Excited to share that my recent work on meta-learning Nash solvers will be published in #NeurIPS2021 ! LMAC is able to train on Kuhn poker and generalize to outperform PSRO on Leduc poker. Paper link here:
Tweet media one
3
2
14
@McaleerStephen
Stephen McAleer
8 months
Diversity measures in PSRO seek to find a diverse population of strategies to solve a game. But existing diversity methods may not actually decrease exploitability. We fix this with our method PSD-PSRO. Check it out today at #NeurIPS23 in the poster session this morning.
Tweet media one
0
1
15
@McaleerStephen
Stephen McAleer
5 months
guess the paper
Tweet media one
5
1
14
@McaleerStephen
Stephen McAleer
6 years
Cool article about our work on the Rubik's cube!
@techreview
MIT Technology Review
6 years
A machine has figured out Rubik’s Cube all by itself
5
114
229
0
7
13
@McaleerStephen
Stephen McAleer
5 months
@MinqiJiang introduced me to the concept of the spike. It's not cool to have a highly cited paper because maybe someone else would have written the same one. What's cool is to have a spike in citations which means you're way ahead of the curve.
0
0
13
@McaleerStephen
Stephen McAleer
1 year
For computer tasks, we apply RCI prompting in three stages. First, the LLM generates a high-level plan. Next, based on the plan and the current state, the LLM generates an action. Finally, the LLM formats the action into the correct keyboard or mouse action.
Tweet media one
1
1
10
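A hedged sketch of that three-stage decomposition, again assuming a hypothetical `llm` completion function and illustrative prompts rather than the paper's exact ones.

```python
# Three-stage RCI for computer tasks (sketch): plan, ground the next action in
# the current state, then format it as a keyboard/mouse command.
def next_computer_action(llm, task: str, state: str) -> str:
    plan = llm(f"Task: {task}\nWrite a high-level plan to complete the task.")
    action = llm(
        f"Task: {task}\nPlan: {plan}\nCurrent state: {state}\nWhat is the next action?"
    )
    return llm(f"Action: {action}\nRewrite this as a single keyboard or mouse command.")
```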
@McaleerStephen
Stephen McAleer
8 months
I'll be presenting this at #NeurIPS2023 on Wednesday, poster session 3, no. 324
@McaleerStephen
Stephen McAleer
1 year
Forget plugins. ChatGPT can solve general computer tasks using a keyboard and mouse!! The trick? Recursively criticizing and improving the output (RCI). We also find that RCI prompting outperforms CoT prompting on reasoning tasks. Paper, website, and GitHub in the 🧵below.
13
75
431
1
0
12
@McaleerStephen
Stephen McAleer
4 years
Proud to join the peaceful #BlackLivesMatter protest in San Clemente today
0
0
11
@McaleerStephen
Stephen McAleer
8 months
@vin_sachi My best guess is they did something like ramping up the verifier from "Let's Verify Step by Step" and putting that in a search algorithm as a heuristic. Maybe with an iterative AlphaZero-style self-improvement component w/ ground-truth data.
3
1
11
@McaleerStephen
Stephen McAleer
5 years
This is what happens when your objective is to maximize users' screen time. YouTube needs to drastically change its recommendation system.
@Max_Fisher
Max Fisher
5 years
Now live: Our monthslong project on YouTube radicalization. As YouTube diverts more and more users down far-right rabbitholes, could its algorithm, in a way, radicalize an entire society? To find out, we went to YouTube's 2nd-largest market: Brazil.
187
7K
11K
1
2
10
@McaleerStephen
Stephen McAleer
3 years
We often choose to delegate our decisions to algorithms. What should a central mediator do when multiple people choose to delegate their actions to the same mediator? In our recent paper we propose a mediator which Pareto-improves delegating agents.
1
2
10
@McaleerStephen
Stephen McAleer
3 years
To appear in #NeurIPS2021 !
@McaleerStephen
Stephen McAleer
3 years
Interested in using PSRO to solve a large imperfect-information extensive-form game but worried because PSRO is a normal form algorithm? This paper is for you! We introduce an extensive-form double oracle algorithm (XDO) and scale it up with deep RL (NXDO)
Tweet media one
1
6
25
0
0
9
@McaleerStephen
Stephen McAleer
8 months
@DrJimFan @RichardSSutton To scalably collect synthetic data we need agents interacting in (virtual) environments.
3
0
11
@McaleerStephen
Stephen McAleer
2 years
How can we use machine learning to prove theorems? In our paper at #ICML2022 we show that a transformer network trained with HER and incremental learning learns heuristics that achieve SOTA performance vs. existing ML approaches! paper:
Tweet media one
Tweet media two
Tweet media three
1
2
9
@McaleerStephen
Stephen McAleer
8 months
Insane how, if we just solved all of math via automated theorem proving, the best way to monetize it would be via software verification
3
0
10
@McaleerStephen
Stephen McAleer
7 months
Would love to see this idea combined with tree search in an expert iteration / AlphaZero setup.
0
0
10
@McaleerStephen
Stephen McAleer
8 months
Acing MATH would be extremely shocking but not completely outside the realm of possibilities.
2
0
10
@McaleerStephen
Stephen McAleer
1 year
What's remarkable about this approach is that it only requires a handful of demonstrations per task, as opposed to existing methods which require thousands of demonstrations per task.
Tweet media one
2
1
9
@McaleerStephen
Stephen McAleer
9 months
Tweet media one
1
0
9
@McaleerStephen
Stephen McAleer
10 months
It was great helping out with this project! I'm super excited about LLMs for math.
@zhangir_azerbay
Zhangir Azerbayev
10 months
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: Paper: More ⬇️
Tweet media one
11
127
549
0
0
9
@McaleerStephen
Stephen McAleer
5 years
This is a great article about the problems with the attention economy: "What Is the Price of Our Attention?" by Quentin Le Garrec
0
1
7
@McaleerStephen
Stephen McAleer
1 year
Not only does RCI prompting improve upon CoT prompting, but we also find that CoT + RCI performs the best out of them all!
Tweet media one
2
2
8
@McaleerStephen
Stephen McAleer
4 years
We use evolutionary algorithms combined with policy gradients to learn from both global and local rewards in multiagent environments.
1
0
7
@McaleerStephen
Stephen McAleer
1 year
Great work by @casdewitt @ssokota @zicokolter @j_foerst and Martin Strohmeier!
@QuantaMagazine
Quanta Magazine
1 year
“As we increasingly become a society where it’s very common to interface with AI models, there are increasingly many opportunities to encode secret information in media that people use all the time.” —Samuel Sokota, a computer scientist at Carnegie Mellon
1
26
67
0
2
7
@McaleerStephen
Stephen McAleer
3 months
I'm in Vienna for #ICLR2024 . Reach out if you want to talk about alignment, agents, AI safety, or anything else! Thread of papers below 👇
Tweet media one
1
0
7
@McaleerStephen
Stephen McAleer
8 months
What is the most promising approach to improving RLAIF besides training a bigger model or doing better prompt engineering? If you finetune the model that's giving AI feedback then it seems like you're back to RLHF.
2
0
6
@McaleerStephen
Stephen McAleer
5 months
Most proprietary data sources probably aren't that valuable unless they are very large, high quality, and hard to find on the internet. For example, a huge dataset of Excel files would be valuable but call transcripts probably wouldn't be.
1
0
6
@McaleerStephen
Stephen McAleer
8 months
@xuanalogue Thanks! We didn't explore that in this paper, but if you were able to train a Q function for dynamic action spaces then you could put it inside A*.
0
0
7
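As a rough illustration of the reply, here is a generic A* sketch where the heuristic slot could be filled by a learned cost-to-go (or Q-derived) estimate; `neighbors` and `heuristic` are hypothetical stand-ins, not code from the paper.

```python
# A* with a pluggable (possibly learned) heuristic. `neighbors(state)` yields
# (next_state, step_cost) pairs; `heuristic(state)` estimates cost-to-go.
import heapq


def astar(start, is_goal, neighbors, heuristic):
    frontier = [(heuristic(start), 0.0, 0, start)]  # (f, g, tiebreak, state)
    best_g = {start: 0.0}
    counter = 0
    while frontier:
        _, g, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return g  # cost of the path found
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                counter += 1
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, counter, nxt))
    return None  # goal unreachable
```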
@McaleerStephen
Stephen McAleer
8 months
Let's see if this drive link works:
0
1
6
@McaleerStephen
Stephen McAleer
10 months
This talk by Jack Rae @drjwrae is a great explanation of the connection between lossless compression and predictive modeling in LLMs. We shouldn't care about lossy compression of our data, but rather we expect that better lossless compressors will generalize better. Link 👇
2
0
5
@McaleerStephen
Stephen McAleer
1 year
@qntm This has been done "The Show About the Show"
1
0
6
@McaleerStephen
Stephen McAleer
3 months
Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
Fri 4:30-6:30 p.m. CEST, Hall B #144
Tweet media one
1
1
6
@McaleerStephen
Stephen McAleer
3 months
What synthetic data techniques were likely used in Llama 3?
1
0
6
@McaleerStephen
Stephen McAleer
7 months
@CalvinMccarter Thanks, I forgot about this, will give it another read. I still suspect you could fix these problems with more advanced imitation learning on SFT data instead of just behavior cloning.
1
0
6
@McaleerStephen
Stephen McAleer
8 months
@nathanwchan What did he mean by this
Tweet media one
3
0
6
@McaleerStephen
Stephen McAleer
4 years
Cool paper from @rythei et al.! They use statistical mechanics to characterize the performance of typical linear classifiers instead of the commonly used worst-case uniform convergence analysis. They find that good linear classifiers are surprisingly abundant even though bad ones still exist.
@StatMLPapers
Stat.ML Papers
4 years
Good linear classifiers are abundant in the interpolating regime. (arXiv:2006.12625v1 [])
0
0
9
0
1
6
@McaleerStephen
Stephen McAleer
1 year
@aidangomezzz I agree that we should keep pushing capabilities research but at the same time we need to research AI safety. What happens when an agent can go off and make money on its own by hiring workers/writing code etc.? It could deceive/manipulate people and accumulate power.
0
0
6
@McaleerStephen
Stephen McAleer
4 months
By 2030 anyone will be able to train their own GPT-6
@fiiiiiist
Tim Fist
4 months
This chart from @paul_scharre is an underrated point about AI proliferation: training at the frontier gets expensive (line going up), but at any fixed capability level gets cheap (lines going down) due to better software+hardware (assumes historical scaling rates holds)
Tweet media one
5
12
91
0
2
6
@McaleerStephen
Stephen McAleer
7 months
I'd love to see an experiment doing RLHF without SFT first and comparing to SFT or just doing more SFT instead of RLHF. Has this been done?
3
1
5
@McaleerStephen
Stephen McAleer
7 years
Difficult to overstate how reckless and out of touch this view is
@TheAtlantic
The Atlantic
7 years
Steve Mnuchin is 'not worried at all' about machines displacing American workers, @gillianbwhite reports
Tweet media one
14
13
22
0
0
5
@McaleerStephen
Stephen McAleer
5 months
@sea_snell Things requiring planning, like playing tic-tac-toe, writing a sentence with a certain number of words, etc.
2
0
5
@McaleerStephen
Stephen McAleer
1 year
#Bard seems pretty bad at logic. #ChatGPT for comparison.
Tweet media one
Tweet media two
4
0
4
@McaleerStephen
Stephen McAleer
2 years
The main idea is to update the restricted distribution via a no-regret algorithm while the opponent best response is training against it. As a result, the restricted distribution will approximate the least-exploitable distribution, and not increase exploitability.
Tweet media one
Tweet media two
1
0
4
@McaleerStephen
Stephen McAleer
1 year
Language Models can Solve Computer Tasks (AI & HCI Workshop) TLDR: We show that ChatGPT can achieve SoTA on MiniWoB using a technique that recursively criticizes and improves its output.
Tweet media one
1
0
5
@McaleerStephen
Stephen McAleer
1 year
A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems (Main Conference) TLDR: We investigate equilibrium concepts for risk-averse agents and develop a PSRO method to find them.
Tweet media one
1
0
4
@McaleerStephen
Stephen McAleer
1 year
MANSA: Learning Fast and Slow in Multi-Agent Systems (Main Conference) TLDR: We introduce a cooperative RL algorithm that selectively employs centralized learning only at states that require coordination.
Tweet media one
1
0
4
@McaleerStephen
Stephen McAleer
1 year
@karpathy Thanks!! I've been inspired to solve it ever since you built it years ago.
0
0
4
@McaleerStephen
Stephen McAleer
1 year
Adapting Robust Reinforcement Learning to Handle Temporally-Coupled Perturbations (AdvML Workshop) (Not on ArXiv yet) TLDR: We introduce a robust RL framework that temporally constrains the adversary and optimize it with PSRO.
0
0
4
@McaleerStephen
Stephen McAleer
7 months
@rm_rafailov Very exciting!
0
0
4
@McaleerStephen
Stephen McAleer
5 years
@alex_peys I'm currently working on general RL algorithms for these types of games such as Blokus, Chinese checkers and multi-player snake. Very interesting because a Nash equilibrium isn't the best strategy!
0
0
4
@McaleerStephen
Stephen McAleer
1 month
@deedydas @maxim_enis You for sure could but yeah maybe it would be more expensive.
0
0
4
@McaleerStephen
Stephen McAleer
1 year
@GlenBerseth Check out our foundation models for decision making workshop!
0
0
4
@McaleerStephen
Stephen McAleer
2 months
@GarrisonLovely No he's right. Take a look at the new tests like the Hungarian math test for example.
2
0
4