Samuel Sokota Profile
Samuel Sokota

@ssokota

658 Followers · 206 Following · 19 Media · 86 Statuses

PhD Student at @CarnegieMellon

Joined December 2015
@ssokota
Samuel Sokota
6 months
SOTA AIs for games like poker & Hanabi rely on search methods that don’t scale to games w/ large amounts of hidden information. In our ICLR paper, we introduce simple search methods that scale to large games & get SOTA for Hanabi w/ 100x less compute. 1/N
5
52
335
@ssokota
Samuel Sokota
4 months
Maximizing mutual information (MI) is a fundamental problem. We’re releasing a package for approximately maximizing MI between any two discrete distributions. This package enables high-throughput perfectly secure steganography with language models. 1/N
3
38
216
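For intuition about how one might approximately maximize MI between two fixed discrete marginals (a sketch in the spirit of minimum-entropy coupling, discussed later in the thread; not the released package's API, and all names are hypothetical):

import math

def greedy_min_entropy_coupling(p, q):
    """Greedy heuristic for a low-entropy coupling of two discrete marginals.

    Repeatedly matches the largest remaining mass of p with the largest
    remaining mass of q. Returns a dict {(i, j): mass}.
    """
    p, q = list(p), list(q)
    coupling = {}
    while True:
        i = max(range(len(p)), key=lambda k: p[k])
        j = max(range(len(q)), key=lambda k: q[k])
        mass = min(p[i], q[j])
        if mass <= 1e-12:
            return coupling
        coupling[(i, j)] = coupling.get((i, j), 0.0) + mass
        p[i] -= mass
        q[j] -= mass

def mutual_information(coupling, p, q):
    """I(X;Y) in nats for the joint given by `coupling`, with marginals p and q."""
    return sum(m * math.log(m / (p[i] * q[j]))
               for (i, j), m in coupling.items() if m > 0)

p, q = [0.5, 0.3, 0.2], [0.6, 0.4]
c = greedy_min_entropy_coupling(p, q)
print(c, mutual_information(c, p, q))

The higher the MI of the coupling, the more reliably one variable can be inferred from the other, which is what the steganography application exploits.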
@ssokota
Samuel Sokota
1 year
A longstanding disappointment in multi-agent research has been that naive self-play RL fails in adversarial games. In our ICLR paper, we present a simple principled approach that enables self-play RL to succeed in such settings. 1/N
3
46
190
@ssokota
Samuel Sokota
1 year
Imperfect information makes RL and search hard. In our ICML paper, we give an equilibrium-preserving reduction from imperfect information adversarial games to fully observable games, opening the door to new RL methods and simpler search techniques. 1/N
4
37
174
@ssokota
Samuel Sokota
1 year
There are two shapes below: one is named “kiki” and one is named “bouba”. Which is which? This is the puzzle we consider in our ICML paper: Learning Intuitive Policies Using Action Features. 1/N ⚫ ✴
Poll: Left bouba; right kiki (199) · Left kiki; right bouba (22)
4
12
42
@ssokota
Samuel Sokota
6 months
Thanks to @MetaAI for supporting this research, which was conducted during an internship supervised by @polynoamial in collaboration with @gabrfarina, @lightvector1, @HengyuanH, @often_wang, and @zicokolter. N/N
1
0
15
@ssokota
Samuel Sokota
1 year
If you’re interested in learning more, check out the paper linked above or our codebases, or reach out to us with questions. N/N
0
0
12
@ssokota
Samuel Sokota
6 months
We also introduce a similar algorithm for adversarial games based on magnetic mirror descent (see thread below) & show it improves performance in games with virtually no public information, where PBS methods are essentially inapplicable. 13/N
[Quoted tweet: the magnetic mirror descent / self-play RL thread above]
1
1
10
@ssokota
Samuel Sokota
6 months
The aforementioned poker & Hanabi AIs do search using objects known as public belief states (PBSs). How do these PBS methods work? And what makes them fundamentally unscalable? 2/N
1
0
10
@ssokota
Samuel Sokota
1 year
1
0
8
@ssokota
Samuel Sokota
6 months
Overall, our results suggest that the update-equivalence framework opens the door to effective search in games with large amounts of hidden information. 14/N
1
0
8
@ssokota
Samuel Sokota
6 months
Using this framework of update equivalence, we introduce a provably sound search algorithm for fully cooperative games based on mirror descent. The algorithm works by estimating Q-values using rollouts and performing a mirror descent update on top of the estimated Q-values. 10/N
1
0
9
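As a rough illustration of that update (a sketch, not the paper's exact procedure; function and variable names are illustrative): with the standard negative-entropy mirror map, a mirror descent step on estimated Q-values tilts the current policy by an exponential of the Q-values.

import numpy as np

def mirror_descent_step(policy, q_estimates, stepsize):
    """One negative-entropy mirror descent update at a single decision point.

    Returns pi'(a) proportional to pi(a) * exp(stepsize * Q(a)), where Q is
    estimated from rollouts.
    """
    logits = np.log(policy) + stepsize * np.asarray(q_estimates)
    logits -= logits.max()  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Hypothetical usage: three actions, Q-values estimated from rollouts.
pi = np.array([1 / 3, 1 / 3, 1 / 3])
q = np.array([0.1, 0.5, 0.2])
print(mirror_descent_step(pi, q, stepsize=1.0))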
@ssokota
Samuel Sokota
6 months
Thus, alternative frameworks are required to scale search to games with large amounts of hidden information. In our ICLR paper, we present such a framework that we call the framework of update equivalence. 7/N
1
0
8
@ssokota
Samuel Sokota
6 months
However, step 1) differs between perfect-info games & imperfect-info games. With perfect info, constructing subgames is straightforward (eg, for Go, start the game from the current board). In contrast, with imperfect info, it’s not obvious how to handle hidden information. 4/N
1
0
7
@ssokota
Samuel Sokota
4 months
This line of work was made possible by algorithms, such as that of @murat_kocaoglu_ et al. [2017], for minimum-entropy coupling, which is equivalent to MI maximization for two marginals, due to the equality below. 8/N
[Image: the equality relating MI maximization to minimum-entropy coupling]
1
1
9
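The equality in question follows from the standard decomposition of mutual information: since the marginal entropies are fixed by the marginals, maximizing MI over couplings is the same as minimizing the joint entropy, i.e., minimum-entropy coupling.

I(X;Y) = H(X) + H(Y) - H(X,Y)
\quad\Longrightarrow\quad
\arg\max_{p_{XY}:\,\text{fixed marginals}} I(X;Y) \;=\; \arg\min_{p_{XY}:\,\text{fixed marginals}} H(X,Y).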
@ssokota
Samuel Sokota
4 months
This is an idea we’ve been developing over a series of works. It originated from a question @j_foerst formulated about whether it’s possible to communicate information through Markov decision process trajectories. 5/N
1
0
7
@ssokota
Samuel Sokota
1 year
In tabular experiments, MMD is competitive with CFR – a widely-used algorithm for game solving – in terms of exploitability (the deficit against a best response). This is the first time a standard RL approach has produced performance competitive with CFR. 8/N
[Image: tabular exploitability results]
1
0
7
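For reference, one common way to write the exploitability of a profile \pi (conventions differ; some definitions average over players):

\mathrm{expl}(\pi) \;=\; \sum_i \Big( \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi_i, \pi_{-i}) \Big),

i.e., the total deficit of \pi relative to best responses; it is zero exactly at a Nash equilibrium.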
@ssokota
Samuel Sokota
1 year
By visualizing MMD’s behavior in the rock-paper-scissors game from before, we can see the qualitative benefit of this stability in practice. 7/N
1
0
6
@ssokota
Samuel Sokota
6 months
We test this algorithm, which we call mirror descent search (MDS), in Hanabi—the most competitive benchmark for search in cooperative games. We show MDS exceeds or matches the performance of PBS methods while using 100x less search time. 11/N
1
0
6
@ssokota
Samuel Sokota
6 months
Because there are only 52 choose 2 hands in Texas Hold’em, the search player can reason about all of these hands. However, as the amount of hidden information grows, this approach quickly becomes infeasible. 6/N
1
0
6
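Concretely, the number of possible private hands in Texas Hold’em is

\binom{52}{2} = \frac{52 \cdot 51}{2} = 1326,

which is small enough for the search player to enumerate exhaustively.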
@ssokota
Samuel Sokota
6 months
PBS methods address this issue by using subgames that include every decision point consistent with public knowledge. In Texas Hold’em, this means that each subgame involves not only the hand the search player was actually dealt, but also every other conceivable hand. 5/N
1
0
5
@ssokota
Samuel Sokota
1 year
With this context in mind, our results make us optimistic about the prospect of self-play RL becoming the go-to approach for large adversarial settings. 12/N
1
0
5
@ssokota
Samuel Sokota
6 months
The rough idea is similar to search for perfect-information games: 1) Construct a subgame. 2) Determine a good policy for the subgame. 3/N
1
0
5
@ssokota
Samuel Sokota
1 year
Because of MMD’s close proximity to PPO, it is easy to implement as a deep RL algorithm by modifying existing codebases. In deep RL experiments, MMD becomes more robust to DQN adversaries over the course of training. 9/N
[Image: robustness to DQN adversaries over the course of training]
2
0
5
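A minimal sketch of how such a modification might look on top of a PPO-style loss (hypothetical names and regularizer placement; not the authors' reference implementation): add a KL penalty pulling the policy toward the magnet alongside the usual clipped surrogate.

import torch

def mmd_style_loss(logp_new, logp_old, advantages, logits_new, magnet_probs,
                   clip_eps=0.2, alpha=0.05):
    """Clipped policy-gradient surrogate plus a KL(pi || magnet) penalty."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    probs_new = torch.softmax(logits_new, dim=-1)
    kl_to_magnet = (probs_new * (probs_new.clamp_min(1e-12).log()
                                 - magnet_probs.clamp_min(1e-12).log())).sum(-1).mean()

    return -surrogate + alpha * kl_to_magnet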
@ssokota
Samuel Sokota
6 months
The core idea is simple: replicate the updates of last-iterate algorithms with desirable properties, such as improvements in expected return, decreases in exploitability, or convergence to an equilibrium. 8/N
1
0
5
@ssokota
Samuel Sokota
1 year
This work was done in collaboration with @ma5_mingwei and Jizhou Liu, who jointly led the project, and @maxhkw and @j_foerst, who jointly supervised it. If you are interested in learning more, check out the paper below or come chat at ICML. N/N
0
1
5
@ssokota
Samuel Sokota
1 year
If you’re interested in learning more about our fix or any of these directions, check out the paper or reach out with questions. 12/N
1
0
5
@ssokota
Samuel Sokota
1 year
While there are other approaches to adversarial games besides self play (e.g., best response oracles, regret minimization, or public belief states), they tend to be difficult to scale or slow to converge and can be complicated to implement. 11/N
1
0
4
@ssokota
Samuel Sokota
6 months
This performance gap widens as the amount of hidden information increases, reflecting MDS’s superior scalability compared to PBS methods. 12/N
1
0
4
@ssokota
Samuel Sokota
4 months
The effectiveness of MI maximization for communication led us to investigate its security guarantees. We proved that, for encrypted secret messages, it is perfectly secure, meaning it is statistically impossible to detect. 6/N
1
0
4
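Perfect security here is presumably the information-theoretic notion (à la Cachin): the distribution of transmitted content when steganography is used is identical to the innocuous covertext distribution, so no statistical test can tell them apart:

D_{\mathrm{KL}}\!\left(P_{\text{covertext}} \,\Vert\, P_{\text{stegotext}}\right) = 0.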
@ssokota
Samuel Sokota
1 year
Our ICLR paper shows there is a simple way around this problem that builds on top of mirror descent. 3/N
2
0
4
@ssokota
Samuel Sokota
1 year
As outlined in the video, magnetic mirror descent (MMD) takes the following form, with the additional regularization term shown in blue. The key to getting convergence is to set the stepsize η to be sufficiently small relative to the temperature α. 5/N
[Image: the MMD update, with the additional regularization term in blue]
1
0
4
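For readers without the image, a reconstruction of the update from the surrounding description (stepsize \eta, temperature \alpha, magnet policy \rho; notation may differ slightly from the paper's):

\pi_{t+1} \;=\; \arg\max_{\pi \in \Delta}\; \eta\,\langle \pi, Q_t\rangle \;-\; \eta\alpha\,\mathrm{KL}(\pi \,\Vert\, \rho) \;-\; \mathrm{KL}(\pi \,\Vert\, \pi_t),

whose closed form is \pi_{t+1}(a) \propto \big[\pi_t(a)\, e^{\eta Q_t(a)}\, \rho(a)^{\eta\alpha}\big]^{1/(1+\eta\alpha)}. The middle term is the pull toward the magnet, and the stated condition is that \eta be sufficiently small relative to \alpha.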
@ssokota
Samuel Sokota
6 months
Since last-iterate algorithms do not typically rely on PBSs, the framework of update equivalence scales naturally to games with large amounts of hidden information. 9/N
1
0
4
@ssokota
Samuel Sokota
1 year
We also find that it reliably outperforms PPO with RLlib’s default hyperparameters, NFSP with OpenSpiel’s default hyperparameters, and simple bots in head-to-head matchups. 10/N
[Image: head-to-head results against PPO, NFSP, and simple bots]
1
0
4
@ssokota
Samuel Sokota
1 year
This transformer takes both features of the state (red rectangular prisms below) and features of a particular action (blue rectangular prism below) as input. A separate forward pass is performed for each action to construct the Q-value or logit vector. 7/N
[Image: state features (red) and action features (blue) fed to the transformer]
1
0
4
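A minimal sketch of that input scheme (hypothetical shapes and names; a small MLP stands in for the transformer to keep the example short): each action gets its own forward pass over the concatenated state and action features, yielding one scalar per action.

import torch
import torch.nn as nn

class ActionFeatureQNet(nn.Module):
    """Scores each action with a separate forward pass over (state, action) features."""

    def __init__(self, state_dim, action_feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_feats, action_feats):
        # state_feats: (state_dim,); action_feats: (num_actions, action_feat_dim)
        num_actions = action_feats.shape[0]
        state_rep = state_feats.unsqueeze(0).expand(num_actions, -1)
        per_action = torch.cat([state_rep, action_feats], dim=-1)
        return self.net(per_action).squeeze(-1)  # (num_actions,) Q-values or logits

# Hypothetical usage: 16 state features, 5 actions with 8 features each.
net = ActionFeatureQNet(state_dim=16, action_feat_dim=8)
q_values = net(torch.randn(16), torch.randn(5, 8))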
@ssokota
Samuel Sokota
4 months
If you’re interested in learning more, check out our package and the papers below, reach out with questions, or come chat at UAI or ICML. N/N
0
1
3
@ssokota
Samuel Sokota
1 year
We prove that there is a simpler solution to this problem, free of these undesirable properties and amenable to model-free RL: entropy regularization. 9/N
1
1
3
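Concretely, against a fixed opponent, the entropy-regularized best response to action values Q with temperature \alpha is the softmax

\pi^{*}(a) \;\propto\; \exp\!\big(Q(a)/\alpha\big),

which is unique because \langle \pi, Q\rangle + \alpha H(\pi) is strictly concave in \pi; this uniqueness is what removes the freedom to “slack off.”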
@ssokota
Samuel Sokota
4 months
Steganography is the practice of encoding secret information within seemingly innocuous content. Like cryptography, it enables sending sensitive information over public channels. But unlike cryptography, it hides the very fact that sensitive communication is taking place. 3/N
1
0
4
@ssokota
Samuel Sokota
1 year
Humans use these and other feature-based associations to flexibly communicate in common sense ways. For example, when someone holds up 2 fingers, it’s meant to indicate the number 2 (rather than some other number). 3/N
1
0
3
@ssokota
Samuel Sokota
1 year
These policies achieve human-level zero-shot coordination with other humans, as shown in the table below. Here, zero-shot means that participating humans had no knowledge about the policies of their current or past partners. 9/N
[Table: zero-shot coordination results with human partners]
1
0
3
@ssokota
Samuel Sokota
1 year
On the other hand, reinforcement learning (RL) agents don’t usually learn these feature correspondences: In a communication game, two RL agents might learn to use 2 fingers to indicate the number 3 (or another arbitrary number). 4/N
1
0
3
@ssokota
Samuel Sokota
1 year
If you picked Option 1, you’re in the majority! This outcome is called the bouba/kiki effect: It illustrates that humans associate features of sounds with features of shapes. 2/N
1
0
3
@ssokota
Samuel Sokota
1 year
We call this resulting approach magnetic mirror descent. The main idea is to add an additional regularization term that pulls the policy toward a second policy called the magnet. 4/N
1
0
3
@ssokota
Samuel Sokota
4 months
These works were co-led by @casdewitt, in collaboration with @dylanjsam, Spencer Compton, @MaxiIgl, @luisa_zintgraf, @philiptorr, @shimon8282, @MasorX, @zicokolter, and @j_foerst. 9/N
1
0
3
@ssokota
Samuel Sokota
1 year
Thanks to @MetaAI for supporting this research, which was conducted during an internship supervised by @polynoamial, in collaboration with @RyanDOrazio, @ChunKaiLing1, @lightvector1, and @zicokolter. N/N
0
0
3
@ssokota
Samuel Sokota
4 months
We can do steganography by maximizing the mutual information between a distribution of innocuous content and a distribution of secret messages. Encoding and decoding are performed using the associated conditional distributions. 4/N
1
0
3
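A toy sketch of that encode/decode scheme (hypothetical names and a tiny hand-written coupling; real systems work over language-model token sequences): encode by sampling from P(covertext | message), decode via P(message | covertext).

import random

# Hypothetical joint coupling over (secret message, covertext); its marginals
# match the secret-message distribution and the innocuous-content distribution.
coupling = {
    ("m0", "hello"): 0.4,
    ("m0", "hi"): 0.1,
    ("m1", "hi"): 0.3,
    ("m1", "hey"): 0.2,
}

def encode(message):
    """Sample a covertext from P(covertext | message) under the coupling."""
    rows = {c: w for (m, c), w in coupling.items() if m == message}
    covertexts, weights = zip(*rows.items())
    return random.choices(covertexts, weights=weights)[0]

def decode(covertext):
    """Return the most likely message under P(message | covertext)."""
    rows = {m: w for (m, c), w in coupling.items() if c == covertext}
    return max(rows, key=rows.get)

cover = encode("m1")
print(cover, decode(cover))  # decoding recovers "m1" for this coupling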
@ssokota
Samuel Sokota
1 year
What is imperfect information? Here, imperfect information means that one player has information that another does not (such as in poker) or that two players act simultaneously (like in rock-paper-scissors). 2/N
1
0
2
@ssokota
Samuel Sokota
1 year
Unfortunately, this insight does not work for adversarial games because a player can safely “slack off” later in the game: by that point, the player’s opponent has already committed to part of its policy. 5/N
1
0
3
@ssokota
Samuel Sokota
1 year
One reason a deep RL agent may lack the inductive bias to learn human-like policies is because its network only takes state features as input. With this architecture, the agent has no information about action features, which are crucial for learning feature correspondences. 5/N
[Image: architecture taking only state features as input]
1
0
3
@ssokota
Samuel Sokota
1 year
In short, entropy regularization fixes the problem because it induces unique best responses. As a result, a player moving later in the game is penalized for “slacking off” even if its opponent has already committed to a policy. 10/N
1
1
2
@ssokota
Samuel Sokota
4 months
Most recently, in a paper to appear at UAI, we extended the approach to unencrypted secret messages, for which it’s possible to achieve high rates of information throughput, as shown in the video at the top of the thread. 7/N
1
0
2
@ssokota
Samuel Sokota
1 year
In our ICML paper, we investigate whether a different architecture induces the right inductive bias for RL agents to learn human-like feature correspondences. We answer in the affirmative for a particular kind of transformer. 6/N
1
0
2
@ssokota
Samuel Sokota
1 year
The intuition is that the regularization toward the magnet stabilizes the learning dynamics. The smaller the strength of the regularization, the smaller the stepsize need be to maintain stability. 6/N
1
0
2
@ssokota
Samuel Sokota
1 year
@faoliehoek @often_wang @aukejw @DiederikRo It’s very related! Roughly speaking, it shows that computing certain regularized equilibria can be reduced to solving for the analogous equilibria of the centralized model that you, @aukejw, and @DiederikRo introduced.
1
0
2
@ssokota
Samuel Sokota
1 year
What does it mean that self-play RL fails? Ideally we want RL to find policies that are hard for opponents to exploit. Unfortunately, as shown in the video above, even in simple games like rock-paper-scissors, RL can diverge away from such policies. 2/N
1
0
2
@ssokota
Samuel Sokota
1 year
This fix opens the door to numerous promising research directions, including: 1. New methods for tabular game solving. 2. BAD-like methods for adversarial settings. 3. Simpler (and perhaps better) search-based AI for games like poker. 11/N
1
0
2
@ssokota
Samuel Sokota
1 year
@bronzeagepapi They have important similarities. R-NaD is based on the algorithm from the paper referenced in tweet 14 (friction FoReL), which was an inspiration for the MMD project. The relationship between friction FoReL and MMD is related to that of FTRL and MD.
0
1
2
@ssokota
Samuel Sokota
1 year
To avoid this issue, poker AIs like DeepStack, Libratus, ReBeL, and Player of Games require complicated machinery with undesirable properties, such as discontinuous functions. 7/N
1
0
2
@ssokota
Samuel Sokota
1 year
The idea of abstracting imperfect information away from games was pioneered by Nayyar et al., who showed that having players announce their policies as they play removes imperfect information from common-payoff games. 3/N
1
0
2
@ssokota
Samuel Sokota
4 months
What is steganography? And how does maximizing mutual information enable it? 2/N
1
0
3
@ssokota
Samuel Sokota
1 year
In coordination tasks, we find that DQN trained in self play with this transformer finds policies that are intuitive to humans. They employ coordination techniques such as feature similarity and implicatures (illustrated below). 8/N
[Image: illustration of feature similarity and implicatures]
1
0
2
@ssokota
Samuel Sokota
1 year
For future work, directions include: 1. Examining whether the architecture learns intuitive policies in more complex settings. 2. Learning (rather than assuming) features that facilitate intuitive policies. 3. Understanding why the architecture induces this inductive bias. 10/N
1
0
2
@ssokota
Samuel Sokota
1 year
Furthermore, this additional machinery requires search, precluding the application of model-free RL. 8/N
1
0
1
@ssokota
Samuel Sokota
6 months
@template_rex @polynoamial @metaai @gabrfarina @lightvector1 @HengyuanH @often_wang @zicokolter MCS is multi-ply in that its rollouts can look multiple moves ahead, but it's 1-ply in the sense that it doesn't update its policy at future decision points during search.
1
0
1
@ssokota
Samuel Sokota
6 months
@template_rex @polynoamial @metaai @gabrfarina @lightvector1 @HengyuanH @often_wang @zicokolter Replacing CFR w/ MMD in SoG would work, but it still wouldn't scale to games w/ lots of hidden info b/c of the PBS subgames. Pursuing search algos that work like PBS methods when there's little hidden info, but also scale to lots of hidden info, is a cool direction, though.
0
0
1
@ssokota
Samuel Sokota
1 year
To illustrate, consider the rock-paper-scissors game below. If Blue announces that it is playing uniform (the Nash equilibrium), then any policy is a best response for Red. As a result, Red can “slack off” and select a bad policy without any risk. 6/N
[Image: rock-paper-scissors game]
1
0
1
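A quick check of that claim, assuming standard rock-paper-scissors payoffs: against a uniform opponent, every action of Red has expected payoff 0, so every policy of Red is a best response.

# Red's payoffs; rows are Red's action, columns are Blue's (rock, paper, scissors).
payoff_red = [
    [0, -1, 1],   # rock ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]
blue_uniform = [1 / 3, 1 / 3, 1 / 3]

expected = [sum(p * b for p, b in zip(row, blue_uniform)) for row in payoff_red]
print(expected)  # [0.0, 0.0, 0.0] -> any mixture of Red's actions achieves value 0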