Samuel Sokota Profile
Samuel Sokota

@ssokota

658 Followers · 206 Following · 19 Media · 86 Statuses

PhD Student at @CarnegieMellon

Joined December 2015
@ssokota
Samuel Sokota
6 months
SOTA AIs for games like poker & Hanabi rely on search methods that don’t scale to games w/ large amounts of hidden information. In our ICLR paper, we introduce simple search methods that scale to large games & get SOTA for Hanabi w/ 100x less compute. 1/N
5
52
335
@ssokota
Samuel Sokota
4 months
Maximizing mutual information (MI) is a fundamental problem. We’re releasing a package for approximately maximizing MI between any two discrete distributions. This package enables high-throughput perfectly secure steganography with language models. 1/N
3
38
216
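For intuition about how one might approximately maximize MI between two fixed discrete marginals (a sketch in the spirit of minimum-entropy coupling, discussed later in the thread; not the released package's API, and all names are hypothetical):

import math

def greedy_min_entropy_coupling(p, q):
    """Greedy heuristic for a low-entropy coupling of two discrete marginals.

    Repeatedly matches the largest remaining mass of p with the largest
    remaining mass of q. Returns a dict {(i, j): mass}.
    """
    p, q = list(p), list(q)
    coupling = {}
    while True:
        i = max(range(len(p)), key=lambda k: p[k])
        j = max(range(len(q)), key=lambda k: q[k])
        mass = min(p[i], q[j])
        if mass <= 1e-12:
            return coupling
        coupling[(i, j)] = coupling.get((i, j), 0.0) + mass
        p[i] -= mass
        q[j] -= mass

def mutual_information(coupling, p, q):
    """I(X;Y) in nats for the joint given by `coupling`, with marginals p and q."""
    return sum(m * math.log(m / (p[i] * q[j]))
               for (i, j), m in coupling.items() if m > 0)

p, q = [0.5, 0.3, 0.2], [0.6, 0.4]
c = greedy_min_entropy_coupling(p, q)
print(c, mutual_information(c, p, q))

The higher the MI of the coupling, the more reliably one variable can be inferred from the other, which is what the steganography application exploits.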
@ssokota
Samuel Sokota
1 year
A longstanding disappointment in multi-agent research has been that naive self-play RL fails in adversarial games. In our ICLR paper, we present a simple principled approach that enables self-play RL to succeed in such settings. 1/N
3
46
190
@ssokota
Samuel Sokota
1 year
Imperfect information makes RL and search hard. In our ICML paper, we give an equilibrium-preserving reduction from imperfect information adversarial games to fully observable games, opening the door to new RL methods and simpler search techniques. 1/N
4
37
174
@ssokota
Samuel Sokota
1 year
There are two shapes below: one is named “kiki” and one is named “bouba”. Which is which? This is the puzzle we consider in our ICML paper: Learning Intuitive Policies Using Action Features. 1/N ⚫ ✴
Poll: Left bouba; right kiki (199) · Left kiki; right bouba (22)
4
12
42
@ssokota
Samuel Sokota
6 months
Thanks to @MetaAI for supporting this research, which was conducted during an internship supervised by @polynoamial in collaboration with @gabrfarina, @lightvector1, @HengyuanH, @often_wang, and @zicokolter. N/N
1
0
15
@ssokota
Samuel Sokota
1 year
If you’re interested in learning more, check out the paper linked above or our codebases, or reach out to us with questions. N/N
0
0
12
@ssokota
Samuel Sokota
6 months
We also introduce a similar algorithm for adversarial games based on magnetic mirror descent (see thread below) & show it improves performance in games with virtually no public information, where PBS methods are essentially inapplicable. 13/N
[Quoted tweet: the magnetic mirror descent / self-play RL thread above]
1
1
10
@ssokota
Samuel Sokota
6 months
The aforementioned poker & Hanabi AIs do search using objects known as public belief states (PBSs). How do these PBS methods work? And what makes them fundamentally unscalable? 2/N
1
0
10
@ssokota
Samuel Sokota
1 year
1
0
8
@ssokota
Samuel Sokota
6 months
Overall, our results suggest that the update-equivalence framework opens the door to effective search in games with large amounts of hidden information. 14/N
1
0
8
@ssokota
Samuel Sokota
6 months
Using this framework of update equivalence, we introduce a provably sound search algorithm for fully cooperative games based on mirror descent. The algorithm works by estimating Q-values using rollouts and performing a mirror descent update on top of the estimated Q-values. 10/N
1
0
9
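As a rough illustration of that update (a sketch, not the paper's exact procedure; function and variable names are illustrative): with the standard negative-entropy mirror map, a mirror descent step on estimated Q-values tilts the current policy by an exponential of the Q-values.

import numpy as np

def mirror_descent_step(policy, q_estimates, stepsize):
    """One negative-entropy mirror descent update at a single decision point.

    Returns pi'(a) proportional to pi(a) * exp(stepsize * Q(a)), where Q is
    estimated from rollouts.
    """
    logits = np.log(policy) + stepsize * np.asarray(q_estimates)
    logits -= logits.max()  # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Hypothetical usage: three actions, Q-values estimated from rollouts.
pi = np.array([1 / 3, 1 / 3, 1 / 3])
q = np.array([0.1, 0.5, 0.2])
print(mirror_descent_step(pi, q, stepsize=1.0))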
@ssokota
Samuel Sokota
6 months
Thus, alternative frameworks are required to scale search to games with large amounts of hidden information. In our ICLR paper, we present such a framework that we call the framework of update equivalence. 7/N
1
0
8
@ssokota
Samuel Sokota
6 months
However, step 1) differs between perfect-info games & imperfect-info games. With perfect info, constructing subgames is straightforward (eg, for Go, start the game from the current board). In contrast, with imperfect info, it’s not obvious how to handle hidden information. 4/N
1
0
7
@ssokota
Samuel Sokota
4 months
This line of work was made possible by algorithms, such as that of @murat_kocaoglu_ et al. [2017], for minimum-entropy coupling, which is equivalent to MI maximization for two marginals, due to the equality below. 8/N
[Image: the equality relating MI maximization to minimum-entropy coupling]
1
1
9
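The equality in question follows from the standard decomposition of mutual information: since the marginal entropies are fixed by the marginals, maximizing MI over couplings is the same as minimizing the joint entropy, i.e., minimum-entropy coupling.

I(X;Y) = H(X) + H(Y) - H(X,Y)
\quad\Longrightarrow\quad
\arg\max_{p_{XY}:\,\text{fixed marginals}} I(X;Y) \;=\; \arg\min_{p_{XY}:\,\text{fixed marginals}} H(X,Y).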
@ssokota
Samuel Sokota
4 months
This is an idea we’ve been developing over a series of works. It originated from a question @j_foerst formulated about whether it’s possible to communicate information through Markov decision process trajectories. 5/N
1
0
7
@ssokota
Samuel Sokota
1 year
In tabular experiments, MMD is competitive with CFR – a widely-used algorithm for game solving – in terms of exploitability (the deficit against a best response). This is the first time a standard RL approach has produced performance competitive with CFR. 8/N
[Image: tabular exploitability results]
1
0
7
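For reference, one common way to write the exploitability of a profile \pi (conventions differ; some definitions average over players):

\mathrm{expl}(\pi) \;=\; \sum_i \Big( \max_{\pi_i'} u_i(\pi_i', \pi_{-i}) - u_i(\pi_i, \pi_{-i}) \Big),

i.e., the total deficit of \pi relative to best responses; it is zero exactly at a Nash equilibrium.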
@ssokota
Samuel Sokota
1 year
By visualizing MMD’s behavior in the rock-paper-scissors game from before, we can see the qualitative benefit of this stability in practice. 7/N
1
0
6
@ssokota
Samuel Sokota
6 months
We test this algorithm, which we call mirror descent search (MDS), in Hanabi—the most competitive benchmark for search in cooperative games. We show MDS exceeds or matches the performance of PBS methods while using 100x less search time. 11/N
1
0
6
@ssokota
Samuel Sokota
6 months
Because there are only 52 choose 2 hands in Texas Hold’em, the search player can reason about all of these hands. However, as the amount of hidden information grows, this approach quickly becomes infeasible. 6/N
1
0
6
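Concretely, the number of possible private hands in Texas Hold’em is

\binom{52}{2} = \frac{52 \cdot 51}{2} = 1326,

which is small enough for the search player to enumerate exhaustively.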
@ssokota
Samuel Sokota
6 months
PBS methods address this issue by using subgames that include every decision point consistent with public knowledge. In Texas Hold’em, this means that each subgame involves not only the hand the search player was actually dealt, but also every other conceivable hand. 5/N
1
0
5
@ssokota
Samuel Sokota
1 year
With this context in mind, our results make us optimistic about the prospect of self-play RL becoming the go-to approach for large adversarial settings. 12/N
1
0
5
@ssokota
Samuel Sokota
6 months
The rough idea is similar to search for perfect-information games: 1) Construct a subgame. 2) Determine a good policy for the subgame. 3/N
1
0
5
@ssokota
Samuel Sokota
1 year
Because of MMD’s close proximity to PPO, it is easy to implement as a deep RL algorithm by modifying existing codebases. In deep RL experiments, MMD becomes more robust to DQN adversaries over the course of training. 9/N
[Image: robustness to DQN adversaries over the course of training]
2
0
5
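A minimal sketch of how such a modification might look on top of a PPO-style loss (hypothetical names and regularizer placement; not the authors' reference implementation): add a KL penalty pulling the policy toward the magnet alongside the usual clipped surrogate.

import torch

def mmd_style_loss(logp_new, logp_old, advantages, logits_new, magnet_probs,
                   clip_eps=0.2, alpha=0.05):
    """Clipped policy-gradient surrogate plus a KL(pi || magnet) penalty."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    probs_new = torch.softmax(logits_new, dim=-1)
    kl_to_magnet = (probs_new * (probs_new.clamp_min(1e-12).log()
                                 - magnet_probs.clamp_min(1e-12).log())).sum(-1).mean()

    return -surrogate + alpha * kl_to_magnet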
@ssokota
Samuel Sokota
6 months
The core idea is simple: replicate the updates of last-iterate algorithms with desirable properties, such as improvements in expected return, decreases in exploitability, or convergence to an equilibrium. 8/N
1
0
5
@ssokota
Samuel Sokota
1 year
This work was done in collaboration with @ma5_mingwei and Jizhou Liu, who jointly led the project, and @maxhkw and @j_foerst, who jointly supervised it. If you are interested in learning more, check out the paper below or come chat at ICML. N/N
0
1
5
@ssokota
Samuel Sokota
1 year
If you’re interested in learning more about our fix or any of these directions, check out the paper or reach out with questions. 12/N
1
0
5
@ssokota
Samuel Sokota
1 year
While there are other approaches to adversarial games besides self play (e.g., best response oracles, regret minimization, or public belief states), they tend to be difficult to scale or slow to converge and can be complicated to implement. 11/N
1
0
4
@ssokota
Samuel Sokota
6 months
This performance gap widens as the amount of hidden information increases, reflecting MDS’s superior scalability compared to PBS methods. 12/N
1
0
4
@ssokota
Samuel Sokota
4 months
The effectiveness of MI maximization for communication led us to investigate its security guarantees. We proved that, for encrypted secret messages, it is perfectly secure, meaning it is statistically impossible to detect. 6/N
1
0
4
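Perfect security here is presumably the information-theoretic notion (à la Cachin): the distribution of transmitted content when steganography is used is identical to the innocuous covertext distribution, so no statistical test can tell them apart:

D_{\mathrm{KL}}\!\left(P_{\text{covertext}} \,\Vert\, P_{\text{stegotext}}\right) = 0.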
@ssokota
Samuel Sokota
1 year
Our ICLR paper shows there is a simple way around this problem that builds on top of mirror descent. 3/N
2
0
4
@ssokota
Samuel Sokota
1 year
As outlined in the video, magnetic mirror descent (MMD) takes the following form, with the additional regularization term shown in blue. The key to getting convergence is to set the stepsize η to be sufficiently small relative to the temperature α. 5/N
[Image: the MMD update, with the additional regularization term in blue]
1
0
4
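For readers without the image, a reconstruction of the update from the surrounding description (stepsize \eta, temperature \alpha, magnet policy \rho; notation may differ slightly from the paper's):

\pi_{t+1} \;=\; \arg\max_{\pi \in \Delta}\; \eta\,\langle \pi, Q_t\rangle \;-\; \eta\alpha\,\mathrm{KL}(\pi \,\Vert\, \rho) \;-\; \mathrm{KL}(\pi \,\Vert\, \pi_t),

whose closed form is \pi_{t+1}(a) \propto \big[\pi_t(a)\, e^{\eta Q_t(a)}\, \rho(a)^{\eta\alpha}\big]^{1/(1+\eta\alpha)}. The middle term is the pull toward the magnet, and the stated condition is that \eta be sufficiently small relative to \alpha.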
@ssokota
Samuel Sokota
6 months
Since last-iterate algorithms do not typically rely on PBSs, the framework of update equivalence scales naturally to games with large amounts of hidden information. 9/N
1
0
4
@ssokota
Samuel Sokota
1 year
We also find that it reliably outperforms PPO with RLlib’s default hyperparameters, NFSP with OpenSpiel’s default hyperparameters, and simple bots in head-to-head matchups. 10/N
[Image: head-to-head results against PPO, NFSP, and simple bots]
1
0
4
@ssokota
Samuel Sokota
1 year
This transformer takes both features of the state (red rectangular prisms below) and features of a particular action (blue rectangular prism below) as input. A separate forward pass is performed for each action to construct the Q-value or logit vector. 7/N
[Image: state features (red) and action features (blue) fed to the transformer]
1
0
4
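A minimal sketch of that input scheme (hypothetical shapes and names; a small MLP stands in for the transformer to keep the example short): each action gets its own forward pass over the concatenated state and action features, yielding one scalar per action.

import torch
import torch.nn as nn

class ActionFeatureQNet(nn.Module):
    """Scores each action with a separate forward pass over (state, action) features."""

    def __init__(self, state_dim, action_feat_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_feats, action_feats):
        # state_feats: (state_dim,); action_feats: (num_actions, action_feat_dim)
        num_actions = action_feats.shape[0]
        state_rep = state_feats.unsqueeze(0).expand(num_actions, -1)
        per_action = torch.cat([state_rep, action_feats], dim=-1)
        return self.net(per_action).squeeze(-1)  # (num_actions,) Q-values or logits

# Hypothetical usage: 16 state features, 5 actions with 8 features each.
net = ActionFeatureQNet(state_dim=16, action_feat_dim=8)
q_values = net(torch.randn(16), torch.randn(5, 8))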
@ssokota
Samuel Sokota
4 months
If you’re interested in learning more, check out our package and the papers below, reach out with questions, or come chat at UAI or ICML. N/N
0
1
3
@ssokota
Samuel Sokota
1 year
We prove that there is a simpler solution to this problem, free of these undesirable properties and amenable to model-free RL: entropy regularization. 9/N
1
1
3
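Concretely, against a fixed opponent, the entropy-regularized best response to action values Q with temperature \alpha is the softmax

\pi^{*}(a) \;\propto\; \exp\!\big(Q(a)/\alpha\big),

which is unique because \langle \pi, Q\rangle + \alpha H(\pi) is strictly concave in \pi; this uniqueness is what removes the freedom to “slack off.”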
@ssokota
Samuel Sokota
4 months
Steganography is the practice of encoding secret information within seemingly innocuous content. Like cryptography, it enables sending sensitive information over public channels. But unlike cryptography, it hides the very fact that sensitive communication is taking place. 3/N
1
0
4
@ssokota
Samuel Sokota
1 year
Humans use these and other feature-based associations to flexibly communicate in common sense ways. For example, when someone holds up 2 fingers, it’s meant to indicate the number 2 (rather than some other number). 3/N
1
0
3
@ssokota
Samuel Sokota
1 year
These policies achieve human-level zero-shot coordination with other humans, as shown in the table below. Here, zero-shot means that participating humans had no knowledge about the policies of their current or past partners. 9/N
[Table: zero-shot coordination results with human partners]
1
0
3
@ssokota
Samuel Sokota
1 year
On the other hand, reinforcement learning (RL) agents don’t usually learn these feature correspondences: In a communication game, two RL agents might learn to use 2 fingers to indicate the number 3 (or another arbitrary number). 4/N
1
0
3
@ssokota
Samuel Sokota
1 year
If you picked Option 1, you’re in the majority! This outcome is called the bouba/kiki effect: It illustrates that humans associate features of sounds with features of shapes. 2/N
1
0
3
@ssokota
Samuel Sokota
1 year
We call this resulting approach magnetic mirror descent. The main idea is to add an additional regularization term that pulls the policy toward a second policy called the magnet. 4/N
1
0
3
@ssokota
Samuel Sokota
4 months
These works were co-led by @casdewitt, in collaboration with @dylanjsam, Spencer Compton, @MaxiIgl, @luisa_zintgraf, @philiptorr, @shimon8282, @MasorX, @zicokolter, and @j_foerst. 9/N
1
0
3
@ssokota
Samuel Sokota
1 year
Thanks to @MetaAI for supporting this research, which was conducted during an internship supervised by @polynoamial, in collaboration with @RyanDOrazio, @ChunKaiLing1, @lightvector1, and @zicokolter. N/N
0
0
3
@ssokota
Samuel Sokota
4 months
We can do steganography by maximizing the mutual information between a distribution of innocuous content and a distribution of secret messages. Encoding and decoding are performed using the associated conditional distributions. 4/N
1
0
3
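A toy sketch of that encode/decode scheme (hypothetical names and a tiny hand-written coupling; real systems work over language-model token sequences): encode by sampling from P(covertext | message), decode via P(message | covertext).

import random

# Hypothetical joint coupling over (secret message, covertext); its marginals
# match the secret-message distribution and the innocuous-content distribution.
coupling = {
    ("m0", "hello"): 0.4,
    ("m0", "hi"): 0.1,
    ("m1", "hi"): 0.3,
    ("m1", "hey"): 0.2,
}

def encode(message):
    """Sample a covertext from P(covertext | message) under the coupling."""
    rows = {c: w for (m, c), w in coupling.items() if m == message}
    covertexts, weights = zip(*rows.items())
    return random.choices(covertexts, weights=weights)[0]

def decode(covertext):
    """Return the most likely message under P(message | covertext)."""
    rows = {m: w for (m, c), w in coupling.items() if c == covertext}
    return max(rows, key=rows.get)

cover = encode("m1")
print(cover, decode(cover))  # decoding recovers "m1" for this coupling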
@ssokota
Samuel Sokota
1 year
What is imperfect information? Here, imperfect information means that one player has information that another does not (such as in poker) or that two players act simultaneously (like in rock-paper-scissors). 2/N
1
0
2
@ssokota
Samuel Sokota
1 year
Unfortunately, this insight does not work for adversarial games because a player can safely “slack off” later in the game: by that point, the player’s opponent has already committed to part of its policy. 5/N
1
0
3
@ssokota
Samuel Sokota
1 year
One reason a deep RL agent may lack the inductive bias to learn human-like policies is because its network only takes state features as input. With this architecture, the agent has no information about action features, which are crucial for learning feature correspondences. 5/N
[Image: architecture taking only state features as input]
1
0
3
@ssokota
Samuel Sokota
1 year
In short, entropy regularization fixes the problem because it induces unique best responses. As a result, a player moving later in the game is penalized for “slacking off” even if its opponent has already committed to a policy. 10/N
1
1
2
@ssokota
Samuel Sokota
4 months
Most recently, in a paper to appear at UAI, we extended the approach to unencrypted secret messages, for which it’s possible to achieve high rates of information throughput, as shown in the video at the top of the thread. 7/N
1
0
2
@ssokota
Samuel Sokota
1 year
In our ICML paper, we investigate whether a different architecture induces the right inductive bias for RL agents to learn human-like feature correspondences. We answer in the affirmative for a particular kind of transformer. 6/N
1
0
2
@ssokota
Samuel Sokota
1 year
The intuition is that the regularization toward the magnet stabilizes the learning dynamics. The smaller the strength of the regularization, the smaller the stepsize need be to maintain stability. 6/N
1
0
2
@ssokota
Samuel Sokota
1 year
@faoliehoek @often_wang @aukejw @DiederikRo It’s very related! Roughly speaking, it shows that computing certain regularized equilibria can be reduced to solving for the analogous equilibria of the centralized model that you, @aukejw, and @DiederikRo introduced.
1
0
2
@ssokota
Samuel Sokota
1 year
What does it mean that self-play RL fails? Ideally we want RL to find policies that are hard for opponents to exploit. Unfortunately, as shown in the video above, even in simple games like rock-paper-scissors, RL can diverge away from such policies. 2/N
1
0
2
@ssokota
Samuel Sokota
1 year
This fix opens the door to numerous promising research directions, including: 1. New methods for tabular game solving. 2. BAD-like methods for adversarial settings. 3. Simpler (and perhaps better) search-based AI for games like poker. 11/N
1
0
2
@ssokota
Samuel Sokota
1 year
@bronzeagepapi They have important similarities. R-NaD is based on the algorithm from the paper referenced in tweet 14 (friction FoReL), which was an inspiration for the MMD project. The relationship between friction FoReL and MMD is related to that of FTRL and MD.
0
1
2
@ssokota
Samuel Sokota
1 year
To avoid this issue, poker AIs like DeepStack, Libratus, ReBeL, and Player of Games require complicated machinery with undesirable properties, such as discontinuous functions. 7/N
1
0
2
@ssokota
Samuel Sokota
1 year
The idea of abstracting imperfect information away from games was pioneered by Nayyar et al., who showed that having players announce their policies as they play removes imperfect information from common-payoff games. 3/N
1
0
2
@ssokota
Samuel Sokota
4 months
What is steganography? And how does maximizing mutual information enable it? 2/N
1
0
3
@ssokota
Samuel Sokota
1 year
In coordination tasks, we find that DQN trained in self play with this transformer finds policies that are intuitive to humans. They employ coordination techniques such as feature similarity and implicatures (illustrated below). 8/N
[Image: illustration of feature similarity and implicatures]
1
0
2
@ssokota
Samuel Sokota
1 year
For future work, directions include: 1. Examining whether the architecture learns intuitive policies in more complex settings. 2. Learning (rather than assuming) features that facilitate intuitive policies. 3. Understanding why the architecture induces this inductive bias. 10/N
1
0
2
@ssokota
Samuel Sokota
1 year
Furthermore, this additional machinery requires search, precluding the application of model-free RL. 8/N
1
0
1
@ssokota
Samuel Sokota
6 months
@template_rex @polynoamial @metaai @gabrfarina @lightvector1 @HengyuanH @often_wang @zicokolter MCS is multi-ply in that its rollouts can look multiple moves ahead, but it's 1-ply in the sense that it doesn't update its policy at future decision points during search.
1
0
1
@ssokota
Samuel Sokota
6 months
@template_rex @polynoamial @metaai @gabrfarina @lightvector1 @HengyuanH @often_wang @zicokolter Replacing CFR w/ MMD in SoG would work, but it still wouldn't scale to games w/ lots of hidden info b/c of the PBS subgames. Pursuing search algos that work like PBS methods when there's little hidden info, but also scale to lots of hidden info, is a cool direction, though.
0
0
1
@ssokota
Samuel Sokota
1 year
To illustrate, consider the rock-paper-scissors game below. If Blue announces that it is playing uniform (the Nash equilibrium), then any policy is a best response for Red. As a result, Red can “slack off” and select a bad policy without any risk. 6/N
[Image: rock-paper-scissors game]
1
0
1
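A quick check of that claim, assuming standard rock-paper-scissors payoffs: against a uniform opponent, every action of Red has expected payoff 0, so every policy of Red is a best response.

# Red's payoffs; rows are Red's action, columns are Blue's (rock, paper, scissors).
payoff_red = [
    [0, -1, 1],   # rock ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]
blue_uniform = [1 / 3, 1 / 3, 1 / 3]

expected = [sum(p * b for p, b in zip(row, blue_uniform)) for row in payoff_red]
print(expected)  # [0.0, 0.0, 0.0] -> any mixture of Red's actions achieves value 0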