Interested in learning the mathematical foundations of Reinforcement Learning (RL)? Now is a good time! This semester, we are making the videos and lecture notes from my graduate-level RL theory course at Princeton available to the public. Here is week 1:
Announcing our new work, which shows that the transformer architecture can be a lot worse than RNNs at modeling sequences with long-term correlations, such as HMMs, and how to potentially fix it. Joint work with amazing collaborators Jiachen Hu and @qinghual2020.
Announcing our new paper, which studies OOD generalization under well-specified covariate shift and proves that, surprisingly, vanilla MLE without any importance weights is the best algorithm!
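To make the (perhaps counterintuitive) claim concrete, here is a minimal simulation sketch in a toy well-specified linear-Gaussian model. The setup and all numbers are my own illustration, not the paper's experiments: both vanilla MLE (ordinary least squares) and importance-weighted MLE are consistent when the model is well specified, but the weights only inflate variance.

```python
# Toy illustration (my own setup, not the paper's): under a well-specified
# model, vanilla MLE tends to beat importance-weighted MLE under covariate shift.
import numpy as np

rng = np.random.default_rng(0)
w_true, n, trials = 1.5, 200, 500

def fit(x, y, weights=None):
    # (Weighted) least squares = (importance-weighted) Gaussian MLE for y = w*x + noise.
    if weights is None:
        weights = np.ones_like(x)
    return np.sum(weights * x * y) / np.sum(weights * x * x)

risk_mle, risk_iw = [], []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, n)                    # train covariates ~ N(0, 1)
    y = w_true * x + rng.normal(0.0, 1.0, n)       # well-specified linear model
    iw = np.exp(-0.5 * ((x - 2.0) ** 2 - x ** 2))  # density ratio N(2,1)/N(0,1)
    x_test = rng.normal(2.0, 1.0, 2000)            # shifted test covariates ~ N(2, 1)
    for w_hat, out in ((fit(x, y), risk_mle), (fit(x, y, iw), risk_iw)):
        out.append(np.mean(((w_hat - w_true) * x_test) ** 2))

print(f"vanilla MLE excess test risk: {np.mean(risk_mle):.4f}")
print(f"IW-MLE      excess test risk: {np.mean(risk_iw):.4f}")
```

The intuition: when the model class contains the truth, reweighting buys no bias reduction, so the extra variance from the weights is a pure loss.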
In the summer, @YuanhaoWang3, @qinghual2020, and I proposed learning Nash equilibria from human feedback in theory (inspired by the dueling-bandit literature). It's truly fascinating to witness the idea being used to create innovative, practically useful algorithms for LLMs.
Fast-forward ⏩ alignment research from @GoogleDeepMind! Our latest results enhance alignment outcomes in Large Language Models (LLMs). Presenting NashLLM!
Can we match the performance of optimally tuned SGD without knowing the problem parameters, including the diameter, Lipschitz/smoothness constants, and noise levels? See our recent work led by our amazing student @ahmedkhaledv2.
While many existing results in game theory focus on finding equilibria, an equally important goal is to learn rationalizable behaviors, which avoid iteratively dominated actions.
Ever wonder how to play multiplayer games (>2 players, such as Mahjong or Poker) well, and what the ultimate solution would be? Check out our paper on why classical equilibria and existing self-play systems are not enough, and how to address it:
Distinguished Professor Michael Jordan recently sat down with Barbara Rosario (Ph.D. I-School, 2005) for a series called "AI Stories." Jordan's episode was filmed in Trieste, Italy, and titled "Lives, Loves, and Technology." #BerkeleyStats
Ever wonder what the principled approach is to directly using existing standard (reward-based) RL techniques to handle RL from preferences? I will talk about "Is RLHF More Difficult than Standard RL?" at the ICML workshop "The Many Facets of Preference-based Learning" today.
@ ICML till Friday, happy to catch up! (Be mindful when using the Vienna metro: purchasing a ticket is not enough. Validate the ticket at a tiny blue machine before boarding, or face a fine of €100+. There is no mercy for first-time foreign travelers unaware of the regulation 😂)
Feel free to stop by our posters on RL theory and parameter-free optimization at NeurIPS! Our amazing student Qinghua Liu (@qinghual2020) is also on the job market this year.
Do you know how to *provably* solve multiagent reinforcement learning problems under partial observability? Check out our NeurIPS poster at Hall J #616 on Wed 4pm, which gives the first sample-efficient solution for learning partially observable Markov games.
Very glad to meet my old friend @shaneguML again! Princeton should also have a cat cafe. Very good business! It erases stress and heals your heart. :-)
@chijinML and I have known each other for 10 years, but the guy in the middle is a random French data scientist from Israel. *the badge is a time log, not a name tag*
Pro tip in Tokyo: if you want to meet random data scientists, @GoogleAI researchers, or @Princeton professors, go to a cat cafe.
We are excited to announce our recent work with @YuanhaoWang3, Dingwen Kong, and @yubai01, which presents new algorithms and the first sample-efficient guarantees for learning rationalizable equilibria.
CALL FOR PAPERS!!!
Excited to organize our new workshop on "The Many Facets of Preference-based Learning" at this year's ICML in Hawaii, with my amazing co-organizers @BengsViktor, Robert Busa-Fekete, Mohammad Ghavamzadeh, and Branislav Kveton.
Website:
Out-of-distribution (OOD) generalization is a core challenge in modern ML/foundation models. We consider the well-specified setting, as modern ML systems typically use very large and expressive models. This is joint work with @EmilyJge, Shange Tang, Jianqing Fan, and Cong Ma.
@RuntianZhai Thanks for pointing out your nice paper! While the two works prove related phenomena, the settings and underlying mechanisms are orthogonal and in some sense complement each other. We will add a comparison to your work in our next version.
@bremen79 @durdi4 @ahmedkhaledv2 [1] Thanks for your comments! I would also like to quickly remark on a few points.
A. Having coarse estimates of upper/lower bounds on the parameters is common in ML applications, as most optimizers have either explicit or implicit regularization.
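A generic illustration of how such coarse bounds are typically exploited (a sketch of the standard log-grid idea on a toy problem of my own, not necessarily our paper's algorithm): searching step sizes over a logarithmic grid between the coarse bounds multiplies the work by only about log(upper/lower).

```python
# Sketch (not our paper's algorithm): with only coarse bounds [lo, hi] on the
# right step size, a log-spaced grid search pays roughly a log(hi/lo) factor
# of extra work instead of requiring the unknown optimal value.
import numpy as np

def sgd(step, grad, x0, T=1000):
    x = x0
    for _ in range(T):
        x = x - step * grad(x)
        if not np.isfinite(x):  # this step size diverged; abandon it
            return np.inf
    return x

def tuning_free_sgd(grad, loss, x0, lo=1e-6, hi=1e2, T=1000):
    # O(log(hi/lo)) candidate step sizes; keep the one with the best final loss.
    steps = np.geomspace(lo, hi, num=int(np.ceil(np.log10(hi / lo))) + 1)
    return min((sgd(s, grad, x0, T) for s in steps), key=loss)

# Toy quadratic f(x) = 0.5 * a * x^2, with curvature a unknown to the tuner.
a = 7.3
x_final = tuning_free_sgd(grad=lambda x: a * x, loss=lambda x: 0.5 * a * x * x, x0=5.0)
print(f"final iterate: {x_final:.2e}")
```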
@iandanforth Those pathological behaviors can be the result of many factors. We will cover principled approaches to handling exploration, function approximation, and, later on, multiagent and partially observable settings, which relate to addressing some of the bad behaviors you mentioned.
@BishPlsOk Our paper concerns the basic goal of just winning the game, which might be the first step toward addressing those games, and is already highly non-trivial in the current context. I agree there are many other important aspects of practical games beyond just winning that are worth further study.
@bremen79 @durdi4 @ahmedkhaledv2 [3] B. The lower-order term is important, especially in stochastic optimization, given that the improvement brought by momentum in SGD also appears only in the lower-order terms.
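For context (a generic form, hedging on constants and exact exponents, which vary across papers), nonconvex smooth stochastic bounds often look like
$$\mathbb{E}\,\|\nabla f(x_{\mathrm{out}})\| \;\lesssim\; \frac{(\sigma^2 L \Delta)^{1/4}}{T^{1/4}} \;+\; \frac{(L \Delta)^{1/2}}{T^{1/2}},$$
so two methods with the same leading $T^{-1/4}$ term can still differ meaningfully in the lower-order $T^{-1/2}$ term at practical $T$.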
@bremen79 @durdi4 @ahmedkhaledv2 [4] In practice, T is often not large enough to enter the asymptotic regime: it may not be significantly larger than the high-order polynomial dependence on problem parameters that appears in the lower-order terms of prior works.
@PandaAshwinee @ahmedkhaledv2 The final goals are similar, but there are multiple versions of "parameter-free" used in prior work. We call it tuning-free to make our definition more formal/rigorous and to avoid confusion with prior definitions. The meat is in the algorithms and results, not the definition.
@bremen79 @durdi4 @ahmedkhaledv2 [2] Don't know D_upper? We can simply set it extremely large, so that the algorithm will never exceed that limit in practice. This is fine, since we only pay extra log factors.
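A hedged sketch of why this is cheap (a generic parameter-free-style bound shape, not our exact theorem): guarantees of the form
$$\mathrm{Regret}_T \;\lesssim\; D \sqrt{T \, \log\!\big(D_{\mathrm{upper}} / D_{\mathrm{lower}}\big)}$$
mean that inflating $D_{\mathrm{upper}}$ from, say, $10^3$ to $10^9$ (with $D_{\mathrm{lower}} = 1$) costs only about a $\sqrt{3}$ factor.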