Ted Moskovitz

@ted_moskovitz

1,086 Followers
219 Following
32 Media
287 Statuses

multimodal pretraining at @AnthropicAI. previously: RL at @GatsbyUCL and @GoogleDeepMind

London, UK
Joined October 2019
@ted_moskovitz
Ted Moskovitz
2 years
Tired of ChatGPT screenshots? Miss the old days of watching RL agents walking around doing weird stuff? Look no further–I'm excited to share my @DeepMind internship project, where we develop a method to stabilize optimization in constrained RL. Link: 🧵
10
64
723
@ted_moskovitz
Ted Moskovitz
3 months
I joined the multimodal team at @AnthropicAI this week—really excited to build some cool stuff!
18
2
218
@ted_moskovitz
Ted Moskovitz
4 years
(1/2) I’ve uploaded notes I took for the Gatsby theoretical neuro course here: . I hope they can be useful for anyone interested in theoretical neuroscience (or anyone who’s very bored in quarantine)!
4
28
166
@ted_moskovitz
Ted Moskovitz
10 months
Worried your LLM will produce too many paperclips? Simply tell it when to stop – excited to share our new preprint, where we introduce an approach based on constrained RL to avoid overoptimization for compound reward models: 1/
Tweet media one
1
21
120
@ted_moskovitz
Ted Moskovitz
2 years
In honor of Twitter’s role as Research-LinkedIn, I’m excited to say that next week I’m starting an internship at @DeepMind working with @TZahavy along with @bodonoghue85 and the rest of the Discovery team—really looking forward to it!!
8
2
115
@ted_moskovitz
Ted Moskovitz
2 years
Very excited that our work linking dual process cognition, multitask RL, and the minimum description length principle was accepted at #ICLR2023 ! OpenReview link: Very grateful for my awesome coauthors @kao_calvin , Maneesh Sahani, & Matt Botvinick
2
14
86
@ted_moskovitz
Ted Moskovitz
7 months
Really happy to say that constrained RLHF was accepted at #ICLR2024 as a spotlight! A million thanks to my coauthors @Aaditya6284 , @djstrouse , Tuomas Sandholm, @rsalakhu , @ancadianadragan , and @McaleerStephen . Looking forward to seeing everyone in Vienna!
@ted_moskovitz
Ted Moskovitz
10 months
Worried your LLM will produce too many paperclips? Simply tell it when to stop – excited to share our new preprint, where we introduce an approach based on constrained RL to avoid overoptimization for compound reward models: 1/
Tweet media one
1
21
120
2
9
74
@ted_moskovitz
Ted Moskovitz
4 years
(1/n) Ever wonder how trust regions connect to natural gradients in RL? Like the way ‘Wasserstein’ rolls off the tongue? You might like “Efficient Wasserstein Natural Gradients for Reinforcement Learning” — w/ @MichaelArbel , @fhuszar , & @ArthurGretton
1
21
65
@ted_moskovitz
Ted Moskovitz
3 years
The successor rep. (SR) *sums* discounted state occupancies to speed up policy eval + learning. But in naturalistic tasks, reward may only be available upon *first* access to a state. In our #ICLR2022 paper, we introduce the first-occupancy rep. (FR) 1/
1
10
51
@ted_moskovitz
Ted Moskovitz
2 months
Really excited this is out!
@AnthropicAI
Anthropic
2 months
Introducing Claude 3.5 Sonnet—our most intelligent model yet. This is the first release in our 3.5 model family. Sonnet now outperforms competitor models on key evaluations, at twice the speed of Claude 3 Opus and one-fifth the cost. Try it for free:
Tweet media one
451
2K
7K
1
1
48
@ted_moskovitz
Ted Moskovitz
1 year
Happy to say ReLOAD hopped its way into #ICML2023 —looking forward to seeing everyone in Hawaii!!
@ted_moskovitz
Ted Moskovitz
2 years
Tired of ChatGPT screenshots? Miss the old days of watching RL agents walking around doing weird stuff? Look no further–I'm excited to share my @DeepMind internship project, where we develop a method to stabilize optimization in constrained RL. Link: 🧵
10
64
723
2
3
37
@ted_moskovitz
Ted Moskovitz
3 years
Really happy to share that our paper “Towards an Understanding of Default Policies in Multitask Policy Optimization” was accepted as an oral at @AISTATS ! Paper: Code: w/ @MichaelArbel , @jparkerholder , @aldopacchiano 1/n
2
7
36
@ted_moskovitz
Ted Moskovitz
1 year
Really excited to be at #ICML2023 in Hawaii, a great place to relax, "ReLOAD," and (ofc) stand inside talking about LLMs! If you like safe, efficient RL agents that obey constraints, come by Tues. @ 5 in Hall 1 + check out our demo: Hope to see you there!
@ted_moskovitz
Ted Moskovitz
2 years
Tired of ChatGPT screenshots? Miss the old days of watching RL agents walking around doing weird stuff? Look no further–I'm excited to share my @DeepMind internship project, where we develop a method to stabilize optimization in constrained RL. Link: 🧵
10
64
723
0
7
36
@ted_moskovitz
Ted Moskovitz
4 months
Check out our new paper using neuroscience-inspired methods to study in-context learning circuits in transformers 🧠 Amazing job by @Aaditya6284 leading the project!
@Aaditya6284
Aaditya Singh
4 months
In-context learning (ICL) circuits emerge in a phase change... Excited for our new work "What needs to go right for an induction head (IH)?" We present "clamping", a method to causally intervene on dynamics, and use it to shed light on IH diversity + formation. Read on 🔎⏬
2
44
198
1
4
34
@ted_moskovitz
Ted Moskovitz
2 years
Excited to say that our #AISTATS2022 paper “Towards an Understanding of Default Policies in Multitask Policy Optimization” was given an Honorable Mention for Best Paper! If you’re interested in hearing more (or are very bored), stop by our poster tomorrow at 4:30 BST 1/
@ted_moskovitz
Ted Moskovitz
3 years
Really happy to share that our paper “Towards an Understanding of Default Policies in Multitask Policy Optimization” was accepted as an oral at @AISTATS ! Paper: Code: w/ @MichaelArbel , @jparkerholder , @aldopacchiano 1/n
2
7
36
2
8
34
@ted_moskovitz
Ted Moskovitz
4 years
happy to say our paper was accepted @iclr_conf ! we hope anyone interested in RL or optimization will find it interesting. we’ve released our implementation of WNPG (), and should have WNES out soon as well!
@ted_moskovitz
Ted Moskovitz
4 years
(1/n) Ever wonder how trust regions connect to natural gradients in RL? Like the way ‘Wasserstein’ rolls off the tongue? You might like “Efficient Wasserstein Natural Gradients for Reinforcement Learning” — w/ @MichaelArbel , @fhuszar , & @ArthurGretton
1
21
65
3
9
30
@ted_moskovitz
Ted Moskovitz
3 years
If you’re either a) interested in hierarchical learning + optimism in deep RL or b) very bored, come hang out at our #NeurIPS2021 poster for TOP this Thursday from 8:30-10am GMT! Prizes to be distributed* for anyone showing up at 3:30am EST. *not guaranteed
@ted_moskovitz
Ted Moskovitz
3 years
(1/n) Pessimism stabilizes performance in deep RL. But optimism aids exploration. Confused? Tired of seeing your agents at the bottom? Let them figure it out themselves with TOP: w/ @jparkerholder , @aldopacchiano , @MichaelArbel , and Michael Jordan
1
6
24
2
7
27
@ted_moskovitz
Ted Moskovitz
3 years
(1/n) Pessimism stabilizes performance in deep RL. But optimism aids exploration. Confused? Tired of seeing your agents at the bottom? Let them figure it out themselves with TOP: w/ @jparkerholder , @aldopacchiano , @MichaelArbel , and Michael Jordan
1
6
24
@ted_moskovitz
Ted Moskovitz
3 years
Happy to say that TOP was accepted to NeurIPS! If you’re interested in optimism and pessimism in deep RL, you should check it out. If not, if you’re desperate to add another arXiv tab to your browser that you’ll eventually close without reading, you should also check it out.
@ted_moskovitz
Ted Moskovitz
3 years
(1/n) Pessimism stabilizes performance in deep RL. But optimism aids exploration. Confused? Tired of seeing your agents at the bottom? Let them figure it out themselves with TOP: w/ @jparkerholder , @aldopacchiano , @MichaelArbel , and Michael Jordan
1
6
24
1
3
23
@ted_moskovitz
Ted Moskovitz
2 years
To guarantee that the policy will converge to a feasible solution with high reward, we use an optimistic gradient approach to derive ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent, a principled method for *last-iterate* convergence in constrained MDPs.
1
0
21
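A minimal sketch of the optimistic-update idea behind ReLOAD, shown on a toy bilinear saddle-point problem rather than a constrained MDP: plain simultaneous gradient descent-ascent drifts away from the saddle point and only its average converges, while the optimistic variant, which extrapolates with the previous gradient, converges in its last iterate. The toy objective, step size, and iteration count are illustrative assumptions, not the paper's setup.

import numpy as np

def grads(x, y):
    # toy saddle objective L(x, y) = x * y
    # dL/dx = y (descent player), dL/dy = x (ascent player)
    return y, x

def run(optimistic, steps=500, lr=0.1):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = grads(x, y)
    for _ in range(steps):
        gx, gy = grads(x, y)
        if optimistic:
            # optimistic step: extrapolate with 2*g_t - g_{t-1}
            x -= lr * (2 * gx - gx_prev)
            y += lr * (2 * gy - gy_prev)
        else:
            # plain gradient descent-ascent: oscillates, no last-iterate convergence
            x -= lr * gx
            y += lr * gy
        gx_prev, gy_prev = gx, gy
    return x, y

print("plain GDA last iterate:     ", run(optimistic=False))
print("optimistic GDA last iterate:", run(optimistic=True))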
@ted_moskovitz
Ted Moskovitz
2 years
I’m incredibly grateful for my coauthors @bodonoghue85 , @vivek_veeriah , @flennerhag , Satinder Singh, and especially my awesome host @TZahavy , as well as my fellow interns @_chris_lu_ and @RobertTLange and the rest of the Discovery Team, who couldn’t have been more welcoming!
2
0
19
@ted_moskovitz
Ted Moskovitz
1 year
This is a really awesome piece of work by @TZahavy and friends -- the idea of quality-diversity as a (partial) answer to bounded rationality is deep, and the strength of the results here underscores the power of this idea. Congrats to the team!
@TZahavy
Tom Zahavy
1 year
I'm super excited to share AlphaZeroᵈᵇ, a team of diverse #AlphaZero agents that collaborate to solve #Chess puzzles and demonstrate increased creativity. Check out our paper to learn more! A quick 🧵(1/n)
Tweet media one
5
72
336
1
4
18
@ted_moskovitz
Ted Moskovitz
4 months
I couldn't make it to #ICLR2024 unfortunately, but the inimitable @McaleerStephen will be presenting our work on constrained RLHF today from 4:30-6:30 in Halle B. If that sounds rewarding, we'd love to collect your human feedback, so please stop by!
@ted_moskovitz
Ted Moskovitz
7 months
Really happy to say that constrained RLHF was accepted at #ICLR2024 as a spotlight! A million thanks to my coauthors @Aaditya6284 , @djstrouse , Tuomas Sandholm, @rsalakhu , @ancadianadragan , and @McaleerStephen . Looking forward to seeing everyone in Vienna!
2
9
74
0
3
17
@ted_moskovitz
Ted Moskovitz
9 months
It turns out that training your transformer for too long can make your model worse at in-context learning 🫣. Super lucky to have been a part of this work led by @Aaditya6284 and @scychan_brains --check it out!
@Aaditya6284
Aaditya Singh
9 months
Training your transformer for longer to get better performance? Be careful! We find that emergent in-context learning of transformers disappears in "The Transient Nature of In-Context Learning in Transformers" (, poster at #NeurIPS2023 ). Read on 🔎⏬
1
31
146
1
2
15
@ted_moskovitz
Ted Moskovitz
2 years
This is because many approaches to constrained RL cast the problem as a min-max game using a Lagrangian formulation. However, standard gradient-based optimization is prone to oscillations on such problems, only converging *on average* across training.
1
0
15
@ted_moskovitz
Ted Moskovitz
2 years
In the process, we also identified a set of particularly challenging domains and constraints in DeepMind Control Suite which we hope will serve as a benchmark--such problems are especially important as the field increasingly looks to use RL to solve real-world problems.
1
1
13
@ted_moskovitz
Ted Moskovitz
4 years
Amortized inference? Old news, boring, lame. Amortized ~learning~? Fresh, fun, all the cool kids are doing it. Come by our virtual poster @icmlconf next week to find out why
@liwenliang
Kevin Li
4 years
We can amortise inference, but can we amortise *learning*? Yes! It's simple: only needs ELBO and least-squares regression. Joint work with @ted_moskovitz Heishiro Kanagawa and Maneesh Sahani paper: video: 1/7
Tweet media one
6
37
235
0
1
13
@ted_moskovitz
Ted Moskovitz
2 years
I was very lucky to get a sneak peek at AdA as an intern--really amazing stuff and congrats to the team!
@FeryalMP
Feryal
2 years
I’m super excited to share our work on AdA: An Adaptive Agent capable of hypothesis-driven exploration which solves challenging unseen tasks with just a handful of experience, at a similar timescale to humans. See the thread for more details 👇 [1/N]
25
266
1K
0
0
13
@ted_moskovitz
Ted Moskovitz
2 years
Say you want your agent to walk while constraining its height below some threshold–you may find that the agent oscillates between either lying down (satisfying the constraint) or walking normally (violating it but maximizing reward).
2
0
13
@ted_moskovitz
Ted Moskovitz
9 months
If you're looking to do a PhD and are interested in sequential decision-making + RL, @aldopacchiano is incredibly smart and amazing to work with--this is a great opportunity!
@aldopacchiano
Aldo Pacchiano
9 months
(1/2) In 2024 I will be joining Boston University as an Assistant Professor in Computing and Data Sciences (CDS). Seeking Ph.D. students passionate about sequential decision making, reinforcement learning, and/or algorithmic fairness.
10
33
272
0
1
13
@ted_moskovitz
Ted Moskovitz
2 years
Really lucky to have been a part of this! I think it’s a nice example of leveraging a specific (but relevant) form of shared structure among tasks to reduce the effective problem size and learn more quickly. Check it out! 👇🏻
@aldopacchiano
Aldo Pacchiano
2 years
(1/2) In this very preliminary work we introduce a model for an important set of transfer RL problems based on the concept of Undo Maps. We propose a distribution matching algorithm to solve transfer RL problems that can be modeled in this formalism.
1
4
30
0
4
11
@ted_moskovitz
Ted Moskovitz
2 years
Feel very lucky to have been included in this awesome work, excited to see that it’s finally out!
@StephensonJones
M Stephenson-Jones
2 years
🚨First preprint from the lab!! All credit goes to @cesca_gst & @HernyMV who co-led this project. We show that movement-related striatal dopamine signals encode an action prediction error, a value free teaching signal. 🧵👇 1/25
6
59
208
0
1
10
@ted_moskovitz
Ted Moskovitz
2 years
We prove the convergence of ReLOAD’s final policy in the convex case and demonstrate its practical performance on a range of discrete and continuous constrained tasks.
2
0
10
@ted_moskovitz
Ted Moskovitz
4 months
@ArmenAgha This is small scale (and I still need to implement support for autoregressive sampling), but I have a simple implementation working on top of @karpathy 's nanoGPT. I've only tested it on Shakespeare, but it seems to reproduce the reported isoFLOP pattern:
Tweet media one
0
3
10
@ted_moskovitz
Ted Moskovitz
8 months
Happening now in Hall B!
Tweet media one
@Aaditya6284
Aaditya Singh
8 months
Excited to be at #NeurIPS2023 this week to present this work (poster #812 ) at Poster Session 3, Wednesday 10:45-12:45. Looking forward to seeing new and old faces! DM if interested in chatting about LLMs, ICL, learning dynamics, mech. interp., data, reasoning, concepts. 😀
2
6
24
0
1
10
@ted_moskovitz
Ted Moskovitz
4 years
Was very lucky to work on this—if you like generative models but have trust issues with approximate inference, this could be the approach for you!
@liwenliang
Kevin Li
4 years
Can't tighten the ELBO bound with approximate inference? Check out Amortised Learning by Wake-Sleep to maximise likelihood without inference. .
0
9
48
0
1
9
@ted_moskovitz
Ted Moskovitz
10 months
One additional downside of using fixed weights between RMs is that tuning these weights is contingent on a specific training duration. By continually updating the Lagrange multipliers to avoid overoptimization, constrained RLHF is more robust to varying training times 6/
Tweet media one
1
0
9
@ted_moskovitz
Ted Moskovitz
2 years
A million thanks to my coauthors @MichaelArbel , @jparkerholder , and @aldopacchiano –this was a serious team effort and they’ve all been amazing to work with 2/
1
0
9
@ted_moskovitz
Ted Moskovitz
3 years
I’m infinitely grateful to my co-authors @tweet_dispencer and Maneesh Sahani. We're really excited about this, and we think there's a lot of potential for cool follow-ups, including more direct neuroscience connections. Check out the code here: 7/7
1
0
8
@ted_moskovitz
Ted Moskovitz
1 year
This project is a really cool example of a simple (but great) idea meeting really impressive engineering—congrats to Chris and the team!
@_chris_lu_
Chris Lu
1 year
Blazingly-fast sequence models enable Meta-RL agents that generalise across a wide range of different tasks! I'm excited to share my @DeepMind internship project, where we look at applying recent advancements in SSM’s to in-context RL! Link: 🧵
7
31
214
1
1
8
@ted_moskovitz
Ted Moskovitz
10 months
Overoptimization is especially challenging for compound RMs because the component RMs are typically correlated, which affects the values at which each RM stops being a good proxy for evaluation ratings—we term these values “proxy points” 3/
Tweet media one
1
0
8
@ted_moskovitz
Ted Moskovitz
10 months
Really excited about this approach and the usefulness of constrained optimization for LLM finetuning, and I owe enormous thanks to my awesome coauthors @Aaditya6284 , @djstrouse , Tuomas Sandholm, @rsalakhu , @ancadianadragan , + @McaleerStephen Code: 8/8
0
0
7
@ted_moskovitz
Ted Moskovitz
10 months
Compound reward models (RMs) combine multiple RMs which each reflect a different aspect of text quality, but there are 2 challenges: 1) determining the weighting for each RM 2) overoptimization 2/
Tweet media one
1
0
7
@ted_moskovitz
Ted Moskovitz
10 months
Both standard PPO + the constrained algs above need multiple runs to either tune fixed RM weightings or predict proxy pts – we also introduce an approach which uses a deriv-free optimizer to tune proxy pt predictions over the course of a single run, saving a bunch of compute 7/
Tweet media one
1
0
7
@ted_moskovitz
Ted Moskovitz
10 months
The approach we propose is fundamentally simple: 1) predict these proxy points, and 2) use them as constraint thresholds in a constrained RL objective to avoid overoptimization. We propose and test several such objectives 4/
Tweet media one
1
0
7
@ted_moskovitz
Ted Moskovitz
10 months
We found that using constraints results in better evaluation performance and reduces overoptimization. We use Lagrangian relaxation, which additionally addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers 5/
Tweet media one
1
0
6
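A hedged sketch of the Lagrangian-relaxation idea in this thread: each component reward model (RM) gets a constraint threshold (its predicted proxy point), and a per-RM Lagrange multiplier acts as a dynamic weight, growing while that RM sits below its threshold and shrinking once the constraint is satisfied. The batch of scores, the softplus parameterization, and the step size are illustrative assumptions; the exact objective and update are in the paper.

import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

# component-RM scores for a batch of sampled responses: shape (batch, num_rms)
rm_scores = np.array([[0.2, 1.1], [0.4, 0.9], [0.1, 1.3], [0.3, 1.0]])
proxy_points = np.array([0.5, 0.8])   # hypothetical predicted thresholds
raw_lam = np.zeros(2)                 # unconstrained params; softplus keeps lambda >= 0
lr = 0.05

for _ in range(200):
    gaps = rm_scores.mean(axis=0) - proxy_points   # >= 0 means the constraint holds
    # the policy's training reward would include sum_i lambda_i * gap_i; the
    # multipliers move in the opposite direction (descent on the Lagrangian),
    # so an RM below its proxy point receives a larger dynamic weight
    raw_lam = raw_lam - lr * gaps * (1.0 / (1.0 + np.exp(-raw_lam)))

print("dynamic RM weights (Lagrange multipliers):", softplus(raw_lam).round(3))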
@ted_moskovitz
Ted Moskovitz
2 years
In our preprint “A Unified Theory of Dual Process Control” (w/ @kevinjmiller10 , Maneesh, & Matt) we find that MDL-C recapitulates a bunch of interesting human and animal behaviors on a range of cognitive tasks, but I’ll save that for another thread!
2
0
6
@ted_moskovitz
Ted Moskovitz
5 months
This is an awesome opportunity— @TZahavy and the team are amazing to work with, so definitely apply if you’re at all interested in these areas!
@TZahavy
Tom Zahavy
5 months
We are looking for brilliant and creative candidates with strong programming skills to join us at the Discovery team at @GoogleDeepMind 🧙 We build AI agents that discover new knowledge using RL, planning and LLMs. DM me if you have specific questions about working with us 🙏
6
33
287
1
0
6
@ted_moskovitz
Ted Moskovitz
3 years
One result that I’m particularly excited about is that the FR facilitates a type of planning which constructs provably shortest paths to goals given a set of base policies, essentially converting the base policies into a set of options with optimal termination conditions. 6/
Tweet media one
1
0
6
@ted_moskovitz
Ted Moskovitz
2 years
How can agents trade off exploiting structure across past tasks against overfitting when future tasks are uncertain? We derive a policy optimization method w/ a learned regularizer trained to distill optimal behaviors across tasks which is itself regularized to be simple.
Tweet media one
1
0
5
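A loose sketch of the two-level structure this tweet describes, not the MDL-C algorithm itself: per-task policies are regularized toward a learned default policy, while the default policy is fit by distilling the task policies and is itself pulled toward a uniform distribution as a stand-in simplicity penalty (the actual method uses a variational, description-length-based formulation). The one-state tasks, rewards, and coefficients are illustrative assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# two toy one-state tasks that share a preferred action
task_rewards = [np.array([1.0, 0.0, 0.2]), np.array([0.8, 0.1, 0.0])]
alpha, beta, lr = 0.3, 0.1, 0.1        # KL-to-default weight, simplicity weight, step size
uniform = np.ones(3) / 3
theta_tasks = [np.zeros(3) for _ in task_rewards]
theta0 = np.zeros(3)                    # logits of the learned default policy pi0

for _ in range(300):
    pi0 = softmax(theta0)
    # (1) each task policy maximizes reward - alpha * KL(pi || pi0)
    for theta, r in zip(theta_tasks, task_rewards):
        pi = softmax(theta)
        shaped = r - alpha * np.log(pi / pi0)
        theta += lr * pi * (shaped - pi @ shaped)
    # (2) pi0 distills the task policies (min sum_t KL(pi_t || pi0)) while a
    #     simplicity penalty beta * KL(pi0 || uniform) keeps it close to uniform
    grad0 = sum(softmax(th) - pi0 for th in theta_tasks)
    logp = np.log(pi0 / uniform)
    grad0 = grad0 - beta * pi0 * (logp - pi0 @ logp)
    theta0 = theta0 + lr * grad0

print("learned default policy:", softmax(theta0).round(3))
print("task policies:", [softmax(th).round(3) for th in theta_tasks])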
@ted_moskovitz
Ted Moskovitz
3 years
This small change gives the FR a number of cool properties, and we demonstrate its use in a range of application areas, including exploration, unsupervised RL–where we extend the FR to continuous domains + func. approximation–planning, and modeling animal escape behavior. 5/
Tweet media one
Tweet media two
1
0
5
@ted_moskovitz
Ted Moskovitz
3 years
As an intuitive example, when planning a route to a restaurant to meet a friend, the FR is largest for a route/policy that arrives at the front door the fastest, whereas the SR is largest for the route which walks past the front door the highest number of times. 4/
Tweet media one
1
1
5
@ted_moskovitz
Ted Moskovitz
3 years
We show that in such situations, the SR overestimates the value of policies which revisit states that may no longer grant reward. To address this, we introduce the FR, a modification of the SR which measures the expected discount to the first occupancy of a state under π. 3/
Tweet media one
1
0
4
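A toy sketch contrasting the SR with the FR introduced here, on a small tabular chain under a fixed policy: the SR sums discounted occupancies, so revisited states keep accumulating, while the FR records only the expected discount at the first occupancy. The definitions follow the tweet's description in simplified form; the chain and discount factor are illustrative.

import numpy as np

gamma = 0.9
# 3-state chain under a fixed policy: 0 -> 1 -> 2 -> 1 -> 2 -> ...
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
n = P.shape[0]

# SR: M = sum_t gamma^t P^t = (I - gamma P)^{-1}
SR = np.linalg.inv(np.eye(n) - gamma * P)

# FR: F[s, s'] = 1 if s == s', else gamma * sum_{s''} P[s, s''] * F[s'', s']
FR = np.zeros((n, n))
for _ in range(200):                     # fixed-point iteration
    FR = np.eye(n) + (1 - np.eye(n)) * (gamma * P @ FR)

print("SR:\n", SR.round(2))   # revisits keep adding discounted occupancy
print("FR:\n", FR.round(2))   # each state is credited only at its first visit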
@ted_moskovitz
Ted Moskovitz
3 years
Many naturalistic tasks involve non-Markovian reward structures. E.g., in foraging, rewards are accessible in a given state only once, and finding the optimal policy with discounted rewards amounts to solving a shortest-path problem. 2/
1
0
4
@ted_moskovitz
Ted Moskovitz
2 years
Our algorithm, Minimum Description Length Control (MDL-C, pronounced “middle-cee”), hits just the right note 🎵😉 – it comes with formal guarantees and performs well in multitask settings with both discrete and continuous control tasks.
Tweet media one
1
0
4
@ted_moskovitz
Ted Moskovitz
3 years
Infinite thanks to my great collaborators @jparkerholder @aldopacchiano @MichaelArbel
0
1
4
@ted_moskovitz
Ted Moskovitz
3 years
(2/n) Recent approaches to off-policy deep RL typically emphasize either pessimistic or optimistic value estimation. But we hypothesize that different levels of optimism may be optimal for a given algorithm both across tasks and over the course of training:
Tweet media one
1
0
3
@ted_moskovitz
Ted Moskovitz
3 years
(3/n) Tactical Optimism and Pessimism (TOP) frames the choice between optimism and pessimism as a multi-armed bandit problem, with each arm corresponding to a different level of optimism.
Tweet media one
1
0
3
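A hedged sketch of that bandit framing: an EXP3-style bandit chooses among a few optimism levels each episode, and the pulled arm's weight is nudged up or down based on how the episode went. The arm values, the toy stand-in for an episode, the feedback signal (clipped change in return), and the hyperparameters are illustrative assumptions rather than the paper's exact algorithm.

import numpy as np

rng = np.random.default_rng(0)
optimism_levels = np.array([-1.0, 0.0, 1.0])   # hypothetical arms
weights = np.zeros_like(optimism_levels)        # log-weights over arms
eta, explore = 0.1, 0.1
n_arms = len(optimism_levels)
prev_return = 0.0

def episode_return(optimism):
    # stand-in for running one RL episode with value targets shifted by
    # `optimism`; here just a noisy toy function that mildly prefers optimism
    return 0.5 * optimism + rng.normal()

for episode in range(500):
    soft = np.exp(weights - weights.max())
    probs = (1 - explore) * soft / soft.sum() + explore / n_arms
    arm = rng.choice(n_arms, p=probs)
    ret = episode_return(optimism_levels[arm])
    feedback = np.clip(ret - prev_return, -1.0, 1.0)   # performance change
    weights[arm] += eta * feedback / probs[arm]        # EXP3-style update
    prev_return = ret

print("final arm probabilities:", probs.round(3))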
@ted_moskovitz
Ted Moskovitz
4 years
@aldopacchiano @pcastr @iclr_conf @MarlosCMachado @marcgbellemare @agarwl_ @kchorolab @jparkerholder @robinphysics these are both really cool--congrats! our paper extending BGRL is also going to be at @iclr_conf . I’m sure there are interesting connections to explore @MichaelArbel @fhuszar @ArthurGretton
1
1
3
@ted_moskovitz
Ted Moskovitz
3 years
@tw_killian @thegautamkamath Thanks! It’s definitely true that an important underlying theme of the paper is adaptivity in the face of task uncertainty @jparkerholder @aldopacchiano @MichaelArbel
0
0
3
@ted_moskovitz
Ted Moskovitz
4 years
(5/n) We also demonstrate, to our knowledge, the first detailed comparison between the WNG and the FNG, showing a clear, interpretable advantage in favor of the WNG on tasks where the optimal solution is deterministic.
Tweet media one
1
0
3
@ted_moskovitz
Ted Moskovitz
6 months
Really awesome work!
@Aaditya6284
Aaditya Singh
6 months
Ever wondered how your LLM splits numbers into tokens? and how that might affect performance? Check out this cool project I did with @djstrouse : Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. Read on 🔎⏬
10
33
181
0
1
3
@ted_moskovitz
Ted Moskovitz
4 years
(2/n) Motivated by the behavior-guided RL (BGRL) framework [1], which uses the Wasserstein distance (WD) to compare policies, we leverage an efficient estimator of the Wasserstein natural gradient (WNG) [2] to make use of the geometry induced by the WD and speed optimization.
Tweet media one
1
0
3
@ted_moskovitz
Ted Moskovitz
9 months
@scychan_brains @Aaditya6284 Thank you for including me!!
0
0
1
@ted_moskovitz
Ted Moskovitz
3 years
(6/n) TOP-TD3 and TOP-RAD are able to significantly increase performance on challenging tasks over a suite of advanced baselines, raising the state-of-the-art.
Tweet media one
Tweet media two
1
1
3
@ted_moskovitz
Ted Moskovitz
5 years
watch out @TensorFlow and @PyTorch , your days are clearly numbered
@DataChaz
DataChazGPT (not a bot)
5 years
How cool! You can build a Deep Neural Network right in @googlesheets !! (sheet's here 👉 ). Both worksheet & hands-on @Medium article are the work of the great @bwest87 . h/t @jeremyphoward @math_rachel @fastdotai #DeepLearning #ML
4
107
382
0
1
2
@ted_moskovitz
Ted Moskovitz
3 years
We believe that there’s a lot of cool work to be done in both understanding these methods and deriving better ones. Please reach out if you’d like to discuss ideas! We're looking forward to seeing everyone in beautiful V̶a̶l̶e̶n̶c̶i̶a̶ Gathertown! 10/n
0
0
2
@ted_moskovitz
Ted Moskovitz
4 years
(2/2) (There are probably mistakes here and there so please let me know if you find any)
0
1
2
@ted_moskovitz
Ted Moskovitz
3 years
These are tricky, nuanced questions, whose answers depend heavily on the choice of regularization penalty, structural commonalities among tasks, and the space in which regularization is applied. We focus on KL-based methods, as these are the most common in the literature. 4/n
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
Regularized policy optimization (RPO) trains a control policy (π) to maximize returns while penalizing deviations in behavior from a “default” policy (π0). For one-off tasks, this is well-understood, but there is less theoretical understanding of multitask learning. 2/n
Tweet media one
1
0
2
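A minimal sketch of the RPO objective described here, in a one-state (bandit) setting: a softmax policy does gradient ascent on expected reward minus alpha * KL(pi || pi0) against a fixed default policy pi0. The rewards, default policy, and alpha are illustrative assumptions; the paper's questions are about what pi0 must look like, and how it should be learned, for this regularizer to help across tasks.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rewards = np.array([1.0, 0.5, 0.0])    # per-action rewards (toy)
pi0 = np.array([0.6, 0.2, 0.2])        # fixed default policy
alpha = 0.5                            # regularization strength
theta = np.zeros(3)                    # control-policy logits

for _ in range(500):
    pi = softmax(theta)
    reg_reward = rewards - alpha * np.log(pi / pi0)   # reward shaped by the KL penalty
    grad = pi * (reg_reward - pi @ reg_reward)        # gradient of E[r] - alpha*KL wrt logits
    theta += 0.1 * grad

pi = softmax(theta)
print("regularized policy:", pi.round(3))
print("objective:", round(float(pi @ rewards - alpha * np.sum(pi * np.log(pi / pi0))), 3))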
@ted_moskovitz
Ted Moskovitz
3 years
Our paper is framed around two basic questions: 1) What properties does π0 need to have in order to improve optimization on new tasks? 2) What properties does a group of tasks need to share for a given algorithm to provide a measurable benefit? 3/n
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
@gershbrain @McAllesterDavid This strikes me as related to the use of “optimality” variables in the RL-as-inference framework as a means of deriving KL-regularized policy gradients (e.g., P(optimal|s,a) ) although also see . Could be wrong though!
0
0
2
@ted_moskovitz
Ted Moskovitz
3 years
In general, we think these results suggest an algorithm-dependent definition of “task family,” in that a group of tasks can only be considered a family from the perspective of a given algorithm if they share a form of structure identifiable by that algorithm. 9/n
1
0
2
@ted_moskovitz
Ted Moskovitz
4 years
(4/n) WNPG and WNES improve in performance and efficiency over advanced baselines on challenging tasks, especially when the optimization problem is ill-conditioned. These techniques can be combined with the regularization used in BGRL to further improve performance.
Tweet media one
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
Another interesting insight is that for methods which distill π0 from π, it’s often beneficial to generalization performance to delay updates of π0 within each task until π ~ π*. This can be seen as giving π0 a better “dataset” from which to learn. 7/n
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
(4/n) TOP samples a different level of optimism each episode, with the probability of that setting adjusted up or down based on the agent’s performance. During training, it dynamically adapts a distribution governing whether its value estimates are optimistic or pessimistic.
Tweet media one
1
0
2
@ted_moskovitz
Ted Moskovitz
2 years
@djstrouse *amazing* work!
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
From our results, we also derive a principled multitask RPO algorithm called Total Variation Policy Optimization (TVPO). We show that most KL-based algorithms can be seen as approximations to TVPO, and demonstrate TVPO’s strong performance on simple tasks. 8/n
Tweet media one
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
In other words, if each task in a family admits a π* which “agrees” with the other task π*s over some portion of the state space, then optimization can be improved. If not, then most common methods for learning π0 will fail to improve over a uniform default policy. 6/n
Tweet media one
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
To take a first step towards solutions, we derive error + convergence rate bounds based on the distance of π from the optimal policy (π*). We find that the type of structure to which most common multitask RPO methods are sensitive is similarity in the space of π*s. 5/n
1
0
2
@ted_moskovitz
Ted Moskovitz
2 years
@bodonoghue85 @TZahavy @kevinjmiller10 yes, I believe it has the same limiting behavior as phd student ascent
0
0
2
@ted_moskovitz
Ted Moskovitz
4 years
(3/n) We introduce two novel optimization techniques for RL: Wasserstein natural policy gradients (WNPG) and Wasserstein natural evolution strategies (WNES).
1
0
2
@ted_moskovitz
Ted Moskovitz
3 years
@djstrouse @bodonoghue85 @AdaptiveAgents thanks! one question we're interested in is whether the process of performance maximization leads the agent to develop differential preferences for high/low aleatoric vs. epistemic uncertainty @jparkerholder @aldopacchiano @MichaelArbel
1
0
1
@ted_moskovitz
Ted Moskovitz
2 years
@RobertTLange Thanks Rob!!
0
0
1
@ted_moskovitz
Ted Moskovitz
3 years
(5/n) We apply TOP to two different state-of-the-art methods for continuous control: TD3 [1], for state-based control on Mujoco, and RAD [2], for pixel-based control on the DeepMind Control Suite.
1
0
1
@ted_moskovitz
Ted Moskovitz
2 years
@bodonoghue85 @kevinjmiller10 would have been too mainstream
1
0
1
@ted_moskovitz
Ted Moskovitz
3 years
@djstrouse @bodonoghue85 @AdaptiveAgents @jparkerholder @aldopacchiano @MichaelArbel This is definitely probable w/in our framework--looking at the visit distribution for (s,a) pairs in which the critic distributions are far apart but both quantile sets are narrowly clustered would be one indicator; adding some form of learned risk-sensitivity would also be cool!
0
0
1
@ted_moskovitz
Ted Moskovitz
2 years
Q: How can agents trade off exploiting structure across past tasks against overfitting when they're uncertain about future tasks? We derive a policy optimization method with a learned regularizer trained to distill optimal behaviors which is itself regularized to be simple.
Tweet media one
0
0
1