Luke Metz Profile
Luke Metz

@Luke_Metz

9,859
Followers
1,783
Following
51
Media
322
Statuses

@openai

San Francisco, CA
Joined October 2012
@Luke_Metz
Luke Metz
4 years
We have a new paper on learned optimizers! We used thousands of tasks (and a lot of compute 😬) to train general purpose learned optimizers that perform well on never-before-seen tasks, and can even train new versions of themselves. 1/8
16
199
977
@Luke_Metz
Luke Metz
3 years
New paper: when to use gradients. DL researchers often compute derivatives through just about everything (physics simulators, optimization procedures, renderers). Sometimes these gradients are useful, other times they are not. We explore why. 1/7
13
158
794
@Luke_Metz
Luke Metz
2 years
Understanding the performance of neural nets as a function of model size is getting more important as big models become expensive to train. Yet predicting the performance of a large model from smaller ones can be misleading. An experiment showing how it went wrong 👇
Tweet media one
Tweet media two
11
91
540
@Luke_Metz
Luke Metz
2 years
Tired of having to manually tune optimizers? We’re excited to release VeLO, the first hparam-free, super versatile learned optimizer that outperforms hand-designed optimizers on real world problems. It was trained on thousands of TPU months of compute. 1/N
10
92
500
@Luke_Metz
Luke Metz
3 years
New article about two of my favorite things: meta-learning and Jax! This explores the complex meta-loss landscapes which emerge from unrolled optimization procedures and shows why gradient-based hyperparameter tuning can be hard even for simple problems.
Tweet media one
3
78
406
@Luke_Metz
Luke Metz
8 months
OpenAI is nothing without its people.
8
19
379
@Luke_Metz
Luke Metz
2 years
Memory, compute, & perf tradeoff in learned optimizers. Learned optimizers replace hand-designed rules like SGD/Adam with learned functions, i.e., a neural net which takes transformed gradients as input and outputs weight updates. How do we design this NN? 1/5
Tweet media one
5
48
322
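A minimal sketch of the idea in this thread, assuming a tiny per-parameter MLP over gradient and momentum features (the function and feature names here are illustrative, not the paper's actual architecture):

```python
import jax
import jax.numpy as jnp

def init_lopt_params(key, n_features=2, hidden=32):
    # Hypothetical tiny MLP: per-parameter features -> scalar weight update.
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (n_features, hidden)) * 0.1,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, 1)) * 0.1,
    }

def lopt_update(lopt_params, grad, momentum):
    # Features are stacked per parameter; the MLP is applied element-wise.
    feats = jnp.stack([grad, momentum], axis=-1)            # (..., 2)
    h = jnp.tanh(feats @ lopt_params["w1"] + lopt_params["b1"])
    return (h @ lopt_params["w2"])[..., 0] * 0.01           # small output scale for stability

def inner_step(lopt_params, theta, momentum, grad, decay=0.9):
    # One step of training the *target* task's weights theta with the learned optimizer.
    momentum = decay * momentum + (1 - decay) * grad
    theta = theta - lopt_update(lopt_params, grad, momentum)
    return theta, momentum
```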
@Luke_Metz
Luke Metz
6 years
Check out our new work on Learning Unsupervised Learning Rules! Done with my amazing collaborators @niru_m @thisismyhat @jaschasd
Tweet media one
3
66
243
@Luke_Metz
Luke Metz
4 years
Excited to share our new work! We introduce a dataset of tasks for learned optimizer research. As an example application of this dataset, we meta-train lists of optimizer hyperparameters that work well on a diverse set of tasks. 1/4
Tweet media one
Tweet media two
Tweet media three
3
68
236
@Luke_Metz
Luke Metz
3 years
[Micro paper] We train learned optimizers using other randomly initialized learned optimizers in an evolutionary process. This creates a positive feedback loop: learning to optimize enables optimizers to optimize themselves faster, accelerating training.
Tweet media one
Tweet media two
2
37
183
@Luke_Metz
Luke Metz
3 years
Interested in meta-learning and learned optimizers? Our team at Google Brain is hiring a full time researcher! Feel free to reach out to myself or @jaschasd for more information.
2
36
168
@Luke_Metz
Luke Metz
6 years
Check out our new work on meta-learning optimizers! We explore problems when meta-training with gradients, propose a solution using variational optimization, and train task-specific learned optimizers that are 5x faster than Adam!
Tweet media one
3
38
144
@Luke_Metz
Luke Metz
8 months
❤️
@sama
Sam Altman
8 months
i love the openai team so much
5K
4K
73K
1
3
115
@Luke_Metz
Luke Metz
4 years
Really cool paper by @wojczarnecki et al. on the geometry of games. They find that many games have strategies which form long non-transitive cycles (think rock-paper-scissors) but also have large skill caps where strong, transitive strategies will always win.
Tweet media one
2
20
112
@Luke_Metz
Luke Metz
3 years
Interested in computing gradients through unrolled computation graphs? Come see our paper at ICML! We construct an ES-like unbiased estimator and use it for hyperparameter optimization, RL, and meta-learning. Talk: Today 6pm PT (now) Poster: Today 9-11pm PT
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
29
105
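For context, a minimal sketch of the vanilla antithetic ES estimator this line of work builds on (the paper's estimator adds more machinery; the toy meta-loss below is purely illustrative):

```python
import jax
import jax.numpy as jnp

def es_grad(loss_fn, theta, key, sigma=0.1, n_pairs=64):
    # Antithetic ES estimate of the gradient of the Gaussian-smoothed loss.
    eps = jax.random.normal(key, (n_pairs,) + theta.shape)
    f_plus = jax.vmap(lambda e: loss_fn(theta + sigma * e))(eps)
    f_minus = jax.vmap(lambda e: loss_fn(theta - sigma * e))(eps)
    weights = (f_plus - f_minus) / (2.0 * sigma)             # (n_pairs,)
    flat = jnp.mean(weights[:, None] * eps.reshape(n_pairs, -1), axis=0)
    return flat.reshape(theta.shape)

def meta_loss(lr):
    # Toy unrolled computation graph: 10 SGD steps on a quadratic, with
    # per-coordinate learning rates as the meta-parameters.
    w = jnp.ones(2)
    for _ in range(10):
        w = w - lr * 2.0 * w          # gradient of ||w||^2 is 2w
    return jnp.sum(w ** 2)

g = es_grad(meta_loss, jnp.array([0.1, 0.2]), jax.random.PRNGKey(0))
```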
@Luke_Metz
Luke Metz
6 years
Just published second robotic arm build log post talking about electronics and firmware!
Tweet media one
1
16
66
@Luke_Metz
Luke Metz
6 years
Another post on my DIY robot arm! Trying out model predictive control with a learned dynamics model to control this thing.
Tweet media one
Tweet media two
3
16
63
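A rough sketch of what "MPC with a learned dynamics model" can look like, assuming simple random-shooting planning; `dynamics_fn` and `cost_fn` are placeholders for the learned model and the task cost, not code from the build log:

```python
import jax
import jax.numpy as jnp

def plan(dynamics_fn, cost_fn, state, key, horizon=15, n_candidates=256, action_dim=4):
    # Random-shooting MPC: sample candidate action sequences, roll each out
    # through the learned dynamics model, execute the first action of the best one.
    actions = jax.random.uniform(
        key, (n_candidates, horizon, action_dim), minval=-1.0, maxval=1.0)

    def rollout_cost(action_seq):
        def step(s, a):
            s_next = dynamics_fn(s, a)        # learned dynamics model (assumed given)
            return s_next, cost_fn(s_next, a)
        _, costs = jax.lax.scan(step, state, action_seq)
        return jnp.sum(costs)

    total_costs = jax.vmap(rollout_cost)(actions)
    return actions[jnp.argmin(total_costs), 0]
```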
@Luke_Metz
Luke Metz
8 months
❤️
@ilyasut
Ilya Sutskever
8 months
I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.
7K
4K
33K
2
3
61
@Luke_Metz
Luke Metz
6 years
First post of hopefully many documenting a side project: building a robotic arm to make me tea! Designed in @Onshape and printed on a @lulzbot3D .
Tweet media one
6
13
60
@Luke_Metz
Luke Metz
4 years
In my favorite experiment, we show how general these methods are by using them to train new versions of themselves! (This is similar to self-hosting compilers -- compilers which are written in the language that they compile.) 7/8
Tweet media one
2
0
56
@Luke_Metz
Luke Metz
5 years
Our work exploring the use of learned optimizers to make more robust image models is on arXiv! We find that in some cases learned optimizers are capable of learning more robust image classifiers!
Tweet media one
2
10
56
@Luke_Metz
Luke Metz
3 years
Thanks Daniel Freeman( @bucketofkets ), @sschoenholz and @TalKachman for the great collaboration. This was a really fun project to work on! Key takeaway: take gradients with care. Just because you can backprop doesn’t mean you should! 7/7
1
0
46
@Luke_Metz
Luke Metz
4 years
In the same way learned features took over computer vision, we believe ML algorithms will be replaced with learned components. We shift away from hand designed optimizers (SGD, Adam) to learned optimizers parameterized by neural nets and trained to optimize neural nets. 2/8
1
2
47
@Luke_Metz
Luke Metz
2 years
Persistent Evolution Strategies (PES) is one of the most useful meta-learning techniques we discovered last year. It enables truncated training (like truncated backprop) which speeds up meta-training without any of the short horizon bias that can ruin performance.
@PaulVicol
Paul Vicol
2 years
Our ICML 2021 best paper introducing Persistent Evolution Strategies is now even better with new analysis and a Colab notebook showing the method. Check it out! With @jaschasd and @Luke_Metz ! Colab: Git:
2
32
149
0
7
47
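A paraphrased sketch of the PES idea: accumulate the perturbations applied across truncations, so the estimates from short unrolls remain unbiased for the smoothed full-horizon objective (this is not the authors' reference implementation; `unroll_fn` is a placeholder):

```python
import jax
import jax.numpy as jnp

def pes_step(unroll_fn, theta, states, accum_eps, key, sigma=0.01):
    # One truncation of a PES-style estimator.
    #   states:    per-sample inner states carried across truncations
    #   accum_eps: running sum of perturbations each sample has seen, shape (N, dim)
    eps = sigma * jax.random.normal(key, accum_eps.shape)   # fresh noise this truncation
    accum_eps = accum_eps + eps

    # unroll_fn(perturbed_theta, state) -> (loss over this truncation, new state)
    losses, states = jax.vmap(lambda s, e: unroll_fn(theta + e, s))(states, eps)

    # Gradient estimate for the smoothed objective uses the *accumulated* noise,
    # which is what removes the usual truncation (short-horizon) bias.
    grad_est = jnp.mean(losses[:, None] * accum_eps, axis=0) / sigma ** 2
    return grad_est, states, accum_eps
```

In practice, antithetic pairs and a baseline are typically layered on top of this to reduce variance.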
@Luke_Metz
Luke Metz
2 years
Want to experiment with a new learned optimizer architecture? We released a library for learned optimizer research. Would love to hear your feedback! This work is impossible without @bucketofkets @jmes_harrison @niru_m @jaschasd 5/5
0
4
42
@Luke_Metz
Luke Metz
5 years
Thanks all for coming to our talk! Feel free to stop by our poster: "Meta-Learning Update Rules for Unsupervised Representation Learning" at #70 at 4:30! Happy to answer any questions! Thanks to my awesome coauthors: @niru_m @thisismyhat @jaschasd #ICLR2019
Tweet media one
0
9
39
@Luke_Metz
Luke Metz
4 years
This was a really fun project to be involved in. It turns out training each layer of a neural network independently works pretty well. More importantly, it lets us scale beyond data parallelism!
@MishaLaskin
Misha Laskin
4 years
Excited to share a paper on local updates as an alternative to global backprop, co-led with @Luke_Metz + @graphcoreai @GoogleAI & @berkeley_ai . tl;dr - Local updates can improve the efficiency of training deep nets in the high-compute regime. 👉 1/N
1
22
115
0
1
35
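A sketch of the local-update idea, assuming per-layer auxiliary losses with stop_gradient between layers (the paper's exact losses and architecture differ):

```python
import jax
import jax.numpy as jnp

def forward_with_local_losses(params, x, y):
    # Each layer has its own auxiliary readout and loss; stop_gradient blocks the
    # global backward pass, so every layer trains on a purely local signal.
    total_local_loss = 0.0
    h = x
    for layer in params:                      # list of dicts: {"w", "b", "readout"}
        h = jnp.tanh(h @ layer["w"] + layer["b"])
        logits = h @ layer["readout"]         # per-layer linear readout
        total_local_loss += jnp.mean((logits - y) ** 2)
        h = jax.lax.stop_gradient(h)          # next layer sees activations, not gradients
    return total_local_loss

# jax.grad(forward_with_local_losses)(params, x, y) then gives each layer an
# update driven only by its own local loss.
```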
@Luke_Metz
Luke Metz
3 years
We show that when computing a gradient through an iterative system, we need to compute terms which consist of a product of the state transition Jacobian. This product is what causes issues. If the Jacobian's eigenvalues are > 1, gradients explode. < 1, gradients vanish 😱 2/7
Tweet media one
3
1
34
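A tiny, self-contained illustration of this point: for a linear system the per-step Jacobian is a*I, so the T-step gradient contains the factor a^T and either vanishes or explodes (toy example, not from the paper):

```python
import jax
import jax.numpy as jnp

# Linear toy system: s_{t+1} = a * s_t, so the state-transition Jacobian is a * I
# and the gradient through T steps contains the product a^T.
def loss_after_T(a, s0, T=50):
    s = s0
    for _ in range(T):
        s = a * s
    return jnp.sum(s ** 2)

s0 = jnp.array([0.1, -0.3])
for a in (0.9, 1.1):
    g = jax.grad(loss_after_T)(a, s0)
    # |a| < 1: the Jacobian product shrinks each step and the gradient vanishes;
    # |a| > 1: it grows each step and the gradient explodes.
    print(f"a={a}: dL/da = {float(g):.3e}")
```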
@Luke_Metz
Luke Metz
2 years
We believe the resulting optimizer we trained is a generally useful tool. The pre-trained optimizers + the code to run them can be found at . All the code used to train these optimizers is also open source!
1
4
35
@Luke_Metz
Luke Metz
4 years
The resulting learned optimizer, which requires no hyperparameter tuning, outperforms modestly tuned hand-designed methods on the majority of our tasks. 5/8
Tweet media one
3
0
34
@Luke_Metz
Luke Metz
3 years
Getting rid of gradients leads to faster training?!? While this seems counterintuitive, consider trying to optimize a wiggly function convolved with a Gaussian. The more wiggly the function, the higher the grad variance. With blackbox/evolution, variance remains constant. 6/7
Tweet media one
2
3
34
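A small numerical illustration of this point (the wiggly function and constants are made up): a reparameterization-style estimator differentiates through each noisy sample, so its variance grows with the wiggliness k, while the ES estimator only uses function values and its variance stays roughly flat:

```python
import jax
import jax.numpy as jnp

def f(x, k):
    return x ** 2 + 0.1 * jnp.sin(k * x)      # "wiggliness" grows with k

def grad_variances(k, x=1.0, sigma=0.3, n=10_000):
    eps = sigma * jax.random.normal(jax.random.PRNGKey(0), (n,))
    # Reparameterization-style estimate: differentiate through each noisy sample.
    reparam = jax.vmap(jax.grad(f), in_axes=(0, None))(x + eps, k)
    # ES / blackbox estimate of the smoothed gradient: uses function values only.
    es = jax.vmap(f, in_axes=(0, None))(x + eps, k) * eps / sigma ** 2
    return jnp.var(reparam), jnp.var(es)

for k in (1.0, 10.0, 100.0, 1000.0):
    v_rep, v_es = grad_variances(k)
    print(f"k={k:7.0f}: grad var {float(v_rep):12.2f}   ES var {float(v_es):8.2f}")
```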
@Luke_Metz
Luke Metz
4 years
We explore a new learned optimizer architecture: a hierarchical LSTM. It has access to both training loss and validation loss of the target task, which allows for dynamic regularization. 3/8
Tweet media one
2
1
31
@Luke_Metz
Luke Metz
4 years
We find the number of tasks we train the learned optimizer on to be critical. More tasks lead to better optimizers, and we ultimately train on a dataset of ~6k tasks. 4/8
Tweet media one
1
1
31
@Luke_Metz
Luke Metz
3 years
We demonstrate exploding gradients in physics simulation, molecular dynamics, and learned optimization. In the absence of noise, the loss surface can have high curvature, causing large gradients. While averaging smooths the loss, the grad variance still grows exponentially. 3/7
Tweet media one
Tweet media two
Tweet media three
1
0
31
@Luke_Metz
Luke Metz
4 years
Thanks to my wonderful collaborators: @niru_m , @bucketofkets , @poolio , @jaschasd 🙏 8/8
2
0
29
@Luke_Metz
Luke Metz
2 years
VeLO is a learned optimizer. Instead of designing an update rule by hand as commonly done (e.g. Adam, SGD), VeLO is a tiny neural network that takes in gradient values, and outputs weight updates.
Tweet media one
1
3
29
@Luke_Metz
Luke Metz
4 years
Finally, we have open sourced the code for the tasks as well as learning curves for ~29 million models trained with different optimizers and hyperparameters. 3/4
1
10
28
@Luke_Metz
Luke Metz
4 years
We are releasing these lists of optimizer hyperparameters in TF, PyTorch, and Jax as a drop-in replacement for existing optimizers. Give it a try and let us know how it goes! 2/4
Tweet media one
2
7
23
@Luke_Metz
Luke Metz
6 years
Axes 2 and 3 on my robot arm now done with 20x more powerful motors!
Tweet media one
1
2
24
@Luke_Metz
Luke Metz
2 years
Surprise: a mere 4-hidden-unit MLP outperforms the best hand-designed optimizers! In some settings, this tiny MLP is even cheaper computationally. MLPs generally run slower than a simple Adam update rule, but we can make them fast with automatic kernel fusion thanks to JAX and XLA. 3/5
Tweet media one
1
3
23
@Luke_Metz
Luke Metz
4 years
On larger-scale tasks, these optimizers have comparable performance to learning-rate-tuned Adam/momentum despite never seeing similar tasks at outer-training time. For example, below is a small ResNet on CIFAR-10. 6/8
Tweet media one
1
0
23
@Luke_Metz
Luke Metz
2 years
To train the weights of VeLO, we apply it to around 17 billion different small scale tasks, using (approx) gradient descent to find the lowest loss across all these tasks. This takes around 40 days with as much compute as we could get our hands on scattered across the globe.
Tweet media one
1
0
23
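Schematically, the outer loop might look like the sketch below; `sample_task` and `unroll_loss` are hypothetical stand-ins for the task distribution and the unrolled training loss, and the antithetic-ES meta-gradient is just one way to realize "(approx) gradient descent":

```python
import jax
import jax.numpy as jnp

def meta_train(theta, sample_task, unroll_loss, meta_steps=1000, meta_lr=1e-3, sigma=0.01):
    # theta: flat vector of learned-optimizer weights (for simplicity).
    key = jax.random.PRNGKey(0)
    for _ in range(meta_steps):
        key, k_task, k_es = jax.random.split(key, 3)
        task = sample_task(k_task)                 # draw a random small-scale task
        # Antithetic-ES estimate of the meta-gradient of the unrolled training loss.
        eps = sigma * jax.random.normal(k_es, theta.shape)
        g = (unroll_loss(theta + eps, task) - unroll_loss(theta - eps, task)) / (2 * sigma ** 2) * eps
        theta = theta - meta_lr * g                # "(approx) gradient descent"
    return theta
```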
@Luke_Metz
Luke Metz
2 years
Finally I want to thank all my amazing collaborators! In particular, @jmes_harrison , @bucketofkets and @jaschasd ! This has been a long project and I am glad to finally release it to the world!
2
0
23
@Luke_Metz
Luke Metz
3 years
I am also thrilled to say this paper got the ICML Outstanding Paper Award! Thanks @icmlconf ! Huge props to @PaulVicol who led this work as well as our other collaborator @jaschasd .
2
0
23
@Luke_Metz
Luke Metz
2 years
Depending on how you compute the average, VeLO is 11x faster than tuned Adam on average -- way faster than every other existing optimizer we tested!
Tweet media one
1
0
21
@Luke_Metz
Luke Metz
3 years
So what to do about this? A few approaches. One is to use truncated backprop through time. While this somewhat works, it’s finicky and less performant than simply training *without* gradients. 5/7
Tweet media one
1
0
20
@Luke_Metz
Luke Metz
6 years
Upgrading mechanical design for my robot arm. New base finally done: . Next 4 axes coming soon!
Tweet media one
1
2
20
@Luke_Metz
Luke Metz
2 years
2. VeLO also learned to pick different step sizes for different layers of a neural network.
Tweet media one
1
3
19
@Luke_Metz
Luke Metz
2 years
VeLO is much more versatile than previous learned optimizers. Once trained, it can be applied to optimize just about any neural network. How do we know? In general, evaluating optimizers is hard. We spend the majority of the paper analyzing different views of this question.
1
0
19
@Luke_Metz
Luke Metz
5 years
Started making a home made motion capture system to control my robot arm! This post contains a bit of hardware + some prototype software which treats the problem as probabilistic inference!
Tweet media one
Tweet media two
0
1
18
@Luke_Metz
Luke Metz
3 years
Looking at the Jacobian of 2 different initializations, we see that the stable initializations have smaller eigenvalues, and thus more controlled gradients. 4/7
Tweet media one
1
0
18
@Luke_Metz
Luke Metz
5 years
I will be presenting our work "Understanding and correcting pathologies in the training of learned optimizers" today at 11-11:20 in Hall A, with a poster 6:30-9:00! Come by and say hi! #ICML2019
Tweet media one
1
3
17
@Luke_Metz
Luke Metz
2 years
Even though VeLO was only trained on small-scale tasks, it works out of the box on much larger-scale tasks. It helps models converge faster than tuned Adam on a variety of image and speech tasks from MLCommons without any tuning required.
Tweet media one
1
1
17
@Luke_Metz
Luke Metz
2 years
How does VeLO do this? Here are 4 things VeLO has learned to do. 1. VeLO learned to take the length of training into account when it optimizes. Depending on how long we tell the optimizer to train for, it will achieve lower losses at different points.
Tweet media one
1
0
17
@Luke_Metz
Luke Metz
9 years
Koi fish word embeddings t-SNE!
Tweet media one
1
6
16
@Luke_Metz
Luke Metz
5 years
Update on the electronics for my DIY robotic arm! Now with a custom PCB!
Tweet media one
Tweet media two
Tweet media three
1
1
16
@Luke_Metz
Luke Metz
2 years
3. It can make use of much larger batches than existing methods without seeing degradation in performance.
Tweet media one
1
0
14
@Luke_Metz
Luke Metz
7 months
New gradient estimation technique led by the fantastic @OscarLi101 ! It provides low-variance estimates of gradients of unrolled, or iterative, computation graphs such as those found in RL, learned optimizers, and meta-optimization. If you’re at NeurIPS, go check out the poster!
@OscarLi101
Oscar Li
7 months
Introducing: Noise-Reuse Evolution Strategies, an unbiased, online, memory efficient, variance-reduced gradient estimator that can outperform many other methods (including Backprop) on some particularly challenging unrolled computation graph problems. A Thread🧵
1
0
15
1
1
14
@Luke_Metz
Luke Metz
2 years
4. VeLO can also automatically adapt the step sizes it takes depending on the size of the model being trained.
Tweet media one
1
0
13
@Luke_Metz
Luke Metz
2 years
VeLO can work widely out of distribution too, providing speedups over tuned baselines in computational physics domains, game-playing decision transformers, NeRF, object detection, distillation, and more (without any tuning).
Tweet media one
Tweet media two
Tweet media three
1
0
13
@Luke_Metz
Luke Metz
3 years
This was a really fun project combining learned optimizers and physics led by @amilmerchant ! We train learned optimizers that learn to hop around multi-minimum loss surfaces which arise when optimizing atom positions to minimize energy.
@ekindogus
Ekin Dogus Cubuk
3 years
New paper on ML & physics at ICML! Learn2Hop: Learned Optimization on Rough Landscapes With Applications to Atomic Structural Optimization We adapt learned optimizers for atomic structural optimization, and compare to baselines from physics. abs:
Tweet media one
3
16
94
0
0
13
@Luke_Metz
Luke Metz
2 years
In a smaller scale setting, where we can run many more experiments, VeLO also performs well. On a benchmark task suite of 83 problems, VeLO outperforms learning rate tuned Adam on ALL tasks, and is 2x faster on over half of them! (without any tuning)
Tweet media one
1
1
11
@Luke_Metz
Luke Metz
3 years
We believe positive feedback loops such as this represent a promising way to achieve more intelligent systems. Thanks to my wonderful collaborators, @bucketofkets , @niru_m , and @jaschasd for working with me on this!
1
1
11
@Luke_Metz
Luke Metz
8 years
Come check out our poster on Unrolled Generative Adversarial Networks at @NipsConference Workshop on Adversarial Training!
Tweet media one
0
5
11
@Luke_Metz
Luke Metz
4 years
Thanks to my awesome collaborators @niru_m , @ruoxi_cc , Daniel Freeman ( @bucketofkets ), @poolio @jaschasd ! 4/4
1
1
10
@Luke_Metz
Luke Metz
2 years
So far we’ve trained & evaluated learned optimizers on a single task. Can we use learned optimizers on tasks never before seen? Likely, but it depends on a lot of factors, including the learned optimizer’s model architecture and especially the random seed. 4/5
Tweet media one
1
0
8
@Luke_Metz
Luke Metz
8 years
Samples from the CIFAR-10 birds subset using my CPPN-GAN, rendered at 4x resolution. Still underfitting horribly :(.
Tweet media one
2
2
7
@Luke_Metz
Luke Metz
8 years
Playing around with 1D GAN
1
1
7
@Luke_Metz
Luke Metz
2 years
When designing optimizers, one must balance optimization performance with memory & compute overhead. E.g. Adam requires more memory but performs better than SGD. We explore this tradeoff across different features for learned optimizers. 2/5
Tweet media one
1
0
7
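A back-of-the-envelope way to see the memory side of this tradeoff (the slot counts for the learned optimizer are illustrative, not from the paper):

```python
def optimizer_memory_gb(n_params, extra_slots_per_param, bytes_per_value=4):
    # Weights plus per-parameter optimizer state, stored in float32.
    return n_params * (1 + extra_slots_per_param) * bytes_per_value / 1e9

n = 1_000_000_000  # a 1B-parameter model
for name, slots in [("SGD", 0), ("SGD + momentum", 1), ("Adam", 2),
                    ("learned opt w/ 6 accumulators", 6)]:
    print(f"{name:30s} {optimizer_memory_gb(n, slots):5.1f} GB")
```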
@Luke_Metz
Luke Metz
6 years
We also explore meta-training against validation loss, creating learned optimizers that generalize better than existing hand-designed methods! Thanks to my amazing collaborators @niru_m @JvNixon Daniel Freeman, @jaschasd !
0
0
6
@Luke_Metz
Luke Metz
4 years
They show this on real games ranging from Tic-Tac-Toe, to Go, to StarCraft!
Tweet media one
1
0
5
@Luke_Metz
Luke Metz
8 years
Some MNIST samples rendered at 4x resolution from a CPPN-GAN. Thanks @hardmaru for the idea!
Tweet media one
0
0
4
@Luke_Metz
Luke Metz
6 years
@Knix01133550 @hardmaru The figures above are on a test task from the same task distribution. Both the train and validation data were never seen while training the optimizer. Additionally, at test time the optimizer only has access to the training data from this task, not the validation data. 1/5
1
0
4
@Luke_Metz
Luke Metz
4 years
Madison is awesome! I am looking forward to reading more of his posts.
@pragmaticml
Madison May (e/ia)
4 years
Oops, I started a ML blog: First posts are on JAX, Einsum Notation, and Visualizing ROC Curves. Next up is a survey of methods for attending to long seqs, and a detailed dive into the Reformer architecture!
3
48
267
1
0
3
@Luke_Metz
Luke Metz
8 years
They kinda look like cifar10 samples right? (cppn-gan rendered at 128x128, each quad is a different time / model)
Tweet media one
1
1
3
@Luke_Metz
Luke Metz
5 years
This was with my awesome collaborators: @niru_m , @JvNixon , @bucketofkets , and @jaschasd !
0
0
3
@Luke_Metz
Luke Metz
4 years
@nikoliazekter Understanding how these optimizers work is of great interest to us! We have explored a bit more than what's in the paper, but found the behavior of these optimizers is complex and hard to pin down, as there are so many moving pieces. More soon hopefully!
0
0
3
@Luke_Metz
Luke Metz
4 years
@TheGradient In this work we focus on the mean test loss over the course of training. One cool thing about learned optimizers though is we are free to pick which measure of success we care about and train an optimizer to specifically target it.
0
0
2
@Luke_Metz
Luke Metz
4 years
@hardmaru This book is great! I highly recommend!
0
0
2
@Luke_Metz
Luke Metz
4 years
@CianEastwood This comparison is not quite accurate because we found having some communication across parameters and across layers is useful for optimization performance. This is where the per-tensor LSTM comes in. Make sense? 3/3
1
0
2
@Luke_Metz
Luke Metz
6 years
@JeffDean @Onshape @lulzbot3D Haha we will see. It needs to move first before being called too :).
0
0
2
@Luke_Metz
Luke Metz
3 years
0
0
2
@Luke_Metz
Luke Metz
3 years
@mat_kelcey So much better. Thanks for the tip!
0
0
1
@Luke_Metz
Luke Metz
6 years
@mat_kelcey @hardmaru @Onshape @lulzbot3D There is actually a little bit of RL going on already! Model-based, and at a lower level (move joint here) as opposed to at a task level.
0
0
2
@Luke_Metz
Luke Metz
3 years
@pabbeel @IEEEorg This is great news! Congratulations!
0
0
2
@Luke_Metz
Luke Metz
6 years
@soumithchintala That is craaaazy cheap -- CUDA + a 4.5 kg payload!?! Seems like a much better platform than most hobby robotics platforms around. I too am curious to see how this develops.
0
0
1
@Luke_Metz
Luke Metz
4 years
@SussilloDavid This looks awesome! Congrats! I am very excited to see what comes out of this project.
1
0
1
@Luke_Metz
Luke Metz
2 years
@Pranjal_d_vyas 1. Not for this task. This probably changes how things scale and will surely result in better perf. 2. I would expect that this also matters / behavior will change with scale in ways that are related to the particular dataset being used.
1
0
1
@Luke_Metz
Luke Metz
6 years
@colinraffel @Onshape @lulzbot3D Good to know, thanks! Let's see how many concessions I have to make to make this actually work.
0
0
1
@Luke_Metz
Luke Metz
4 years
@pragmaticml Ya, I am starting to more and more! Jax is really great in my experience and only getting better!
0
0
1
@Luke_Metz
Luke Metz
5 years
@colinraffel Congrats!!!
1
0
1
@Luke_Metz
Luke Metz
6 years
@jasonwebb @Onshape @lulzbot3D Thanks! First step is to make it joint angle controlled (as opposed to torque), and depending on how well that goes then start to think about higher level controls. Firmware thus far is a few hundred lines of C -- receive packets over serial and execute. Update post coming soon!
1
0
1
@Luke_Metz
Luke Metz
8 years
@vintermann Yep! It was those figures that showed differences / weaknesses that made me want to see it in 1D.
0
0
0
@Luke_Metz
Luke Metz
4 years
@josephdviviano Thanks! Hmm, it seems to have been removed from master for some reason. I found a copy here though:
0
0
1