I've proven the strangest result of my career.
The classic idea that gradient descent's rate is best with a constant stepsize of 1/L is wrong. The idea that we need stepsizes in (0, 2/L) for convergence is wrong.
Periodic long steps are better, provably.
The new strangest results of my career (with Kevin Shu and Alex Wang).
Gradient descent can accelerate (in big-O!) by just periodically taking longer steps.
No momentum needed to beat O(1/T) in smooth convex opt!
Paper: [1/3]
Perfect shelf for my collection of induced matrix norm unit balls. The individual stand colors denote whether the norm can be computed in polynomial time, is NP-hard, or is an open question.
Details:
We can do better than this state-of-the-art constant stepsize guarantee with a tiny adjustment: just alternate between a long step (longer than we can guarantee descent for) and a short step.
Just alternating 2.9, 1.5, 2.9, 1.5, ... is provably 10% faster than the prior best known rate!!
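A minimal sketch of the schedule on a toy quadratic of my own choosing (the paper's certified rates come from SDP proofs, not this demo). It also shows why the short steps matter: the long step 2.9/L alone would diverge.

```python
import numpy as np

def gradient_descent(grad, x0, L, pattern, iters):
    """Plain gradient descent x_{k+1} = x_k - (h_k/L) grad(x_k),
    cycling through the given stepsize pattern h_k."""
    x = np.array(x0, dtype=float)
    for k in range(iters):
        x = x - (pattern[k % len(pattern)] / L) * grad(x)
    return x

# Toy L-smooth convex objective f(x) = 0.5 x^T A x with L = lambda_max(A) = 1.
A = np.diag([1.0, 0.1, 0.01])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = np.ones(3)

x_cycle = gradient_descent(grad, x0, 1.0, [2.9, 1.5], 200)  # alternating pattern
x_long = gradient_descent(grad, x0, 1.0, [2.9], 200)        # long steps only

assert f(x_cycle) < 1e-3   # the 2.9, 1.5, ... cycle still converges
assert f(x_long) > 1e6     # a constant 2.9/L (> 2/L) blows up
```

The short 1.5 steps periodically cancel the overshoot that a 2.9 step causes along the large-curvature directions, which is exactly what makes the pattern stable.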
A nice mathematical puzzle🧐
If you take a 4-norm ball and cut it carefully, you will find a two-norm ball. 3D printed visual evidence below.
The puzzle: Why does this happen and how much more generally does it happen?
(This question was first posed to me by Pablo Parrilo)
I believe every optimization method should be understood both in primal and dual terms.
I was annoyed when teaching basic subgradient methods that I did not have a clear dual story to tell my students.
Cue a paper giving this missing equivalent dual story
This effect even manifests in actual numerics. In some sampled least squares problems, we see these faster rates from our theory occur perfectly
Thanks for reading! I'm still trying to get my head fully around this long-step acceleration, so I hope I've written clearly enough here
My 3D printer has been working overtime this summer printing every norm ball I can think of. At the moment I have 70 interesting norm balls.
Below are perhaps the most common norms: p-norm balls with dual pairs (1,inf), (4/3,4), (3/2,3), (2,2)
Details:
Cycling through a pattern longer than length two gives increasingly big payoffs!!
A cycle of length 127, periodically taking a long step of length 370, gets a rate almost 3x faster than prior best!!
The proofs use neat computer-assisted semidefinite programming; see the paper :)
In minimax optimization, first-order methods can locally converge to limit points, cycle, or (if very lucky) globally converge (plots below show these behaviors for the proximal point method).
Some of my work on these behaviors just appeared in Math Programming ().
Since tweeting about faster gradient descent via long steps, I learned of an exceptional prior work needing credit:
Jason Altschuler's Master's thesis (advised by Pablo Parrilo) broke ground here 5 years ago!!
For smooth strongly convex optimization, he gave optimal faster
How to (optimally) design a fair/robust syllabus for optimization classes:
Below is the polyhedron of all rubrics I find reasonable, which I then maximize for each student. As a result, evaluation focuses on their individual areas of excellence.
Details:
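A tiny sketch of the idea with a hypothetical three-weight rubric polytope (my numbers, not the actual syllabus): since a grade is linear in the rubric weights, maximizing it over the polytope just means checking the polytope's vertices.

```python
from itertools import permutations

# Hypothetical polytope of reasonable rubrics: weights over (homework,
# midterm, final) summing to 1, each between 0.2 and 0.5.  Its vertices
# are exactly the permutations of (0.5, 0.3, 0.2).
vertices = set(permutations((0.5, 0.3, 0.2)))

def grade(scores):
    """Maximize the student's weighted score over all reasonable rubrics."""
    return max(sum(w * s for w, s in zip(v, scores)) for v in vertices)

# Students with complementary strengths each get graded on their strength:
homework_star = grade((95, 70, 70))   # best rubric puts weight 0.5 on homework
final_star = grade((70, 70, 95))      # best rubric puts weight 0.5 on the final
print(homework_star, final_star)      # ~82.5 each, vs ~78.3 under equal weights
```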
In preparation for teaching the simplex method, I needed a Klee-Minty Cube showing exponentially many pivots all improving the LP objective value.
The path shown below is entirely increasing in the direction c=(4,10,25) and visits all 2^3 corners.
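For a sense of the construction, here is a standard epsilon-perturbed Klee-Minty cube (my guess at a representative variant; the printed model's c=(4,10,25) may come from a different scaling). Sorting the 2^3 corners by objective value gives a strictly increasing path through all of them, with each pivot flipping exactly one bound:

```python
from itertools import product

# Epsilon-perturbed Klee-Minty cube, maximizing x3:
#   0 <= x1 <= 1,  eps*x1 <= x2 <= 1 - eps*x1,  eps*x2 <= x3 <= 1 - eps*x2.
# Each vertex picks the lower (0) or upper (1) bound at every level.
eps = 0.1

def vertex(bits):
    x1 = float(bits[0])
    x2 = eps * x1 if bits[1] == 0 else 1 - eps * x1
    x3 = eps * x2 if bits[2] == 0 else 1 - eps * x2
    return (x1, x2, x3)

corners = {bits: vertex(bits) for bits in product((0, 1), repeat=3)}
path = sorted(corners, key=lambda b: corners[b][2])  # order by objective x3
values = [corners[b][2] for b in path]

# Strictly increasing, and every consecutive pair differs in one bound,
# so the simplex method can pivot through all 2^3 corners.
assert all(v1 < v2 for v1, v2 in zip(values, values[1:]))
assert all(sum(a != b for a, b in zip(u, v)) == 1 for u, v in zip(path, path[1:]))
print(path)
```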
I had the privilege to talk about my recent work improving gradient descent's theory at MIT last week.
If you'd like to see my first go at presenting this work,
Thank you
@shuvo_das_gupta
for organizing (and for your numerical work that motivated mine)!
Prof. Ben Grimmer (
@prof_grimmer
) gave a virtual talk at MIT Operations Research Center last Friday about his recent work on faster gradient descent via long steps, which was featured in
@QuantaMagazine
. The YouTube link to his talk is here: !
By far my biggest 3D printing project: 49 operator norm balls!
Below are induced l_p -> l_q matrix norm balls with 7 values of p, q. The base color indicates whether computing the norm is polynomial (green), NP-Hard (red/orange) or open (white).
Details:
My adviser Jim Renegar showed me the amazing unifying power of viewing optimization through convex cones: LPs via the nonnegative orthant, second-order & semidefinite programming via their namesake cones, and NP-hardness via copositive cones
Below are the models I use to teach this
Details
Doing something new today: an art piece (my first) is on exhibition at the Bridges conference
It's a shelf with 49 induced matrix norm balls, showing their varied computational complexity
Details:
Full gallery:
Recent methods for nonconvex, nonsmooth Lipschitz minimization attain "Goldstein" stationarity: an average of nearby gradients is small
In a short(!) paper
@jiazhichao4
and I extend this to Lipschitz constraints: averaging nearby gradients satisfies KKT!
@octonion
I had my computer brute force search through a lot of possible patterns h. Most failed. May just be my proof technique being brittle but the patterns I give in the paper would fail to have a descent lemma if you permute the order at all. They seem very special.
The journal version of one of my PhD results just appeared! If you prove a guarantee for your minimization algorithm under strong convexity, a guarantee under just convexity (often) follows right away (no changing or perturbing the method needed).
🧵1/2
@loiseau_jc
I'm no expert there, but I think Yes.
The main previous result about long step gradient descent was from Young in 1953, who showed taking stepsizes based on the Chebyshev polynomial roots optimally accelerated convergence on quadratics. His argument was exactly about this
Just finished printing my students' final project in my "Freshman Experience in Applied Math" course! They wrapped a hexagonal grid's shadow perfectly onto a box
@JohnsHopkinsAMS
@HopkinsEngineer
@JohnsHopkins
I'll be talking this afternoon about Provably Faster Gradient Descent via Long Steps
@MOPTA_ORMS
.
(If you aren't here or can't make it, a previous recording is online at )
How hard is optimizing over a smooth and/or strongly convex set? If you only know how to check set feasibility and compute normal vectors, you get the same optimal rates as minimizing strongly convex/smooth functions
My work on this just appeared on arXiv
Often seeking sparse solutions to optimization problems leads to adding nonconvex regularization. The family photo below gives some intuition for this effect from Smoothly Clipped Absolute Deviation (SCAD) regularization constraint sets {x | s(x)<p} for varied p
A nontrivial way to build norms in d dimensions: given any vector, compute a k<d dimension vector out of its top k magnitude entries and then use any k dimensional norm.
Below are the 3d norms given by all the 2d p-norms this way. (As p->infty, we just get the infty norm ball)
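A quick numerical sanity check of the construction, spot-checking the norm axioms at random points (my own choice of k=2 and p=1.5; evidence, not a proof):

```python
import numpy as np

def topk_pnorm(x, k=2, p=1.5):
    """'Top-k' norm from the construction above: apply the k-dimensional
    p-norm to the k largest-magnitude entries of x."""
    top = np.sort(np.abs(np.asarray(x, dtype=float)))[-k:]
    return float(np.sum(top ** p) ** (1 / p))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # Triangle inequality and positive homogeneity at random samples:
    assert topk_pnorm(x + y) <= topk_pnorm(x) + topk_pnorm(y) + 1e-12
    assert abs(topk_pnorm(2.5 * x) - 2.5 * topk_pnorm(x)) < 1e-9
print("norm axioms hold on all samples")
```

One way to see why this works: the top-k p-norm is the max of ||x_S||_p over all k-element index sets S, and a max of seminorms is again a (semi)norm.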
A while back, I 3D printed some convex cones as teaching props
( )
I'm looking for more fun ones to add to my collection, seeking recommendations :)
My current "todo" list:
completely positive, exponential,
and sum-of-squares cones (and their duals)
Just posted my newest work: How hard is it to minimize a sum of varied smooth and nonsmooth terms?
Surprisingly, the optimal rate is the sum of each term's individual-setting rates!! Moreover, it is attained universally by an algorithm of Nesterov.
This area is rapidly evolving!
Jason Altschuler and Pablo Parrilo posted long-step acceleration results for strongly convex problems just last Friday!
We are cracking this problem one small (or long) step at a time :)
[3/3]
If you thought p-norm balls were cool, wait til you see their matrix version...
Below are unit Schatten p-norm balls over 2x2 symmetric matrices. Common norms: p=2 is Frobenius, p=inf is Operator Norm, p=1 is Nuclear Norm capturing low-rankness
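These special cases are easy to check numerically (a small sketch comparing against NumPy's built-in matrix norms):

```python
import numpy as np

def schatten(A, p):
    """Schatten p-norm: the vector p-norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.linalg.norm(s, ord=p))

A = np.array([[2.0, 1.0], [1.0, -1.0]])  # a sample 2x2 symmetric matrix

# p=2 is Frobenius, p=inf is the operator (spectral) norm, p=1 is nuclear:
assert np.isclose(schatten(A, 2), np.linalg.norm(A, 'fro'))
assert np.isclose(schatten(A, np.inf), np.linalg.norm(A, 2))
assert np.isclose(schatten(A, 1), np.linalg.norm(A, 'nuc'))
print(schatten(A, 1), schatten(A, 2), schatten(A, np.inf))
```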
Details:
Just finished printing & distributing several Klee-Minty Cubes throughout my department for optimization classes (and fun) :)
If you'd like one, most university libraries can print them with my files below:
In preparation for teaching the simplex method, I needed a Klee-Minty Cube showing exponentially many pivots all improving the LP objective value.
The path shown below is entirely increasing in the direction c=(4,10,25) and visits all 2^3 corners.
On my way to give a seminar tomorrow at
@TippieAnalytics
. Will be talking about new projection-free "Radial" methods.
Example constraint set props are packed and ready to go :)
(Plenty of polyhedrons, spectrahedrons, norm balls, and sparsity inducing regularizers)
The key is stepsizes MUCH larger than 2/L, the classic limit!
Ours are built up from exponentially long patterns with exponentially large average stepsizes and a neat fractal shape.
(See the paper for their proper construction.) [2/3]
A two part series of papers I wrote towards the end of my PhD just appeared in Math Programming :)
My first ideas towards these papers were in Spring 2017, so they've been >6 years in the making.
Part I:
Part II:
Math Programming just announced Meritorious Service Awards for "timely, judicious, and considerate reviews". I wanted to thank the many judicious and considerate referees I have had. Some day I hope to have a timely one :)
(disclosure I am on their list)
Have you ever wondered what the Max-Cut Polytope, its semidefinite outer approx, and its trigonometric inner approx look like? For 3x3 matrices, it lives in 3D and is shown below. For 4x4 matrices, it lives in 6D and some slices are shown below.
Details:
Musing on Dual Norms: the most classic pair (IMHO) is L_1 and L_inf.
2D does not do duality justice, as they look like rotations of each other. Taking a dual interchanges the roles of points and faces. Below L_1 has 8 faces and 6 points aligned with L_inf's 8 points and 6 faces.
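The vertex/face swap also shows up computationally: the dual (inf) norm is the support function of the L_1 ball, and the max is always attained at one of the L_1 ball's 6 vertices. A quick check:

```python
import numpy as np

# The 6 vertices of the 3D L_1 ball are +/- the standard basis vectors.
vertices_l1 = [s * e for s in (1.0, -1.0) for e in np.eye(3)]

rng = np.random.default_rng(1)
for _ in range(100):
    y = rng.normal(size=3)
    # ||y||_inf = max { <x, y> : ||x||_1 <= 1 }, attained at a vertex:
    support = max(v @ y for v in vertices_l1)
    assert np.isclose(support, np.linalg.norm(y, np.inf))
print("support function over the L_1 ball matches the L_inf norm")
```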
As further reference objects littering my office...
I've 3D printed several common nonconvex regularizers:
q-norm (q<1) and SCAD regularization induce sparsity,
Schatten q-norm regularization inducing low-rankness.
Documented at:
Are you tired of your board games being stored unevenly due to their varied shapes?
Just build yourself an oppositely uneven shelf to get nice level tops when stored in exactly the right order...
@WhoAreYouMate
I have mixed intuitions (but that means it's a really good research question).
Once in the neighborhood of a local min, I'd expect this speeds up convergence. Global bounds less clear.
But... accelerated momentum methods don't generalize well, so there is also precedent to fail.
Deepmath2023 will be at JHU! Please consider submitting an abstract (deadline extended to 6/15)! . We encourage submissions from diverse disciplines including Statistics, Physics, Computer science, Neuroscience, Mathematics, Psychology, and Engineering.
Printing a new family of norm balls: Function Norms!
These ones are strange; the first one just finished!
Below is the p=4 function-norm unit ball: ||f||_4^4 = \int_{-1}^1 |f(x)|^4 dx <= 1
(Okay, functions give too big of a space so I am just looking at quadratics in the Legendre polynomial basis)
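A numerical sketch of evaluating such a function norm (my own trapezoid-rule discretization, checked against a closed form):

```python
import numpy as np
from numpy.polynomial import legendre

def function_p_norm(coefs, p=4, grid=100_001):
    """||f||_p = (integral_{-1}^{1} |f(x)|^p dx)^(1/p), with f given by its
    Legendre coefficients, via a fine trapezoid rule."""
    x = np.linspace(-1.0, 1.0, grid)
    fx = np.abs(legendre.legval(x, coefs)) ** p
    integral = np.sum((fx[1:] + fx[:-1]) / 2 * np.diff(x))
    return float(integral ** (1 / p))

# Sanity check against a closed form: f(x) = x has integral of x^4 = 2/5.
assert np.isclose(function_p_norm([0.0, 1.0]), (2 / 5) ** 0.25)
# A quadratic in the Legendre basis, like the printed ball's coordinates:
print(function_p_norm([0.3, -0.5, 0.8]))
```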
@chanwoopark20
I don't have a good intuition, my main source of inspiration was the BnB-PEP paper which also had these outlier long steps.
@ErnestRyu
,
@shuvo_das_gupta
's paper was a huge inspiration for my work. I love it.
My theory results couldn't match their numerics, though. The rates I
@roydanroy
@bremen79
Just doing deterministic schedules repeating the same pattern. I think the proof would naturally extend to backtracking ideas of adaptivity (check if sufficient descent came from doing the pattern, else cut the estimated smoothness constant in half).
Other adaptive ideas would
Bumping this old tweet since it's timely...
Consider grading your next optimization class via linear programming :)
It can be more flexible and forgiving than a fixed rubric. Over two years, I've yet to have a student complain (they like having their score maximized)
How to (optimally) design a fair/robust syllabus for optimization classes:
Below is the polyhedron of all rubrics I find reasonable, which I then maximize for each student. As a result, evaluation focuses on their individual areas of excellence.
Details:
If you are in Baltimore or Vancouver, this week I am giving seminars at Johns Hopkins (Thursday) and UBC (Friday) on my recent work developing new scalable projection-free optimization methods.
(If you aren't and still want to listen in, DM me and I can share zoom links.)
To give a good primal-dual story, we didn't just want primal and dual algorithm descriptions, but also primal and dual convergence rates. We can measure those as the iterates converging down to optimal and the model converging up to optimal.
I'll be giving a talk on "Radial Duality: Scalable Projection-Free Optimization" at
@iccopt2022
this afternoon. Rauch 201 at 3:50pm
#iccopt2022
#optimization
PS: Some 3D printed references will be used as figures.
@j_bertolotti
@adad8m
That's delightful!!
You might want to select p values nonuniformly to capture the duality equally. For each p>2, you could include its dual 1/(1-1/p) among the p<2 values.
The one line description:
Constrained optimization problems have another dual, given by an unconstrained gauge optimization problem.
Based on this "radial duality", we get radial optimization algorithms that avoid expensive projections and may outscale IPM and ADMM-style methods
Just wrapped up a great semester's "Freshman Experience in Applied Math" course with some fantastic Hopkins students, below showing off their creation.
Their writeup:
A 3D model viewer and their design files:
And the grand finale: the convergence theory is (mostly) symmetric, as it should be! Both primal and dual convergence for the subgradient method happen at the same rate!!!!
(See the paper for the more general non-Lipschitz, stochastic, proximal, functional constrained results).
Dyck paths are non-decreasing lattice walks from (1,1) to (k,k), a combinatorial classic.
Andrew Daw
@USCMarshall
recently told me about a nice polytope indexed in one of his papers ()
He'll get the 3D print below at
@INFORMS2023
:)
Just printed another norm ball for my collection. Guess a formula for what norm generates this shape
Hint(?): this is called a cuboctahedron, whose dual/polar is a rhombic dodecahedron.
Latest Hopkins Engineering Magazine has an article on visualizations of difficult scientific concepts. It's a nice read; a few of my 3D printed operator norm balls appear at the end
When managing risk one often quantifies how bad the worst events are. Conditional Value at Risk (CVaR) measures the expected loss in the worst q% of events. This can be viewed as the 1-norm of the q% largest-magnitude entries.
This is a norm!! CVaR Unit Ball is below in 3D with q=66%
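A minimal sketch of this norm (with k counting the worst events directly rather than as a percentage):

```python
import numpy as np

def cvar_norm(x, k):
    """CVaR-style norm: the 1-norm of the k largest-magnitude entries of x
    (k plays the role of the worst q% of events)."""
    return float(np.sort(np.abs(x))[-k:].sum())

# k=2 of 3 entries matches the q=66% ball described above.
x = np.array([3.0, -1.0, 2.0])
assert cvar_norm(x, 2) == 5.0               # |3| + |2|
assert cvar_norm(x, 3) == np.abs(x).sum()   # k=n recovers the 1-norm
assert cvar_norm(x, 1) == np.abs(x).max()   # k=1 recovers the inf-norm
print(cvar_norm(x, 2))
```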
@amitra_90
@konstmish
Momentum methods still give better rates given perfect first-order oracles (in the regular convex optimization world at least).
Noisy/inexact/stochastic settings are exactly where gradient descent can pull ahead of Nesterov's accelerated method, so you are exactly pointing to a
@SAnsumali
I don't know what the best pattern is, but I was able to prove patterns of increasing length (up to t=127) get increasing gains. The alternating example is just showing gains in the simplest case of cycling through t=2 sizes.
The best one I found (surely not universally best) is
@GuilleAngeris
I would love one! I conjecture in the paper that a cyclic pattern of length t should be able to speed things up by at least log(t), based on my computer certificates up to solving 129x129 SDPs
Any path I see to prove that would likely need to do away with the computer assistance
@McFunkypants
@QuantaMagazine
@parshallison
It is fantastic! Much, much prettier than the plot from my paper (but still evoking a connection, for those off the deep end reading the study, with the same smooth curve over stairstep curve)
@XiongZeng111
Absolutely! Cutting perpendicular to any of the standard basis directions (including horizontally) gives a 4-norm one dimension lower. So the cut of a 4-norm ball that generates 2-norms must be different...
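One candidate for such a special cut (my guess, possibly not the printed one) is the plane x1+x2+x3=0, where the identity x1^4+x2^4+x3^4 = (x1^2+x2^2+x3^2)^2/2 holds, making the 4-norm slice perfectly round. A numerical check:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    a, b = rng.normal(size=2)
    x = np.array([a, b, -a - b])      # a point on the plane x1+x2+x3 = 0
    quartic = np.sum(x ** 4)
    quadratic = np.sum(x ** 2)
    # On this plane, the quartic is exactly half the squared quadratic:
    assert np.isclose(quartic, quadratic ** 2 / 2)
print("on the plane, the 4-norm level sets are round (2-norm) level sets")
```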
@thserra
It makes for a great lecture one, combining syllabus review with a natural example of optimization (that the students deeply care about).
It's also small enough to explicitly do the dual. My example grading in that writeup sows the seeds for attacking the dual grading program
How much longer does it take to 3D print a model after doubling its size in each dimension?
(hint: printers infill interiors, as shown in the cube print below, so the whole volume need not be printed)
@colin_fraser
An intuition is hard; I needed a computer-aided proof technique, so some mystery remains.
If pressed to give a gut feeling, one is that the "bad case" for a long step is if the objective looked like a "U" or check-mark shape where we overshoot. If that was the case, we
Designing new algorithms minimizing problems reformulated in terms of gauges squared, we get some promising numerical results compared to standard solvers!
(Admittedly, these preliminary plots are for somewhat simple synthetic problems, realistic validation is still needed.)
The subgradient method is a classic; here we are thinking of strongly convex minimization.
Repeatedly linearize (below) your objective function and then move downhill.
(The paper above also allows for proximal terms and functional constraints via switching, but let's keep it simple)
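A minimal sketch of the primal method on a toy strongly convex problem (my example, with the classic 2/(mu(k+1)) stepsizes):

```python
import numpy as np

# Toy strongly convex nonsmooth objective f(x) = ||x||_1 + (mu/2)||x||^2,
# minimized at x = 0 with optimal value 0.
mu = 1.0
f = lambda x: np.abs(x).sum() + 0.5 * mu * x @ x
subgrad = lambda x: np.sign(x) + mu * x   # one valid subgradient of f

x = np.array([3.0, -2.0])
best = f(x)
for k in range(1, 2001):
    x = x - (2 / (mu * (k + 1))) * subgrad(x)   # move downhill along -g_k
    best = min(best, f(x))
print(best)   # approaches the optimal value 0
```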
@ben_golub
Wlog x is the origin. Then the boundary point you are looking at is b(v) = v/MinkowskiGauge(v).
The Minkowski gauge is 1/R-Lipschitz continuous where R is the distance to the set's boundary. Your mapping b will inherit this except near the origin.
The dual perspective on this follows from Dual Averaging (Nesterov 2005), below specialized and extended for strongly convex settings.
At each iteration take a weighted average of our lower bounds to model the objective, minimize the model (plus regularization).
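A toy sketch of the dual view (my own simple unweighted instantiation with a sqrt(k) quadratic regularizer, so the model minimizer has a closed form):

```python
import numpy as np

# Dual-averaging sketch: keep a running sum of subgradient lower bounds,
# then minimize the linear model plus (beta_k/2)||x||^2, which gives the
# closed-form model minimizer x_{k+1} = -g_sum / beta_k.
f = lambda x: np.abs(x).sum()    # toy objective, minimized at 0
subgrad = lambda x: np.sign(x)

x = np.array([2.0, -1.5])
g_sum = np.zeros(2)
for k in range(1, 1001):
    g_sum += subgrad(x)          # accumulate the lower-bound slopes
    beta = np.sqrt(k)            # standard sqrt(k) regularization schedule
    x = -g_sum / beta            # minimize the regularized model
print(f(x))
```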
@ErnestRyu
@shuvo_das_gupta
If I may advertise your result :P ,
your solver is amazing and the software under the hood is fantastically useful and well documented! It's so good I put it in the acknowledgements at the end of my paper!!
Anyone here wanting to dive into PEP proofs should read yours!
@sjaak_blom
Just gradient descent, x_{k+1}=x_k - (h_k/L)\nabla f(x_k), as Cauchy did 150+ years ago.
No modern bells or whistles like stochastic variants; those would be interesting to see
@damekdavis
I'm thinking it is arbitrarily slow (under one interpretation).
Pick any sequence of pos integers {a_i}. Let x_0=1 with x_{k+1}=x_k/2 whenever k\in {a_i}, else keep it constant x_{k+1}=x_k.
Picking a_i growing arbitrarily fast makes |x_i-x_{i+1}|'s convergence arbitrarily slow.
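The construction in code (with a hypothetical choice of {a_i}):

```python
# Halve x only at a sparse set of indices {a_i}; the faster a_i grows,
# the slower the step sizes |x_k - x_{k+1}| decay along the sequence.
def sequence(halving_times, n):
    x, xs = 1.0, [1.0]
    for k in range(n):
        x = x / 2 if k in halving_times else x
        xs.append(x)
    return xs

xs = sequence({10, 100, 1000}, 2000)   # a_i growing geometrically
diffs = [abs(a - b) for a, b in zip(xs, xs[1:])]
# The differences are 0 except at the a_i, where x halves:
print([d for d in diffs if d > 0])     # [0.5, 0.25, 0.125]
```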
@netw0rkf10w
@konstmish
There are several papers going back to Young in 1953 that get long step speed ups for quadratic minimization. My writeup is missing this most recent one, so the next arxiv post will add it in :) . Thank you!!
The proof tools from that line don't seem to generalize to general
It's not eight times, scaling with volume, as most of the interior is not printed (that is, the n^3 coefficient is small)
It's primarily four times, scaling with surface area, as the thick boundary dominates the print time
If you are printing fractals, the answer is more fun...
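Back-of-the-envelope check with a hollow cube (made-up dimensions):

```python
# Print time roughly tracks deposited material: a thick shell plus sparse
# infill.  Doubling the size roughly quadruples the shell volume (surface
# area scaling), not 8x (solid volume scaling).
def shell_volume(side, wall):
    """Material in a hollow cube of the given side length and wall thickness."""
    return side ** 3 - (side - 2 * wall) ** 3

small = shell_volume(50.0, 1.2)    # 50mm cube, 1.2mm walls (sample numbers)
big = shell_volume(100.0, 1.2)     # the same model doubled in each dimension
print(big / small)                 # ~4.1, close to the surface-area factor of 4
```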
@sp_monte_carlo
@typedfemale
It's a lot like Chebyshev methods; I expect similar brittleness.
I really wanted to use the spectral perspective analysis from that line of work, but since smooth convex functions can have changing Hessians, I couldn't find a footing. Ended up having an SDP do the hard spectral work.
The results work for nonsmooth, smooth, or Hölder smooth problems and under any Hölder growth (generalizing strong convexity);
Using them on bundle methods and dual averaging, where generalizing existing proofs gets tricky, gave new guarantees without substantial pain :)
🧵2/2
@realShengchao
@gabrielpeyre
The best theorem to my knowledge is the LD^2/2T rate bound for constant stepsize one. Of course that theorem doesn't prove it's actually the best. Taylor et al conjectured in 2017 what the optimal choice between one and two would be if T is given ahead of time. I don't know of a
@timoberthold
Please steal freely (I'll also make/post a typo free copy this afternoon)!
Eventually I'll have some 3D printed and will share the source so any can get a physical copy (from their university library)
@bremen79
@damekdavis
The method takes as input the target accuracy eps, you may call that a parameter.
It also has a constant under the hood that is adapting from continual linesearching, you may call that a parameter.
I'm not inclined to call either of those parameters, so my mental lexicon says Yes
The gauge of a set is the function whose level sets are scaled copies of the set. Evaluating a set's gauge and its gradient is often cheap (comparable to checking membership and computing a normal vector).
We show the gauge squared is smooth/strongly convex when the set is!
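A small sketch with an ellipsoid {x : x^T A x <= 1} (my example of a smooth, strongly convex set), whose gauge is sqrt(x^T A x):

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
gauge = lambda x: np.sqrt(x @ A @ x)     # gauge of the ellipsoid x^T A x <= 1

x = np.array([1.0, 2.0])
# Positive homogeneity: level sets of the gauge are scaled copies of the set.
assert np.isclose(gauge(3.0 * x), 3.0 * gauge(x))

# gauge^2 = x^T A x has constant Hessian 2A, so it is L-smooth and
# mu-strongly convex with mu, L twice the extreme eigenvalues of A.
mu, L = 2 * np.linalg.eigvalsh(A)
print(mu, L)
```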