Jeremy Cohen

@deepcohen

3,843 Followers
894 Following
88 Media
1,070 Statuses

PhD student in machine learning at Carnegie Mellon. The goal of my research is to turn deep learning into a real engineering discipline.

Pittsburgh, PA
Joined September 2011
Pinned Tweet
@deepcohen
Jeremy Cohen
2 years
This amazing paper by Alex Damian, Eshaan Nichani, and Jason Lee is a critical contribution to deep learning theory and optimization theory. The paper contains the key to understanding many surprising aspects of neural net training dynamics, including
6
63
425
@deepcohen
Jeremy Cohen
1 year
Today I learned that Noam Shazeer, who authored a bunch of the important Google NLP papers, was an IMO gold medalist
Tweet media one
11
46
731
@deepcohen
Jeremy Cohen
4 years
Our new ICLR paper demonstrates that when neural networks are trained using *full-batch* gradient descent: (1) the training dynamics obey a surprisingly simple "story", and (2) this story contradicts a lot of conventional wisdom in optimization.
10
109
679
@deepcohen
Jeremy Cohen
9 months
There is a discussion on ML reddit about a paper which shows loss curves for different models that seem to "line up." People were saying that it looks suspicious. Actually, in my experience, this is a real thing that happens when the data batch ordering is the same.
Tweet media one
31
33
475
@deepcohen
Jeremy Cohen
5 years
If you thought that deep learning was a crazily active research field:
Tweet media one
3
56
382
@deepcohen
Jeremy Cohen
3 years
At the CMU ML PhD retreat, we played a trivia round of “machine learning puns.” The solution to each of the below questions is a machine learning term that one might see e.g. on a cheat sheet for an intro ML course. Answers are in the next tweet.
Tweet media one
Tweet media two
4
49
375
@deepcohen
Jeremy Cohen
2 years
The “theory-vs.-practice spectrum” is not a helpful concept. I think that a better mental model is a 2-dimensional rigor/reality plane. The goal is to always be on the Pareto frontier of rigor and reality. For algorithms papers: if someone else has another algorithm that is …
5
56
376
@deepcohen
Jeremy Cohen
2 years
I can’t empathize at all with the take that LLM-assisted writing of academic research is bad, much less a form of “plagiarism.” Research isn’t schoolwork, and it’s not a zero-sum game between researchers; it’s a contest between humanity and Nature - all of us are on the same side.
5
31
322
@deepcohen
Jeremy Cohen
2 years
Everyone interested in DL optimization should read the preface to Yurii Nesterov's famous textbook. It was only *after* theoretical analyses became predictive of experimental results that it became common, and then obligatory, for scientific papers about optimization to prove
Tweet media one
9
45
321
@deepcohen
Jeremy Cohen
4 years
@wfithian @RandomlyWalking In a WSJ op-ed last month, these same two guys used "fraction of NBA players who tested positive for COVID by March 19" as an estimate for "fraction of the COVID-positive population in cities with NBA teams." Craziest thing I've ever read.
Tweet media one
10
26
254
@deepcohen
Jeremy Cohen
2 years
I tell new PhD students to pick a research topic according to three criteria: (1) the problem should be important, (2) it should have a reasonable chance of being solvable, and (3) you should personally have a unique edge.
@nntaleb
Nassim Nicholas Taleb
2 years
The only writing advice I've ever given: write the book that nobody else can write. If there is a single person on Planet Earth who can write anything close to it, find a hobby. Generalize to every line you write. Those who didn't follow such a guideline are punished by ChatGPT.
98
412
3K
6
44
242
@deepcohen
Jeremy Cohen
2 years
This is an interesting new paper on the topic of feature learning. They show that several aspects of feature learning in neural nets can be reproduced in a simpler kernel-based model dubbed a “Recursive Feature Machine” (RFM)
6
40
230
@deepcohen
Jeremy Cohen
2 years
Find someone who looks at you the way NeurIPS reviewers look at “\begin{theorem}”.
7
6
228
@deepcohen
Jeremy Cohen
6 years
1/ I'm excited to share our work on randomized smoothing, a PROVABLE adversarial defense in L2 norm which works on ImageNet! We achieve a *provable* top-1 accuracy of 49% in the face of adversarial perturbations with L2 norm less than 0.5 (=127/255).
2
60
209
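For readers unfamiliar with randomized smoothing, here is a minimal PyTorch sketch of the prediction step described in the thread: classify many Gaussian-perturbed copies of the input and take a majority vote. The function name, the default sigma, and the assumption that `base_classifier` maps a single input tensor to a logit vector are my own illustrative choices; the paper's actual procedure also runs a statistical test to certify an L2 radius, which is omitted here.

```python
import torch

def smoothed_predict(base_classifier, x, sigma=0.5, n=1000, num_classes=10):
    """Majority vote of the base classifier over Gaussian-perturbed copies of x.

    Sketch only: the full method pairs this vote with a confidence bound that
    certifies robustness inside an L2 ball around x.
    """
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            pred = base_classifier(noisy).argmax(dim=-1)   # predicted class index
            counts[pred] += 1
    return counts.argmax().item()
```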
@deepcohen
Jeremy Cohen
4 years
My take is that NeurIPS/ICML are ML methods venues, and application papers are better off being published in the literature of the relevant field. The real problem is that ML methods work is seen by CS depts as more prestigious than ML applications work.
5
30
215
@deepcohen
Jeremy Cohen
2 years
There is a gaping hole in the literature regarding the purpose of weight decay in deep learning. Nobody knows what weight decay does! AFAIK, the last comprehensive look at weight decay was this 2019 paper , which argued that weight decay
@cosminnegruseri
Cosmin Negruseri
2 years
@zacharynado this is great! curious what do you think about adamW, seems like a critical component for chinchilla, but you don't mention it in the guide
2
0
1
9
19
178
@deepcohen
Jeremy Cohen
11 months
I think the reason why second-order methods keep underperforming relative to first-order methods in deep learning is that first-order methods are more powerful than the theory gives them credit for. First-order methods + large step sizes can implicitly access specific **third
@yaroslavvb
Yaroslav Bulatov
11 months
People tried to make 2nd-order methods practical for NNs, without obvious success. The obstacle is likely the "stochastic" part and not the "non-linear" part, as a similar thing happened in the linear setting. Consider the works of Nocedal 10 years ago and Schraudolph 20 years ago
7
19
139
3
19
158
@deepcohen
Jeremy Cohen
1 year
This fascinating ICML '22 paper argues that the practical effectiveness of the KFAC optimizer is unrelated to its original motivation as approximating the Fisher information, and is instead due to an implicit effect stemming from a damping heuristic:
3
20
131
@deepcohen
Jeremy Cohen
1 year
Our lack of basic knowledge about DL optimizer dynamics is holding up the whole rest of the field. The optimizer is the lowest level of the deep learning 'stack' - all DL happens through an optimizer. Until we understand optimization, it will always be a confounder.
@yoavgo
(((ل()(ل() 'yoav))))👾
1 year
@haldaume3 maybe if you used better features? did you tune your learning rate? which optimizer were you using, what if you tried adam? oh you used bert-base and not roberta-large? its the same story all the time
3
1
40
3
6
123
@deepcohen
Jeremy Cohen
4 years
@seanjtaylor This book, written by the inventor of A-star search, is a freakishly thorough history of the whole field of AI: .
2
29
120
@deepcohen
Jeremy Cohen
4 years
Square loss works basically as well as cross-entropy loss on classification tasks. For example, square loss gets 76.0% top-1 accuracy for ResNet-50 on ImageNet, compared to 76.1% for cross-entropy.
@arxiv_cs_LG
cs.LG Papers
4 years
Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks. Like Hui and Mikhail Belkin
1
6
18
7
16
112
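Mechanically, swapping cross-entropy for square loss in a PyTorch training loop is a one-line change; a sketch under the assumption of ImageNet-style logits and integer labels (the paper in the quoted tweet studies the comparison far more carefully than this bare swap):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 1000)                 # e.g. ResNet-50 outputs for a batch
labels = torch.randint(0, 1000, (32,))         # integer class labels

# Standard cross-entropy loss.
ce_loss = F.cross_entropy(logits, labels)

# Square loss against one-hot targets.
targets = F.one_hot(labels, num_classes=1000).float()
sq_loss = F.mse_loss(logits, targets)

print(ce_loss.item(), sq_loss.item())
```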
@deepcohen
Jeremy Cohen
5 years
If anyone out there is chasing citations, you could write “Convolutional Neural Networks for Coronavirus Detection” in four hours tonight, and six months from now it will be the most cited paper in the history of CS.LG.
4
5
89
@deepcohen
Jeremy Cohen
3 years
People learning about NTK are often confused by the following apparent paradox: in the NTK regime, the last-layer feature kernel (and the NTK) do not evolve in the infinite-width limit, yet somehow the network still fits the training dataset. How can the network fit the ....
2
7
87
@deepcohen
Jeremy Cohen
2 years
I'll be at NeurIPS from Tuesday - Saturday. I'd love to chat with anyone interested in neural net training dynamics, neural net optimization theory, neural net initialization, or related topics. Happy to explain "edge of stability" to anyone interested. DM me to schedule!
1
6
79
@deepcohen
Jeremy Cohen
3 years
The theory of optimization for deep learning is still in the luminiferous aether phase. If you work in theory and want to know what gradient descent *actually does* when it trains neural networks, our paper was written for you:
@tomgoldsteincs
Tom Goldstein
3 years
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
28
217
924
3
11
73
@deepcohen
Jeremy Cohen
1 year
It's great to see Dr. Chernoff beating the tail bounds.
@stat110
Joe Blitzstein
1 year
My colleague Herman Chernoff is turning 100 this year! We're having a centennial celebration on May 5 at Harvard: Hope to see a lot of you there!
1
12
74
1
6
76
@deepcohen
Jeremy Cohen
2 years
I’ve seen people justify LLMs along the lines of “the LLM will only be used for editing, not for writing the paper.” IMO, this is beside the point. I don’t care if an LLM came up with the research idea, ran the experiments, and proved the theorems! If someone has figured
@deepcohen
Jeremy Cohen
2 years
I can’t empathize at all with the take that LLM-assisted writing of academic research is bad, much less a form of “plagiarism.” Research isn’t schoolwork, and it’s not a zero-sum game between researchers; it’s a contest between humanity and Nature - all of us are on the same side.
5
31
322
1
7
65
@deepcohen
Jeremy Cohen
3 years
Answers:
Tweet media one
Tweet media two
3
1
62
@deepcohen
Jeremy Cohen
9 months
The first time I saw this, I assumed I either had a bug or was going crazy. But then I kept noticing it. Comparing notes with @alex_damian_ , he said he independently noticed the same thing. So it seems to be a legit phenomenon (though one that is still unexplained).
5
0
64
@deepcohen
Jeremy Cohen
1 year
A lot of peer review angst originates from the unfortunate custom in the ML literature that all ML papers -- even deep learning ones -- are to be written under the absurd pretense that theory and practice co-exist harmoniously today.
@minimario1729
Alex Gu
1 year
@thegautamkamath yeah, i basically said the contents of my tweet in my review. another reviewer disagreed, they said this is neurips and not a math journal so experiments are needed😆 interesting to see the differences, lots of opinions and healthy discussions on this paper all around!
1
0
8
3
5
63
@deepcohen
Jeremy Cohen
2 years
@ethanCaballero I would recommend the recipe "preLN + residual connections": x_{L+1} = MLP(LN(x_L)) + x_L. This recipe is supposed to have magical trainability properties - see section 5 in this paper: . (The theory in this paper is not BS.)
3
3
59
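A minimal PyTorch sketch of the recipe recommended in the tweet above, x_{L+1} = MLP(LN(x_L)) + x_L; the hidden width and the GELU nonlinearity are my own illustrative choices, not part of the recommendation.

```python
import torch
import torch.nn as nn

class PreLNResidualBlock(nn.Module):
    """Pre-LN residual block: x_{L+1} = MLP(LN(x_L)) + x_L."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),                 # illustrative choice of nonlinearity
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, transform, then add the skip connection.
        return self.mlp(self.norm(x)) + x

x = torch.randn(8, 256)
print(PreLNResidualBlock(dim=256, hidden=1024)(x).shape)   # torch.Size([8, 256])
```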
@deepcohen
Jeremy Cohen
2 years
I've noticed that a lot of people have the misconception that infinitely-wide neural nets only train as kernels if the learning rate is tiny. (Maybe b/c the NTK paper only studied gradient flow.) In fact, you only need the learning rate to be less than 1 / [max NTK eigenvalue]
1
1
57
@deepcohen
Jeremy Cohen
2 years
As noted in Remark 4 of this preprint, in neural network optimization, a gradient Lipschitz assumption (L-smoothness) is invalid, while a Hessian Lipschitz assumption might be valid. Why would the first and second derivatives have such different regularity properties? Well …
1
8
54
@deepcohen
Jeremy Cohen
2 years
One reason why the theory/practice dichotomy is unhelpful is that every theorist wants their work to be useful, and, similarly, every applied researcher wants their work to be mathematically principled. Nobody ever says “my career goal is to prove irrelevant theorems” or
3
2
50
@deepcohen
Jeremy Cohen
1 year
banning kernel regression with the kernel k(x, y) = π − cos⁻¹(x, y) because it is equivalent to a neural network with an infinite number of parameters
@norabelrose
Nora Belrose
1 year
I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.
Tweet media one
56
37
406
0
2
50
@deepcohen
Jeremy Cohen
1 year
If you're doing Taylor expansions on neural nets and need to replace ReLU with a smooth activation, consider the one proposed here: . It's closer to ReLU than are GeLU/ELU/softplus, and it can be used as a drop-in replacement for ReLU on SOTA archs.
1
8
48
@deepcohen
Jeremy Cohen
9 months
I’m giving a talk at 2:40pm at the Heavy Tails workshop on Friday. The talk is about DL optimization dynamics in general, and adaptive gradient methods in particular. I’m also around NeurIPS starting Wednesday, and would love to meet people with common interests - send me a DM!
3
2
49
@deepcohen
Jeremy Cohen
1 year
There's a funny "Where are they now?" post on the sketchy econ forum EJMR in which some economist declares that Noam went on to have a successful life because ... he subsequently scored well in the Putnam competition in college.
Tweet media one
2
1
49
@deepcohen
Jeremy Cohen
2 years
they are not spending their money in order to set up a “who is the best computer scientist” contest. They are looking for results!
1
0
48
@deepcohen
Jeremy Cohen
2 years
@beenwrekt @KameronDHarris I'd look at these: - - - -
2
2
48
@deepcohen
Jeremy Cohen
2 years
out how to do this, my only reaction is that I would like them to repeat this process on the 30 hard open problems in deep learning theory that are blockers for the research I want to do. When the US Congress or Google allocates funding for CS research over other priorities,
3
1
45
@deepcohen
Jeremy Cohen
2 years
I would love if the other people in my field of ML used LLMs to write clearer or otherwise better papers!
1
1
44
@deepcohen
Jeremy Cohen
2 years
“my career goal is to publish unprincipled hacks that get SOTA.”
3
1
41
@deepcohen
Jeremy Cohen
1 year
Someone should also give a talk “the uneasy relationship between deep learning and computer science,” about whether the methodology of computer science is the right methodology for studying deep learning.
@boazbaraktcs
Boaz Barak
1 year
Looking forward to talking in Sapienza University of Rome next week on "The uneasy relation between deep learning and statistics" Spent an unhealthy amount of time on the images for the title slide 😀
Tweet media one
7
33
326
1
1
41
@deepcohen
Jeremy Cohen
1 year
“Backpropagation through time” (It’s autodiff for RNNs)
@moyix
Brendan Dolan-Gavitt
1 year
What are some fancy sounding math terms that turn out to be something incredibly simple? Two I know of: - Hadamard product (element-wise matrix multiplication) - Laplace smoothing ("add one to the numerator and denominator to avoid DIV0")
81
28
508
3
1
42
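To make the "incredibly simple" point concrete: backpropagation through time is just reverse-mode autodiff applied to the unrolled recurrence. A minimal PyTorch sketch (sequence length, sizes, and the squared-norm loss are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNNCell(input_size=4, hidden_size=8)
xs = torch.randn(10, 1, 4)        # sequence of length 10, batch size 1
h = torch.zeros(1, 8)

for x_t in xs:                    # unroll the recurrence over time
    h = rnn(x_t, h)

loss = h.pow(2).sum()
loss.backward()                   # "backpropagation through time": gradients flow
                                  # back through all 10 unrolled steps
print(rnn.weight_hh.grad.shape)   # torch.Size([8, 8])
```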
@deepcohen
Jeremy Cohen
2 years
CMU students Charvi Rastogi and Ivan Stelmakh have been doing interesting research on peer review. Findings include: (1) ICML reviewers were more likely to recommend accepting a paper if it cited them, (2) 36% of ICML reviewers self-reported googling a paper they were reviewing.
@mlcmublog
ML@CMU
2 years
Two experiments in ICML and EC reporting data-driven insights: 1. What are the pros and cons of arXiving your preprint? 2. Do reviewers get biased by citations?
0
15
31
0
4
40
@deepcohen
Jeremy Cohen
1 year
ML systems question: would removing LayerNorm from Transformer speed up wall clock time for training/inference substantially? According to my blog-post-level understanding of the subject, transformer perf is now bottlenecked on moving memory around the GPU, with LayerNorm a big
@lorenzo_noci
Lorenzo Noci
1 year
How do you scale Transformers to infinite depth while ensuring numerical stability? In fact, LayerNorm is not enough. But *shaping* the attention mechanism works! w/ @ChuningLi @mufan_li @bobby_he @THofmann2017 @cjmaddison @roydanroy
Tweet media one
6
34
217
5
6
42
@deepcohen
Jeremy Cohen
5 years
I’m excited for @CMU to go through the mandatory ritual of waiting until one person tests positive, and *then* canceling everything. Who knows — maybe the thing that happened everywhere else won’t happen here!
2
2
39
@deepcohen
Jeremy Cohen
1 year
We absolutely cannot let the NVIDIA recruiters hear about this guy. "Minimum job requirements for new grads: eight (8) NeurIPS papers, three (3) seminal contributions to field of statistics."
@proneat
Praneeth Vepakomma
1 year
When ~24, C. R. Rao did this even before he got his PhD :) He had already published the Cramér–Rao bound, the Rao metric and Rao–Blackwellisation, which are widely studied across the books and used even today. A motivation for generations: PhD (1948), some results (1945)!
2
16
95
1
1
41
@deepcohen
Jeremy Cohen
2 years
The limits of theory: even the brilliant mathematician John von Neumann couldn't believe that the exponential-time simplex method would be a stellar algorithm for solving linear programs.
@CompSciFact
Computer Science
2 years
"[John von Nuemann] always contended with Dantzig that the simplex method would take an absurdly long amount of time to solve linear programming problems. It appears to be, oh, so far as I know, the one place where Johnny went very badly wrong." -- Philip Wolfe
3
8
60
1
0
40
@deepcohen
Jeremy Cohen
1 year
CMU has breadth requirements, so I know what PAC learning is, but I guess I must have missed the lecture where it was explained how this allegedly foundational framework accounts for the learning of representations, which is the basis of deep learning.
2
1
38
@deepcohen
Jeremy Cohen
2 years
reality, then you’re fine. Work is valuable so long as it’s on the Pareto frontier of rigor and reality.
2
1
37
@deepcohen
Jeremy Cohen
1 year
There is potentially room for mathematics to have real practical impact in the area of optimization for deep learning! The first step is replacing the not-true descent lemma with a correct mathematical theory for the local dynamics of gradient descent (and then SGD).
@gabrielpeyre
Gabriel Peyré
1 year
I will argue (with examples) that advanced maths is not super useful for deep learning, but deep learning leads to fun problems useful to advance mathematics. And mastering (arguably not very advanced) maths is definitely useful to get some insight and advance deep learning!
7
8
115
2
5
38
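For context, the "descent lemma" referred to above is the standard statement (my paraphrase, not from the tweet): if the gradient of f is L-Lipschitz, one gradient step with step size η satisfies

```latex
\[
  f\bigl(x - \eta \nabla f(x)\bigr)
    \;\le\; f(x) \;-\; \eta\left(1 - \tfrac{L\eta}{2}\right)\lVert \nabla f(x) \rVert^{2},
\]
```

which guarantees monotone decrease whenever η < 2/L. The earlier tweet about Remark 4 makes the related point that the L-smoothness hypothesis behind this bound is invalid for neural network training.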
@deepcohen
Jeremy Cohen
1 year
Weijie's use of quotation marks brings to mind an important point: in the sciences, the word "theory" just means understanding the underlying principles (e.g. the periodic table is a 'theory' of chemistry). In ML, the word "theory" is a pre-existing
@weijie444
Weijie Su
1 year
Good news while attending #ICML2023 : Our deep learning "theory" paper *A Law of Data Separation in Deep Learning* got accepted by PNAS! w/ amazing @hangfeng_he
7
9
124
1
1
37
@deepcohen
Jeremy Cohen
2 years
Everyone knows that there is no correlation between the rates given by these theorems, and the practical performance of deep learning optimization algorithms - thus nullifying the original motivation for these analyses, as narrated by Yurii Nesterov himself.
2
0
38
@deepcohen
Jeremy Cohen
11 months
A thorough study on transformer optimization issues! Note that the optimization failure mode studied in this paper (where training at a too-large LR causes the weights to grow and the loss to trend upward) seems to be conceptually distinct from EOS oscillations.
@Mitchnw
Mitchell Wortsman
11 months
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities: With fantastic collaborators @peterjliu , @Locchiu , @_katieeverett , many others (see final tweet!), @hoonkp , @jmgilmer , @skornblith ! (1/15)
Tweet media one
5
62
346
2
2
37
@deepcohen
Jeremy Cohen
1 year
Do you know what 3 words almost never appear in ML papers? "We don't know." That's weird, because the last time I checked, we know very little about deep learning! Here's to hoping that future ML papers include less BS and more "we don't know"s 🥂
1
1
33
@deepcohen
Jeremy Cohen
4 years
PyTorch now has high-level APIs for a bunch of fancy autodiff stuff (Jacobian, JVP, VJP, Hessian, HVP, VHP):
@PyTorch
PyTorch
4 years
v1.5: autograd API for Hessians/Jacobians, C++ frontend stable and 100% parity with Python, Better performance on GPU and CPU with Tensor Format ‘channels last’, distributed.rpc stable, Custom C++ class binding Release notes: Blog:
4
220
730
1
5
31
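A quick sketch of the `torch.autograd.functional` API that the quoted release announces; the cubic test function is arbitrary.

```python
import torch
import torch.autograd.functional as AF

def f(x):
    return (x ** 3).sum()          # scalar-valued test function

x = torch.randn(5)
v = torch.randn(5)

J = AF.jacobian(f, x)              # for scalar f this is just the gradient, shape (5,)
H = AF.hessian(f, x)               # full Hessian, shape (5, 5); here diag(6 * x)
_, hvp = AF.hvp(f, x, v)           # Hessian-vector product without materializing H
_, vjp = AF.vjp(f, x, torch.tensor(1.0))   # vector-Jacobian product

print(torch.allclose(H @ v, hvp))  # True (up to numerical precision)
```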
@deepcohen
Jeremy Cohen
2 years
+1 to this. The goal of the conference publication system should be to advance the field rapidly. When you’re dealing with legitimately hard problems, no one person or team is going to have all the answers — instead, many players are each going to contribute a small nugget.
@PreetumNakkiran
Preetum Nakkiran
2 years
these days as a {neurips/icml/iclr} reviewer, I tend to look for “one reason to accept” (eg one good idea/result) vs “reasons to reject”. most ML conf papers are flawed, but having one good idea is enough for the paper to be valuable
8
14
270
1
3
32
@deepcohen
Jeremy Cohen
2 years
more principled and equally effective, that’s bad; but if all the more principled algorithms perform worse, you’re fine. For “explanation” papers: if someone else has a more rigorous way to explain reality, that’s bad, but if all more rigorous explanations lose touch with
1
1
30
@deepcohen
Jeremy Cohen
11 months
@maksym_andr Unironically, probably GPT
Tweet media one
5
1
31
@deepcohen
Jeremy Cohen
2 years
Interesting TMLR paper studying the distribution of pre-activations of large depth, finite width resnets. I have to say that I disagree with this reviewer comment: “the authors should probably spend at least a little effort in demonstrating the possible usefulness of the theory,
@hayou_soufiane
Soufiane Hayou
2 years
My (1st long solo) paper has just been accepted at TMLR. I study the dynamics of the neurons in large (infinite) depth neural networks with skip connections. The dynamics resemble an SDE that is mainly controlled by the activation function. Link:
5
26
245
1
3
32
@deepcohen
Jeremy Cohen
3 years
“Hutch++” is designed for matrices where the largest few eigenvalues account for most of the trace. My matrix is like that, and using this method turned stochastic trace estimation from “not working” to “working.” The main idea of Hutch++ is this:
@deepcohen
Jeremy Cohen
3 years
PSA: there's now a substantially improved version of the Hutchinson trace estimator: .
2
11
64
1
0
30
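A rough NumPy sketch of the Hutch++ idea as I understand it (the function names and the three-way split of the matrix-vector budget are illustrative choices, not the paper's exact algorithm): project out a low-rank piece of A that captures the dominant eigenvalues, take its trace exactly, and run plain Hutchinson only on the deflated remainder, whose trace is small.

```python
import numpy as np

def hutchinson(matvec, n, m, rng):
    """Plain Hutchinson estimator: average g' A g over m random sign vectors."""
    return sum(g @ matvec(g) for g in rng.choice([-1.0, 1.0], size=(m, n))) / m

def hutch_pp(matvec, n, m, rng):
    """Hutch++-style estimator using roughly m matrix-vector products."""
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))
    # Q spans a low-rank approximation of A's dominant range.
    Q, _ = np.linalg.qr(np.column_stack([matvec(S[:, i]) for i in range(k)]))
    # Exact trace of the projected part: tr(Q' A Q).
    t_low = sum(Q[:, i] @ matvec(Q[:, i]) for i in range(Q.shape[1]))
    # Plain Hutchinson on the deflated remainder (I - QQ') A (I - QQ').
    t_rest = 0.0
    for g in rng.choice([-1.0, 1.0], size=(k, n)):
        g = g - Q @ (Q.T @ g)
        t_rest += g @ matvec(g)
    return t_low + t_rest / k

rng = np.random.default_rng(0)
A = np.diag(np.r_[100.0, 50.0, np.ones(98)])   # trace dominated by two eigenvalues
mv = lambda v: A @ v
print(np.trace(A), hutchinson(mv, 100, 30, rng), hutch_pp(mv, 100, 30, rng))
```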
@deepcohen
Jeremy Cohen
1 year
Bloomberg is hiring for an LLM training researcher, ideally Toronto-based (h/t @drosen ):
0
4
30
@deepcohen
Jeremy Cohen
2 years
Sanjeev Arora talks about this issue at minute 18 here: . This particular issue is high-profile and is already starting to get fixed, but a larger point remains: there is no evidence that convergence rates are a useful lens for
4
1
29
@deepcohen
Jeremy Cohen
4 years
19/ Over the next few weeks, I'll be doing a few other Twitter threads highlighting various implications of this paper!
2
0
26
@deepcohen
Jeremy Cohen
5 years
May we all age as gracefully as Leonid Pastur, eponym of the famous Marchenko-Pastur theorem (1967), who at age 82 is apparently now writing deep learning papers:
0
5
26
@deepcohen
Jeremy Cohen
2 years
I’ve been asked “when does this process end? Next year, will we hear that understanding neural net optimization requires fourth and fifth order expansions?” I believe the answer is no: I think third order will be both necessary and sufficient. The only reason why higher
@typedfemale
typedfemale
2 years
@soumithchintala only first and second-order taylor expansions should be legal
1
0
14
5
3
26
@deepcohen
Jeremy Cohen
2 years
In my particular case, I decided to work on connecting theory and practice in deep learning, because I judged that I was more interested in theory than most applied people, and more willing to "get my hands dirty" with experimentation than the theorists.
2
0
26
@deepcohen
Jeremy Cohen
2 years
Furthermore, it has recently become clear that these analyses are all wrong, and not in a minor way, but rather in a profound way: they are *causally backwards* on the most important question in deep learning optimization, which is how to set the learning rate.
1
2
26
@deepcohen
Jeremy Cohen
4 years
This paper has a really clear discussion of the relationship between several curvature matrices that commonly pop up in deep learning -- the Gauss-Newton, the Fisher, and the "empirical Fisher." Worth a read if you've ever been confused about these.
1
2
24
@deepcohen
Jeremy Cohen
4 years
"Smoothness" is maybe not an ideal name for the Lipschitz constant of the gradient, considering that a function with higher "smoothness" is less smooth.
1
2
25
@deepcohen
Jeremy Cohen
2 years
operating within a system that has decided, for no reason other than inertia, that the best type of contribution is to prove a theorem that looks like this:
Tweet media one
1
0
25
@deepcohen
Jeremy Cohen
2 years
In the near term, we don't need more cargo cult convergence rates, we need a real understanding of what the existing training recipes are doing. We need to do this methodically, one step at a time, making sure that simple things like gradient descent are well-understood
2
3
25
@deepcohen
Jeremy Cohen
6 years
the way you know Elon Musk is serious about AI is that he uses stochastic gradient descent to run his company.
@TeslaAgnostic
Realist
6 years
This is an actual quote.
Tweet media one
44
59
244
1
2
23
@deepcohen
Jeremy Cohen
4 years
2/ The teaser video above conveys the main idea: gradient descent with step size η typically operates in a regime (the "Edge of Stability") in which the maximum Hessian eigenvalue hovers just above the numerical value 2/η. This thread provides more detail.
2
0
23
@deepcohen
Jeremy Cohen
11 months
the M. Night Shyamalan approach to research
@roydanroy
Dan Roy
11 months
No no no no no no no no no. Thankfully, this advice was ignored by the authors. But this widespread but unspoken belief is why NeurIPS/ICML/ICLR reviewing for empirical papers is totally broken.
Tweet media one
26
26
379
0
0
25
@deepcohen
Jeremy Cohen
4 years
Looking for a way to pass the time while being responsibly locked in your apartment? ( #flattenthecurve ). Take a page from this guy in China, and teach your pet some convex optimization: . Coronavirus will eventually pass; gradient descent is forever.
1
3
23
@deepcohen
Jeremy Cohen
2 years
convergence rates. Today, ten years into the deep learning revolution, there is a dire need for legitimate optimization theory: the evolutionary process of "grad student descent" has yielded training recipes that work, but these recipes are brittle - depending in arcane ways on
1
0
23
@deepcohen
Jeremy Cohen
2 years
If George Dantzig submitted the simplex algorithm to NeurIPS today, there's a good chance the response would be: "strong empirical results, but lacks theoretical justification. I encourage the authors to fix these issues and re-submit to the next conference."
1
0
22
@deepcohen
Jeremy Cohen
2 years
The "edge" could be any of: - nobody else has thought to work on this problem - it's a math problem and you're a math genius - you have an uncommon combination of interests/skills - you're going to take a different approach than others
2
0
22
@deepcohen
Jeremy Cohen
5 years
Various companies and governments have invested bajillions of dollars in ML, yet a large fraction of reported deep learning results rely on that one random guy's github repo that comes up when you google "pytorch cifar." I hope he implemented DenseNet correctly!
1
0
21
@deepcohen
Jeremy Cohen
2 years
Basically, I agree with Yann:
@ylecun
Yann LeCun
2 years
This thread exposes a basic misunderstanding. Some believe that science evaluation comes down to evaluating scientists' intrinsic abilities, skills, or merit. For them, using AI tools is like "cheating". But science must solely evaluate *impact*. It's not a beauty contest.
23
39
385
1
0
20
@deepcohen
Jeremy Cohen
2 years
Nice! This paper identifies, and rigorously analyzes, a setting in which gradient flow fits a “sharp” (high curvature) solution that generalizes badly, yet gradient descent with a non-infinitesimal LR is forced to fit a flat solution that generalizes better.
@SebastienBubeck
Sebastien Bubeck
2 years
Why do neural networks generalize? IMO we still have no (good) idea. Recent emerging hypothesis: NN learning dynamics discovers *general-purpose circuits* (e.g., induction head in transformers). In we take a first step to prove this hypothesis. 1/8
9
45
316
0
2
22
@deepcohen
Jeremy Cohen
11 months
I’m giving a talk at INFORMS tomorrow morning around 10:15am at session WB 33 in CC-North 221C. The topic is optimization dynamics of deep learning. I’m around for the rest of the day and would love to chat with classical optimization people about deep learning - DMs are open.
0
0
22
@deepcohen
Jeremy Cohen
4 years
5/ More generally, on a multidimensional quadratic f(x) = ½ x' A x + b'x + c, if the curvature matrix "A" has any eigenvalue greater than 2/η, then gradient descent spins out of control along the corresponding eigenvector.
1
2
21
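A tiny NumPy illustration of the claim in the tweet above (the numbers are arbitrary): with step size η = 1 the stability threshold is 2/η = 2, so the coordinate with curvature 2.5 blows up while the coordinate with curvature 1.0 converges.

```python
import numpy as np

eta = 1.0                      # step size; stability threshold is 2 / eta = 2.0
A = np.diag([1.0, 2.5])        # curvature eigenvalues; 2.5 exceeds the threshold
x = np.array([1.0, 1.0])

for _ in range(10):
    x = x - eta * (A @ x)      # gradient descent on f(x) = 0.5 * x' A x  (b = 0, c = 0)

print(x)                       # [0.0, ~57.7]: the stable eigendirection converged,
                               # the unstable one grew by |1 - eta * 2.5| each step
```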
@deepcohen
Jeremy Cohen
3 years
Want to chat about this work? Come visit the ICLR poster session today at 5pm PST!
@deepcohen
Jeremy Cohen
4 years
Our new ICLR paper demonstrates that when neural networks are trained using *full-batch* gradient descent: (1) the training dynamics obey a surprisingly simple "story", and (2) this story contradicts a lot of conventional wisdom in optimization.
10
109
679
0
3
21
@deepcohen
Jeremy Cohen
6 years
Can't wait to engage in that classic Jewish Christmas tradition: hyperparameter sweep using all your lab's GPUs.
0
1
21
@deepcohen
Jeremy Cohen
4 years
18/ Overall, we hope that our paper will both (1) nudge the neural net optimization community away from widespread presumptions that appear to be false, and also (2) point the way forward by identifying precise empirical phenomena that are suitable for further study.
1
0
19
@deepcohen
Jeremy Cohen
1 year
@dmdohan @davidchalmers42 @mhutter42 @ShaneLegg @NoamShazeer @ilyasu Ilya Sutskever had an ICML 2011 paper on optimizing RNNs where he wrote the following:
Tweet media one
3
4
20
@deepcohen
Jeremy Cohen
2 years
if enough eyes get on the problem) but rather the fact that nobody works on it, because the talented people who should be working on this problem are instead wasting their time pretending to prove convergence rates.
2
0
21
@deepcohen
Jeremy Cohen
2 years
Permanent research positions have, historically, been scarce. If you want to stay in research, you should aim to be world-class at whatever it is that you do. The secret to success is to *pick that thing based on* the hand of cards you've already been dealt.
1
0
19
@deepcohen
Jeremy Cohen
2 years
Notably, I avoid working on high-profile open problems that are basically math problems. The principle is: don't waste your time trying to do something that someone else can do better!
0
0
20
@deepcohen
Jeremy Cohen
1 year
intellectual dishonesty, as papers with good algorithms are pressured to include phony theoretical justifications, and theory papers are pressured to include phony experimental support.
1
0
20
@deepcohen
Jeremy Cohen
1 year
Another interesting finding is that residual connections helped with not only optimization (the problem they are usually said to solve) but generalization too. I believe James Martens & collaborators also observed here that removing residual connections
@maksym_andr
Maksym Andriushchenko
1 year
Interesting work about scaling up *plain* MLPs (yes, even without patch extraction). You can get as far as 93.6% accuracy on CIFAR-10 with pre-training on ImageNet-21k. Impressive to see how far one can push data+compute even for such naive architectures!
Tweet media one
1
9
53
2
2
20
@deepcohen
Jeremy Cohen
4 years
I highly recommend this linear algebra book, especially for people who were not math majors. It presents everything from the POV of abstract vector spaces (rather than matrices), which, for me at least, was a great way to gain intuition.
3
2
17