Jeremy Cohen

@deepcohen

3,843 Followers
894 Following
88 Media
1,070 Statuses

PhD student in machine learning at Carnegie Mellon. The goal of my research is to turn deep learning into a real engineering discipline.

Pittsburgh, PA
Joined September 2011
Pinned Tweet
@deepcohen
Jeremy Cohen
2 years
This amazing paper by Alex Damian, Eshaan Nichani, and Jason Lee is a critical contribution to deep learning theory and optimization theory. The paper contains the key to understanding many surprising aspects of neural net training dynamics, including
6
63
425
@deepcohen
Jeremy Cohen
1 year
Today I learned that Noam Shazeer, who authored a bunch of the important Google NLP papers, was an IMO gold medalist
Tweet media one
11
46
731
@deepcohen
Jeremy Cohen
4 years
Our new ICLR paper demonstrates that when neural networks are trained using *full-batch* gradient descent: (1) the training dynamics obey a surprisingly simple "story", and (2) this story contradicts a lot of conventional wisdom in optimization.
10
109
679
@deepcohen
Jeremy Cohen
9 months
There is a discussion on ML reddit about a paper which shows loss curves for different models that seem to "line up." People were saying that it looks suspicious. Actually, in my experience, this is a real thing that happens when the data batch ordering is the same.
Tweet media one
31
33
475
@deepcohen
Jeremy Cohen
5 years
If you thought that deep learning was a crazily active research field:
Tweet media one
3
56
382
@deepcohen
Jeremy Cohen
3 years
At the CMU ML PhD retreat, we played a trivia round of “machine learning puns.” The solution to each of the below questions is a machine learning term that one might see e.g. on a cheat sheet for an intro ML course. Answers are in the next tweet.
Tweet media one
Tweet media two
4
49
375
@deepcohen
Jeremy Cohen
2 years
The “theory-vs.-practice spectrum” is not a helpful concept. I think that a better mental model is a 2-dimensional rigor/reality plane. The goal is to always be on the Pareto frontier of rigor and reality. For algorithms papers: if someone else has another algorithm that is …
5
56
376
@deepcohen
Jeremy Cohen
2 years
I can’t empathize at all with the take that LLM-assisted writing of academic research is bad, much less a form of “plagiarism.” Research isn’t schoolwork, and it’s not a zero-sum game between researchers; it’s a contest between humanity and Nature - all of us are on the same side.
5
31
322
@deepcohen
Jeremy Cohen
2 years
Everyone interested in DL optimization should read the preface to Yurii Nesterov's famous textbook. It was only *after* theoretical analyses became predictive of experimental results that it became common, and then obligatory, for scientific papers about optimization to prove
Tweet media one
9
45
321
@deepcohen
Jeremy Cohen
4 years
@wfithian @RandomlyWalking In a WSJ op-ed last month, these same two guys used "fraction of NBA players who tested positive for COVID by March 19" as an estimate for "fraction of the COVID-positive population in cities with NBA teams." Craziest thing I've ever read.
Tweet media one
10
26
254
@deepcohen
Jeremy Cohen
2 years
I tell new PhD students to pick a research topic according to three criteria: (1) the problem should be important, (2) it should have a reasonable chance of being solvable, and (3) you should personally have a unique edge.
@nntaleb
Nassim Nicholas Taleb
2 years
The only writing advice I've ever given: write the book that nobody else can write. If there is a single person on Planet Earth who can write anything close to it, find a hobby. Generalize to every line you write. Those who didn't follow such a guideline are punished by ChatGPT.
98
412
3K
6
44
242
@deepcohen
Jeremy Cohen
2 years
This is an interesting new paper on the topic of feature learning. They show that several aspects of feature learning in neural nets can be reproduced in a simpler kernel-based model dubbed a “Recursive Feature Machine” (RFM)
6
40
230
@deepcohen
Jeremy Cohen
2 years
Find someone who looks at you the way NeurIPS reviewers look at “\begin{theorem}”.
7
6
228
@deepcohen
Jeremy Cohen
6 years
1/ I'm excited to share our work on randomized smoothing, a PROVABLE adversarial defense in L2 norm which works on ImageNet! We achieve a *provable* top-1 accuracy of 49% in the face of adversarial perturbations with L2 norm less than 0.5 (=127/255).
2
60
209
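For readers unfamiliar with randomized smoothing, here is a minimal PyTorch sketch of the prediction step described in the thread: classify many Gaussian-perturbed copies of the input and take a majority vote. The function name, the default sigma, and the assumption that `base_classifier` maps a single input tensor to a logit vector are my own illustrative choices; the paper's actual procedure also runs a statistical test to certify an L2 radius, which is omitted here.

```python
import torch

def smoothed_predict(base_classifier, x, sigma=0.5, n=1000, num_classes=10):
    """Majority vote of the base classifier over Gaussian-perturbed copies of x.

    Sketch only: the full method pairs this vote with a confidence bound that
    certifies robustness inside an L2 ball around x.
    """
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            pred = base_classifier(noisy).argmax(dim=-1)   # predicted class index
            counts[pred] += 1
    return counts.argmax().item()
```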
@deepcohen
Jeremy Cohen
4 years
My take is that NeurIPS/ICML are ML methods venues, and application papers are better off being published in the literature of the relevant field. The real problem is that ML methods work is seen by CS depts as more prestigious than ML applications work.
5
30
215
@deepcohen
Jeremy Cohen
2 years
There is a gaping hole in the literature regarding the purpose of weight decay in deep learning. Nobody knows what weight decay does! AFAIK, the last comprehensive look at weight decay was this 2019 paper , which argued that weight decay
@cosminnegruseri
Cosmin Negruseri
2 years
@zacharynado this is great! curious what do you think about adamW, seems like a critical component for chinchilla, but you don't mention it in the guide
2
0
1
9
19
178
@deepcohen
Jeremy Cohen
11 months
I think the reason why second-order methods keep underperforming relative to first-order methods in deep learning is that first-order methods are more powerful than the theory gives them credit for. First-order methods + large step sizes can implicitly access specific **third
@yaroslavvb
Yaroslav Bulatov
11 months
People tried to make 2nd-order methods practical for NNs, without obvious success. The obstacle is likely the "stochastic" part and not the "non-linear" part, as a similar thing happened in the linear setting. Consider the works of Nocedal 10 years ago and Schraudolph 20 years ago
7
19
139
3
19
158
@deepcohen
Jeremy Cohen
1 year
This fascinating ICML '22 paper argues that the practical effectiveness of the KFAC optimizer is unrelated to its original motivation as approximating the Fisher information, and is instead due to an implicit effect stemming from a damping heuristic:
3
20
131
@deepcohen
Jeremy Cohen
1 year
Our lack of basic knowledge about DL optimizer dynamics is holding up the whole rest of the field. The optimizer is the lowest level of the deep learning 'stack' - all DL happens through an optimizer. Until we understand optimization, it will always be a confounder.
@yoavgo
(((ل()(ل() 'yoav))))👾
1 year
@haldaume3 maybe if you used better features? did you tune your learning rate? which optimizer were you using, what if you tried adam? oh you used bert-base and not roberta-large? its the same story all the time
3
1
40
3
6
123
@deepcohen
Jeremy Cohen
4 years
@seanjtaylor This book, written by the inventor of A-star search, is a freakishly thorough history of the whole field of AI: .
2
29
120
@deepcohen
Jeremy Cohen
4 years
Square loss works basically as well as cross-entropy loss on classification tasks. For example, square loss gets 76.0% top-1 accuracy for ResNet-50 on ImageNet, compared to 76.1% for cross-entropy.
@arxiv_cs_LG
cs.LG Papers
4 years
Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks. Like Hui and Mikhail Belkin
1
6
18
7
16
112
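Mechanically, swapping cross-entropy for square loss in a PyTorch training loop is a one-line change; a sketch under the assumption of ImageNet-style logits and integer labels (the paper in the quoted tweet studies the comparison far more carefully than this bare swap):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 1000)                 # e.g. ResNet-50 outputs for a batch
labels = torch.randint(0, 1000, (32,))         # integer class labels

# Standard cross-entropy loss.
ce_loss = F.cross_entropy(logits, labels)

# Square loss against one-hot targets.
targets = F.one_hot(labels, num_classes=1000).float()
sq_loss = F.mse_loss(logits, targets)

print(ce_loss.item(), sq_loss.item())
```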
@deepcohen
Jeremy Cohen
5 years
If anyone out there is chasing citations, you could write “Convolutional Neural Networks for Coronavirus Detection” in four hours tonight, and six months from now it will be the most cited paper in the history of CS.LG.
4
5
89
@deepcohen
Jeremy Cohen
3 years
People learning about NTK are often confused by the following apparent paradox: in the NTK regime, the last-layer feature kernel (and the NTK) do not evolve in the infinite-width limit, yet somehow the network still fits the training dataset. How can the network fit the ....
2
7
87
@deepcohen
Jeremy Cohen
2 years
I'll be at NeurIPS from Tuesday - Saturday. I'd love to chat with anyone interested in neural net training dynamics, neural net optimization theory, neural net initialization, or related topics. Happy to explain "edge of stability" to anyone interested. DM me to schedule!
1
6
79
@deepcohen
Jeremy Cohen
3 years
The theory of optimization for deep learning is still in the luminiferous aether phase. If you work in theory and want to know what gradient descent *actually does* when it trains neural networks, our paper was written for you:
@tomgoldsteincs
Tom Goldstein
3 years
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
28
217
924
3
11
73
@deepcohen
Jeremy Cohen
1 year
It's great to see Dr. Chernoff beating the tail bounds.
@stat110
Joe Blitzstein
1 year
My colleague Herman Chernoff is turning 100 this year! We're having a centennial celebration on May 5 at Harvard: Hope to see a lot of you there!
1
12
74
1
6
76
@deepcohen
Jeremy Cohen
2 years
I’ve seen people justify LLMs along the lines of “the LLM will only be used for editing, not for writing the paper.” IMO, this is beside the point. I don’t care if an LLM came up with the research idea, ran the experiments, and proved the theorems! If someone has figured
@deepcohen
Jeremy Cohen
2 years
I can’t empathize at all with the take that LLM-assisted writing of academic research is bad, much less a form of “plagiarism.” Research isn’t schoolwork, and it’s not a zero-sum game between researchers; it’s a contest between humanity and Nature - all of us are on the same side.
5
31
322
1
7
65
@deepcohen
Jeremy Cohen
3 years
Answers:
Tweet media one
Tweet media two
3
1
62
@deepcohen
Jeremy Cohen
9 months
The first time I saw this, I assumed I either had a bug or was going crazy. But then I kept noticing it. Comparing notes with @alex_damian_ , he said he independently noticed the same thing. So it seems to be a legit phenomenon (though one that is still unexplained).
5
0
64
@deepcohen
Jeremy Cohen
1 year
A lot of peer review angst originates from the unfortunate custom in the ML literature that all ML papers -- even deep learning ones -- are to be written under the absurd pretense that theory and practice co-exist harmoniously today.
@minimario1729
Alex Gu
1 year
@thegautamkamath yeah, i basically said the contents of my tweet in my review. another reviewer disagreed, they said this is neurips and not a math journal so experiments are needed😆 interesting to see the differences, lots of opinions and healthy discussions on this paper all around!
1
0
8
3
5
63
@deepcohen
Jeremy Cohen
2 years
@ethanCaballero I would recommend the recipe "preLN + residual connections": x_{L+1} = MLP(LN(x_L)) + x_L. This recipe is supposed to have magical trainability properties - see section 5 in this paper: . (The theory in this paper is not BS.)
3
3
59
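A minimal PyTorch sketch of the recipe recommended in the tweet above, x_{L+1} = MLP(LN(x_L)) + x_L; the hidden width and the GELU nonlinearity are my own illustrative choices, not part of the recommendation.

```python
import torch
import torch.nn as nn

class PreLNResidualBlock(nn.Module):
    """Pre-LN residual block: x_{L+1} = MLP(LN(x_L)) + x_L."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),                 # illustrative choice of nonlinearity
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, transform, then add the skip connection.
        return self.mlp(self.norm(x)) + x

x = torch.randn(8, 256)
print(PreLNResidualBlock(dim=256, hidden=1024)(x).shape)   # torch.Size([8, 256])
```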
@deepcohen
Jeremy Cohen
2 years
I've noticed that a lot of people have the misconception that infinitely-wide neural nets only train as kernels if the learning rate is tiny. (Maybe b/c the NTK paper only studied gradient flow.) In fact, you only need the learning rate to be less than 1 / [max NTK eigenvalue]
1
1
57
@deepcohen
Jeremy Cohen
2 years
As noted in Remark 4 of this preprint, in neural network optimization, a gradient Lipschitz assumption (L-smoothness) is invalid, while a Hessian Lipschitz assumption might be valid. Why would the first and second derivatives have such different regularity properties? Well …
1
8
54
@deepcohen
Jeremy Cohen
2 years
One reason why the theory/practice dichotomy is unhelpful is that every theorist wants their work to be useful, and, similarly, every applied researcher wants their work to be mathematically principled. Nobody ever says “my career goal is to prove irrelevant theorems” or
3
2
50
@deepcohen
Jeremy Cohen
1 year
banning kernel regression with the kernel k(x, y) = π − cos⁻¹(x, y) because it is equivalent to a neural network with an infinite number of parameters
@norabelrose
Nora Belrose
1 year
I'm opposed to any AI regulation based on absolute capability thresholds, as opposed to indexing to some fraction of state-of-the-art capabilities. The Center for AI Policy is proposing thresholds which already include open source Llama 2 (7B). This is ridiculous.
Tweet media one
56
37
406
0
2
50
@deepcohen
Jeremy Cohen
1 year
If you're doing Taylor expansions on neural nets and need to replace ReLU with a smooth activation, consider the one proposed here: . It's closer to ReLU than are GeLU/ELU/softplus, and it can be used as a drop-in replacement for ReLU on SOTA archs.
1
8
48
@deepcohen
Jeremy Cohen
9 months
I’m giving a talk at 2:40pm at the Heavy Tails workshop on Friday. The talk is about DL optimization dynamics in general, and adaptive gradient methods in particular. I’m also around NeurIPS starting Wednesday, and would love to meet people with common interests - send me a DM!
3
2
49
@deepcohen
Jeremy Cohen
1 year
There's a funny "Where are they now?" post on the sketchy econ forum EJMR in which some economist declares that Noam went on to have a successful life because ... he subsequently scored well in the Putnam competition in college.
Tweet media one
2
1
49
@deepcohen
Jeremy Cohen
2 years
they are not spending their money in order to set up a “who is the best computer scientist” contest. They are looking for results!
1
0
48
@deepcohen
Jeremy Cohen
2 years
@beenwrekt @KameronDHarris I'd look at these: - - - -
2
2
48
@deepcohen
Jeremy Cohen
2 years
out how to do this, my only reaction is that I would like them to repeat this process on the 30 hard open problems in deep learning theory that are blockers for the research I want to do. When the US Congress or Google allocates funding for CS research over other priorities,
3
1
45
@deepcohen
Jeremy Cohen
2 years
I would love if the other people in my field of ML used LLMs to write clearer or otherwise better papers!
1
1
44
@deepcohen
Jeremy Cohen
2 years
“my career goal is to publish unprincipled hacks that get SOTA.”
3
1
41
@deepcohen
Jeremy Cohen
1 year
Someone should also give a talk “the uneasy relationship between deep learning and computer science,” about whether the methodology of computer science is the right methodology for studying deep learning.
@boazbaraktcs
Boaz Barak
1 year
Looking forward to talking in Sapienza University of Rome next week on "The uneasy relation between deep learning and statistics" Spent an unhealthy amount of time on the images for the title slide 😀
Tweet media one
7
33
326
1
1
41
@deepcohen
Jeremy Cohen
1 year
“Backpropagation through time” (It’s autodiff for RNNs)
@moyix
Brendan Dolan-Gavitt
1 year
What are some fancy sounding math terms that turn out to be something incredibly simple? Two I know of: - Hadamard product (element-wise matrix multiplication) - Laplace smoothing ("add one to the numerator and denominator to avoid DIV0")
81
28
508
3
1
42
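To make the "incredibly simple" point concrete: backpropagation through time is just reverse-mode autodiff applied to the unrolled recurrence. A minimal PyTorch sketch (sequence length, sizes, and the squared-norm loss are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNNCell(input_size=4, hidden_size=8)
xs = torch.randn(10, 1, 4)        # sequence of length 10, batch size 1
h = torch.zeros(1, 8)

for x_t in xs:                    # unroll the recurrence over time
    h = rnn(x_t, h)

loss = h.pow(2).sum()
loss.backward()                   # "backpropagation through time": gradients flow
                                  # back through all 10 unrolled steps
print(rnn.weight_hh.grad.shape)   # torch.Size([8, 8])
```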
@deepcohen
Jeremy Cohen
2 years
CMU students Charvi Rastogi and Ivan Stelmakh have been doing interesting research on peer review. Findings include: (1) ICML reviewers were more likely to recommend accepting a paper if it cited them, (2) 36% of ICML reviewers self-reported googling a paper they were reviewing.
@mlcmublog
ML@CMU
2 years
Two experiments in ICML and EC reporting data-driven insights: 1. What are the pros and cons of arXiving your preprint? 2. Do reviewers get biased by citations?
0
15
31
0
4
40
@deepcohen
Jeremy Cohen
1 year
ML systems question: would removing LayerNorm from Transformer speed up wall clock time for training/inference substantially? According to my blog-post-level understanding of the subject, transformer perf is now bottlenecked on moving memory around the GPU, with LayerNorm a big
@lorenzo_noci
Lorenzo Noci
1 year
How do you scale Transformers to infinite depth while ensuring numerical stability? In fact, LayerNorm is not enough. But *shaping* the attention mechanism works! w/ @ChuningLi @mufan_li @bobby_he @THofmann2017 @cjmaddison @roydanroy
Tweet media one
6
34
217
5
6
42
@deepcohen
Jeremy Cohen
5 years
I’m excited for @CMU to go through the mandatory ritual of waiting until one person tests positive, and *then* canceling everything. Who knows — maybe the thing that happened everywhere else won’t happen here!
2
2
39
@deepcohen
Jeremy Cohen
1 year
We absolutely cannot let the NVIDIA recruiters hear about this guy. "Minimum job requirements for new grads: eight (8) NeurIPS papers, three (3) seminal contributions to field of statistics."
@proneat
Praneeth Vepakomma
1 year
When ~24, C. R. Rao did this even before he got his PhD :) He had already published the Cramér–Rao bound, the Rao metric and Rao–Blackwellisation, which are widely studied across the books and used even today. A motivation for generations: PhD (1948), some results (1945)!
2
16
95
1
1
41
@deepcohen
Jeremy Cohen
2 years
The limits of theory: even the brilliant mathematician John von Neumann couldn't believe that the exponential-time simplex method would be a stellar algorithm for solving linear programs.
@CompSciFact
Computer Science
2 years
"[John von Nuemann] always contended with Dantzig that the simplex method would take an absurdly long amount of time to solve linear programming problems. It appears to be, oh, so far as I know, the one place where Johnny went very badly wrong." -- Philip Wolfe
3
8
60
1
0
40
@deepcohen
Jeremy Cohen
1 year
CMU has breadth requirements, so I know what PAC learning is, but I guess I must have missed the lecture where it was explained how this allegedly foundational framework accounts for the learning of representations, which is the basis of deep learning.
2
1
38
@deepcohen
Jeremy Cohen
2 years
reality, then you’re fine. Work is valuable so long as it’s on the Pareto frontier of rigor and reality.
2
1
37
@deepcohen
Jeremy Cohen
1 year
There is potentially room for mathematics to have real practical impact in the area of optimization for deep learning! The first step is replacing the not-true descent lemma with a correct mathematical theory for the local dynamics of gradient descent (and then SGD).
@gabrielpeyre
Gabriel Peyré
1 year
I will argue (with examples) that advanced maths is not super useful for deep learning, but deep learning leads to fun problems useful to advance mathematics. And mastering (arguably not very advanced) maths is definitely useful to get some insight and advance deep learning!
7
8
115
2
5
38
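For context, the "descent lemma" referred to above is the standard statement (my paraphrase, not from the tweet): if the gradient of f is L-Lipschitz, one gradient step with step size η satisfies

```latex
\[
  f\bigl(x - \eta \nabla f(x)\bigr)
    \;\le\; f(x) \;-\; \eta\left(1 - \tfrac{L\eta}{2}\right)\lVert \nabla f(x) \rVert^{2},
\]
```

which guarantees monotone decrease whenever η < 2/L. The earlier tweet about Remark 4 makes the related point that the L-smoothness hypothesis behind this bound is invalid for neural network training.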
@deepcohen
Jeremy Cohen
1 year
Weijie's use of quotation marks brings to mind an important point: in the sciences, the word "theory" just means understanding the underlying principles (e.g. the periodic table is a 'theory' of chemistry). In ML, the word "theory" is a pre-existing
@weijie444
Weijie Su
1 year
Good news while attending #ICML2023 : Our deep learning "theory" paper *A Law of Data Separation in Deep Learning* got accepted by PNAS! w/ amazing @hangfeng_he
7
9
124
1
1
37
@deepcohen
Jeremy Cohen
2 years
Everyone knows that there is no correlation between the rates given by these theorems, and the practical performance of deep learning optimization algorithms - thus nullifying the original motivation for these analyses, as narrated by Yurii Nesterov himself.
2
0
38
@deepcohen
Jeremy Cohen
11 months
A thorough study on transformer optimization issues! Note that the optimization failure mode studied in this paper (where training at a too-large LR causes the weights to grow and the loss to trend upward) seems to be conceptually distinct from EOS oscillations.
@Mitchnw
Mitchell Wortsman
11 months
Sharing some highlights from our work on small-scale proxies for large-scale Transformer training instabilities: With fantastic collaborators @peterjliu , @Locchiu , @_katieeverett , many others (see final tweet!), @hoonkp , @jmgilmer , @skornblith ! (1/15)
Tweet media one
5
62
346
2
2
37
@deepcohen
Jeremy Cohen
1 year
Do you know what 3 words almost never appear in ML papers? "We don't know." That's weird, because the last time I checked, we know very little about deep learning! Here's to hoping that future ML papers include less BS and more "we don't know"s 🥂
1
1
33
@deepcohen
Jeremy Cohen
4 years
PyTorch now has high-level APIs for a bunch of fancy autodiff stuff (Jacobian, JVP, VJP, Hessian, HVP, VHP):
@PyTorch
PyTorch
4 years
v1.5: autograd API for Hessians/Jacobians, C++ frontend stable and 100% parity with Python, Better performance on GPU and CPU with Tensor Format ‘channels last’, distributed.rpc stable, Custom C++ class binding Release notes: Blog:
4
220
730
1
5
31
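A quick sketch of the `torch.autograd.functional` API that the quoted release announces; the cubic test function is arbitrary.

```python
import torch
import torch.autograd.functional as AF

def f(x):
    return (x ** 3).sum()          # scalar-valued test function

x = torch.randn(5)
v = torch.randn(5)

J = AF.jacobian(f, x)              # for scalar f this is just the gradient, shape (5,)
H = AF.hessian(f, x)               # full Hessian, shape (5, 5); here diag(6 * x)
_, hvp = AF.hvp(f, x, v)           # Hessian-vector product without materializing H
_, vjp = AF.vjp(f, x, torch.tensor(1.0))   # vector-Jacobian product

print(torch.allclose(H @ v, hvp))  # True (up to numerical precision)
```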
@deepcohen
Jeremy Cohen
2 years
+1 to this. The goal of the conference publication system should be to advance the field rapidly. When you’re dealing with legitimately hard problems, no one person or team is going to have all the answers — instead, many players are each going to contribute a small nugget.
@PreetumNakkiran
Preetum Nakkiran
2 years
these days as a {neurips/icml/iclr} reviewer, I tend to look for “one reason to accept” (eg one good idea/result) vs “reasons to reject”. most ML conf papers are flawed, but having one good idea is enough for the paper to be valuable
8
14
270
1
3
32
@deepcohen
Jeremy Cohen
2 years
more principled and equally effective, that’s bad; but if all the more principled algorithms perform worse, you’re fine. For “explanation” papers: if someone else has a more rigorous way to explain reality, that’s bad, but if all more rigorous explanations lose touch with
1
1
30
@deepcohen
Jeremy Cohen
11 months
@maksym_andr Unironically, probably GPT
Tweet media one
5
1
31
@deepcohen
Jeremy Cohen
2 years
Interesting TMLR paper studying the distribution of pre-activations of large depth, finite width resnets. I have to say that I disagree with this reviewer comment: “the authors should probably spend at least a little effort in demonstrating the possible usefulness of the theory,
@hayou_soufiane
Soufiane Hayou
2 years
My (1st long solo) paper has just been accepted at TMLR. I study the dynamics of the neurons in large (infinite) depth neural networks with skip connections. The dynamics resemble an SDE that is mainly controlled by the activation function. Link:
5
26
245
1
3
32
@deepcohen
Jeremy Cohen
3 years
“Hutch++” is designed for matrices where the largest few eigenvalues account for most of the trace. My matrix is like that, and using this method turned stochastic trace estimation from “not working” to “working.” The main idea of Hutch++ is this:
@deepcohen
Jeremy Cohen
3 years
PSA: there's now a substantially improved version of the Hutchinson trace estimator: .
2
11
64
1
0
30
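A rough NumPy sketch of the Hutch++ idea as I understand it (the function names and the three-way split of the matrix-vector budget are illustrative choices, not the paper's exact algorithm): project out a low-rank piece of A that captures the dominant eigenvalues, take its trace exactly, and run plain Hutchinson only on the deflated remainder, whose trace is small.

```python
import numpy as np

def hutchinson(matvec, n, m, rng):
    """Plain Hutchinson estimator: average g' A g over m random sign vectors."""
    return sum(g @ matvec(g) for g in rng.choice([-1.0, 1.0], size=(m, n))) / m

def hutch_pp(matvec, n, m, rng):
    """Hutch++-style estimator using roughly m matrix-vector products."""
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))
    # Q spans a low-rank approximation of A's dominant range.
    Q, _ = np.linalg.qr(np.column_stack([matvec(S[:, i]) for i in range(k)]))
    # Exact trace of the projected part: tr(Q' A Q).
    t_low = sum(Q[:, i] @ matvec(Q[:, i]) for i in range(Q.shape[1]))
    # Plain Hutchinson on the deflated remainder (I - QQ') A (I - QQ').
    t_rest = 0.0
    for g in rng.choice([-1.0, 1.0], size=(k, n)):
        g = g - Q @ (Q.T @ g)
        t_rest += g @ matvec(g)
    return t_low + t_rest / k

rng = np.random.default_rng(0)
A = np.diag(np.r_[100.0, 50.0, np.ones(98)])   # trace dominated by two eigenvalues
mv = lambda v: A @ v
print(np.trace(A), hutchinson(mv, 100, 30, rng), hutch_pp(mv, 100, 30, rng))
```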
@deepcohen
Jeremy Cohen
1 year
Bloomberg is hiring for an LLM training researcher, ideally Toronto-based (h/t @drosen ):
0
4
30
@deepcohen
Jeremy Cohen
2 years
Sanjeev Arora talks about this issue at minute 18 here: . This particular issue is high-profile and is already starting to get fixed, but a larger point remains: there is no evidence that convergence rates are a useful lens for
4
1
29
@deepcohen
Jeremy Cohen
4 years
19/ Over the next few weeks, I'll be doing a few other Twitter threads highlighting various implications of this paper!
2
0
26
@deepcohen
Jeremy Cohen
5 years
May we all age as gracefully as Leonid Pastur, eponym of the famous Marchenko-Pastur theorem (1967), who at age 82 is apparently now writing deep learning papers:
0
5
26
@deepcohen
Jeremy Cohen
2 years
I’ve been asked “when does this process end? Next year, will we hear that understanding neural net optimization requires fourth and fifth order expansions?” I believe the answer is no: I think third order will be both necessary and sufficient. The only reason why higher
@typedfemale
typedfemale
2 years
@soumithchintala only first and second-order taylor expansions should be legal
1
0
14
5
3
26
@deepcohen
Jeremy Cohen
2 years
In my particular case, I decided to work on connecting theory and practice in deep learning, because I judged that I was more interested in theory than most applied people, and more willing to "get my hands dirty" with experimentation than the theorists.
2
0
26
@deepcohen
Jeremy Cohen
2 years
Furthermore, it has recently become clear that these analyses are all wrong, and not in a minor way, but rather in a profound way: they are *causally backwards* on the most important question in deep learning optimization, which is how to set the learning rate.
1
2
26
@deepcohen
Jeremy Cohen
4 years
This paper has a really clear discussion of the relationship between several curvature matrices that commonly pop up in deep learning -- the Gauss-Newton, the Fisher, and the "empirical Fisher." Worth a read if you've ever been confused about these.
1
2
24
@deepcohen
Jeremy Cohen
4 years
"Smoothness" is maybe not an ideal name for the Lipschitz constant of the gradient, considering that a function with higher "smoothness" is less smooth.
1
2
25
@deepcohen
Jeremy Cohen
2 years
operating within a system that has decided, for no reason other than inertia, that the best type of contribution is to prove a theorem that looks like this:
Tweet media one
1
0
25
@deepcohen
Jeremy Cohen
2 years
In the near term, we don't need more cargo cult convergence rates, we need a real understanding of what the existing training recipes are doing. We need to do this methodically, one step at a time, making sure that simple things like gradient descent are well-understood
2
3
25
@deepcohen
Jeremy Cohen
6 years
the way you know Elon Musk is serious about AI is that he uses stochastic gradient descent to run his company.
@TeslaAgnostic
Realist
6 years
This is an actual quote.
Tweet media one
44
59
244
1
2
23
@deepcohen
Jeremy Cohen
4 years
2/ The teaser video above conveys the main idea: gradient descent with step size η typically operates in a regime (the "Edge of Stability") in which the maximum Hessian eigenvalue hovers just above the numerical value 2/η. This thread provides more detail.
2
0
23
@deepcohen
Jeremy Cohen
11 months
the M. Night Shyamalan approach to research
@roydanroy
Dan Roy
11 months
No no no no no no no no no. Thankfully, this advice was ignored by the authors. But this widespread but unspoken belief is why NeurIPS/ICML/ICLR reviewing for empirical papers is totally broken.
Tweet media one
26
26
379
0
0
25
@deepcohen
Jeremy Cohen
4 years
Looking for a way to pass the time while being responsibly locked in your apartment? ( #flattenthecurve ). Take a page from this guy in China, and teach your pet some convex optimization: . Coronavirus will eventually pass; gradient descent is forever.
1
3
23
@deepcohen
Jeremy Cohen
2 years
convergence rates. Today, ten years into the deep learning revolution, there is a dire need for legitimate optimization theory: the evolutionary process of "grad student descent" has yielded training recipes that work, but these recipes are brittle - depending in arcane ways on
1
0
23
@deepcohen
Jeremy Cohen
2 years
If George Dantzig submitted the simplex algorithm to NeurIPS today, there's a good chance the response would be: "strong empirical results, but lacks theoretical justification. I encourage the authors to fix these issues and re-submit to the next conference."
1
0
22
@deepcohen
Jeremy Cohen
2 years
The "edge" could be any of: - nobody else has thought to work on this problem - it's a math problem and you're a math genius - you have an uncommon combination of interests/skills - you're going to take a different approach than others
2
0
22
@deepcohen
Jeremy Cohen
5 years
Various companies and governments have invested bajillions of dollars in ML, yet a large fraction of reported deep learning results rely on that one random guy's github repo that comes up when you google "pytorch cifar." I hope he implemented DenseNet correctly!
1
0
21
@deepcohen
Jeremy Cohen
2 years
Basically, I agree with Yann:
@ylecun
Yann LeCun
2 years
This thread exposes a basic misunderstanding. Some believe that science evaluation comes down to evaluating scientists' intrinsic abilities, skills, or merit. For them, using AI tools is like "cheating". But science must solely evaluate *impact*. It's not a beauty contest.
23
39
385
1
0
20
@deepcohen
Jeremy Cohen
2 years
Nice! This paper identifies, and rigorously analyzes, a setting in which gradient flow fits a “sharp” (high curvature) solution that generalizes badly, yet gradient descent with a non-infinitesimal LR is forced to fit a flat solution that generalizes better.
@SebastienBubeck
Sebastien Bubeck
2 years
Why do neural networks generalize? IMO we still have no (good) idea. Recent emerging hypothesis: NN learning dynamics discovers *general-purpose circuits* (e.g., induction head in transformers). In we take a first step to prove this hypothesis. 1/8
9
45
316
0
2
22
@deepcohen
Jeremy Cohen
11 months
I’m giving a talk at INFORMS tomorrow morning around 10:15am at session WB 33 in CC-North 221C. The topic is optimization dynamics of deep learning. I’m around for the rest of the day and would love to chat with classical optimization people about deep learning - DMs are open.
0
0
22
@deepcohen
Jeremy Cohen
4 years
5/ More generally, on a multidimensional quadratic f(x) = ½ x' A x + b'x + c, if the curvature matrix "A" has any eigenvalue greater than 2/η, then gradient descent spins out of control along the corresponding eigenvector.
1
2
21
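A tiny NumPy illustration of the claim in the tweet above (the numbers are arbitrary): with step size η = 1 the stability threshold is 2/η = 2, so the coordinate with curvature 2.5 blows up while the coordinate with curvature 1.0 converges.

```python
import numpy as np

eta = 1.0                      # step size; stability threshold is 2 / eta = 2.0
A = np.diag([1.0, 2.5])        # curvature eigenvalues; 2.5 exceeds the threshold
x = np.array([1.0, 1.0])

for _ in range(10):
    x = x - eta * (A @ x)      # gradient descent on f(x) = 0.5 * x' A x  (b = 0, c = 0)

print(x)                       # [0.0, ~57.7]: the stable eigendirection converged,
                               # the unstable one grew by |1 - eta * 2.5| each step
```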
@deepcohen
Jeremy Cohen
3 years
Want to chat about this work? Come visit the ICLR poster session today at 5pm PST!
@deepcohen
Jeremy Cohen
4 years
Our new ICLR paper demonstrates that when neural networks are trained using *full-batch* gradient descent: (1) the training dynamics obey a surprisingly simple "story", and (2) this story contradicts a lot of conventional wisdom in optimization.
10
109
679
0
3
21
@deepcohen
Jeremy Cohen
6 years
Can't wait to engage in that classic Jewish Christmas tradition: hyperparameter sweep using all your lab's GPUs.
0
1
21
@deepcohen
Jeremy Cohen
4 years
18/ Overall, we hope that our paper will both (1) nudge the neural net optimization community away from widespread presumptions that appear to be false, and also (2) point the way forward by identifying precise empirical phenomena that are suitable for further study.
1
0
19
@deepcohen
Jeremy Cohen
1 year
@dmdohan @davidchalmers42 @mhutter42 @ShaneLegg @NoamShazeer @ilyasu Ilya Sutskever had an ICML 2011 paper on optimizing RNNs where he wrote the following:
Tweet media one
3
4
20
@deepcohen
Jeremy Cohen
2 years
if enough eyes get on the problem) but rather the fact that nobody works on it, because the talented people who should be working on this problem are instead wasting their time pretending to prove convergence rates.
2
0
21
@deepcohen
Jeremy Cohen
2 years
Permanent research positions have, historically, been scarce. If you want to stay in research, you should aim to be world-class at whatever it is that you do. The secret to success is to *pick that thing based on* the hand of cards you've already been dealt.
1
0
19
@deepcohen
Jeremy Cohen
2 years
Notably, I avoid working on high-profile open problems that are basically math problems. The principle is: don't waste your time trying to do something that someone else can do better!
0
0
20
@deepcohen
Jeremy Cohen
1 year
intellectual dishonesty, as papers with good algorithms are pressured to include phony theoretical justifications, and theory papers are pressured to include phony experimental support.
1
0
20
@deepcohen
Jeremy Cohen
1 year
Another interesting finding is that residual connections helped with not only optimization (the problem they are usually said to solve) but generalization too. I believe James Martens & collaborators also observed here that removing residual connections
@maksym_andr
Maksym Andriushchenko
1 year
Interesting work about scaling up *plain* MLPs (yes, even without patch extraction). You can get as far as 93.6% accuracy on CIFAR-10 with pre-training on ImageNet-21k. Impressive to see how far one can push data+compute even for such naive architectures!
Tweet media one
1
9
53
2
2
20
@deepcohen
Jeremy Cohen
4 years
I highly recommend this linear algebra book, especially for people who were not math majors. It presents everything from the POV of abstract vector spaces (rather than matrices), which, for me at least, was a great way to gain intuition.
3
2
17