(0/17) Grab your🍿 for a thread on some mysteries and explanations connecting flat minima, second order optimization, weight noise, gradient norm penalty, and activation functions😱 There is also a video presentation if you prefer:
Today
@GoogleAI
officially launched the Google Research YouTube channel 🚀 On the channel, we have a range of content including shows such as (1) Meet a Researcher and (2) ResearchBytes, along with (3) Spotlights.
1/5 Self-Distillation loop (feeding predictions as new target values & retraining) improves test accuracy. But why? We show it induces a regularization that progressively limits # of basis functions used to represent the solution. w/
@farajtabar
P.Bartlett
My PhD adviser Yi Ma and academic brother John Wright just put their new & free (pre-production) book online "High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications" . Organized, easy read, w/ lots of visuals.
1/5 In July 2016, Jitendra Malik gave an inspiring talk at Google, with a slide showing a block diagram of the visual pathway in a primate. He said "there are a lot of feedback loops as you see". He then stressed that, in contrast, the current deep neural architectures are mainly feedforward.
1/10 Heard about implicit regularization of SGD (i.e. bias toward certain solutions not explicitly stated in the objective function) and wondered why it happens? This thread provides some introductory analysis on why SGD prefers solutions with small norm.
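A minimal demonstration of that bias (my own toy sketch, not from the thread): gradient descent on an under-determined least-squares problem, initialized at zero, converges to the minimum-norm interpolating solution.

```python
import numpy as np

# Under-determined system: 5 equations, 20 unknowns -> infinitely many solutions.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))
b = rng.normal(size=5)

w = np.zeros(20)  # starting at zero keeps iterates in the row space of A
for _ in range(20000):
    w -= 0.01 * A.T @ (A @ w - b)  # gradient of 0.5*||A w - b||^2

w_min_norm = np.linalg.pinv(A) @ b  # explicit minimum-norm solution
print(np.linalg.norm(A @ w - b))        # ~0: w interpolates the data
print(np.linalg.norm(w - w_min_norm))   # ~0: and it is the min-norm interpolant
```

Among the infinitely many zero-loss solutions, nothing in the objective asks for small norm; the bias comes entirely from the dynamics (zero initialization plus gradient updates staying in the row space of A).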
One of the most touching thesis dedications: "To all the students who had to discontinue their PhD because of toxic work environments, and to all the kind and humble researchers who are striving to make academia a better place." from
@_vaishnavh
's PhD thesis
@SCSatCMU
Aug. 2021.
Are neural networks learning or memorizing? It must be learning, otherwise how do they generalize so well? But hey, maybe memorization is not against generalization, and maybe it is even necessary when there's little training data for some classes. by
@vitalyFM
& Zhang.
One of the most comprehensive studies of generalization to date; ≈40 complexity measures over ≈10K deep models. Surprising observations worthy of further investigation. Fantastic Generalization Measures: w/
@yidingjiang
@bneyshabur
@dilipkay
S. Bengio
In case you missed the news:
@GoogleAI
's Student Researcher Program (i.e. internship, read below though) for 2024 is now live and you can apply here:
Note: Intern is the same as Student Researcher. The only difference is that the former relates to
1/11 Earlier this year, I promised to write an introductory thread on calculus of variations and its uses in machine learning. The time has arrived! I am not going to give a rigorous treatment, but a minimal intuition here. In one line: it is a formalism for seeking optimal functions.
1/3 DEMOGEN is a dataset of 756 CNN/ResNet-32 models trained on CIFAR-10/100 w/ various regularization and hyperparameters, leading to a wide range of generalization behaviors. Hope the dataset can help the community w/ exploring generalization in
#deeplearning
To my fellow
#NeurIPS
area chairs, author response is planned to open tomorrow. A LOT of effort goes into writing these responses even though they are short. Please DO read them CAREFULLY and engage in discussion with reviewers. Do not let R2 determine the fate of a good paper.
1/10 Gaussian forms are prevalent in machine learning. Familiar: normal distribution (of stats), positive definite kernels (of RKHS in SVMs), etc. But I like to highlight it from optimization viewpoint, in light of recent generalization gains by flat minima (e.g. SAM optimizer).
Applications for the
@GoogleAI
PhD Fellowship Program are now open. The program will be accepting student applications through May 8. It supports graduate students doing innovative research in computer science and related fields as they pursue their PhD and also connects them to
Thank you Shafi Goldwasser (Turing Award Winner) for your strong statement in support of the Iranian research community in computer science. It means a lot to our community in general, and to young students in particular, who look up to Turing Award Winners as their role models.
Among 20
#NeurIPS2020
submissions I am serving as the AC for, one review has been submitted, and it is quite detailed and high quality. Given that reviews are due by July 24th, I am impressed by the timing and quality of this review 😲🙏 Hope this post motivates other reviewers.
Are you a strong PhD student interested in doing cutting edge research at
@GoogleAI
? I have an opening for student researcher position to explore open problems and extensions of Sharpness-Aware Minimization (SAM) w/
@bneyshabur
. Please refer to .
Some Papers: remove local minima of neural networks by adding a single unit. Turns Out: similar trick works for any loss function and without auxiliary units. The Catch: these tricks all move local minima to infinity. Nice note by
@jaschasd
Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit
It's less than one page. It may be deep. It may be trivial. It will definitely help you understand how some claims in recent theory papers could possibly be true.
Upcoming Open Workshop: The Analytical Foundations of Deep Learning: Interpretability and Performance Guarantees (Oct. 19-21&23) open to everyone to attend. Covering tutorials, presentations, and brainstorming.
1/6 We propose a “unifying framework” for analyzing implicit bias of neural networks, investigate gradient flow on “linear tensor networks,” and extend existing theoretical results while relaxing some of their convergence assumptions.
1/2 Intriguing observation [leading to a concrete research problem] by DALL·E 2 team: "the rank [of covariance matrix] of the CLIP representation space is drastically reduced when training CLIP with SAM while slightly improving evaluation metrics" 🧐
To honor the memory of those lost in the crash of Ukraine Int. Airlines flight
#PS752
in Tehran, University of Toronto has established an Iranian Student Memorial Scholarship to support Iranian students or students from any background studying Iranian studies. Thank you
@UofT
!
Dear friends
@CS_UCLA
: I will be giving a talk at the CS 201 seminar next week on the implicit regularization of 🧠⚗️ (aka knowledge distillation). Let's grab a ☕ at our 🏠's and discuss research in this virtual meeting.
(cc:
@zacharylipton
emojis.)
🔥New state-of-the-art performance on ImageNet recorded by our "Sharpness-Aware Minimization" work; a very simple yet effective optimization method. Paper Performance . w/
@Foret_p
@bneyshabur
and Ariel Kleiner.
Fine-tuning the EfficientNet l2-475 from the noisy student paper with SAM increases the ImageNet accuracy from 88.2% to 88.6%, giving a new state of the art!
With Ariel Kleiner,
@TheGradient
, and
@bneyshabur
From
#NeurIPS2020
program chairs: After screening for various issues, there are approximately 9300 papers ready for the next phase of the review process.
3/5 Earlier today a new SOTA on ImageNet was reported by
@quocleix
and his team (90.2% on top-1 accuracy). Their approach relies on a feedback loop between a "pair of architectures" (teacher and student). See their tweet here:
Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-)
This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2.
More details here:
5/5 Today, a few years after Jitendra's prediction, we witness that SOTA on ImageNet relies on rich and broad-range feedback loops. I think this is just the beginning, and we will see (and hopefully also understand) more about complex feedback loops in machine learning models.
And so
#NeurIPS2021
AC nightmare time begins. With reviews due this Friday, I have 80% of reviews still missing despite sending reminders 😱 Wondering what the situation is with fellow ACs. Is around 80% missing for you too?
We are organizing ICML18 Workshop on "Nonconvex Optimization for ML", covering a broad range of exciting topics in this area. Submission deadline: May 22 2018 (23:59PDT).
Junior faculty... Applications for
@GoogleAI
's Scholar Program will be open in 10 days (Nov 3rd). It provides unrestricted gifts to support research conducted by early-career professors (who received their PhD within the past 7 years). Please spread the word
The 2021 Google PhD Fellowship cycle in North America and Europe will open on September 1st, 2020 instead of November 1st. PhD students must be nominated by their university by September 30, 2020.
The common pattern I find among many successful people is that they define the opportunity rather than letting opportunity define them. Here is an interesting example from
@bneyshabur
's journey to graduate school.
Totally agree!
Anyone screening applications and any applicant thinking their CV is not representative of their skills/potential, I think you might want to read the story of my own PhD application in this thread:
1/
Applications for the 2023
@Google
PhD Fellowship awards in North America and Europe will officially open as of today Sep 1 through Oct 6. Apply here and find FAQ (for complete eligibility and application information) here
Classic learning theory (figure) finds overfitting bad, but observations from deep learning contradict it! How can overfitting be benign? Answered in a recent overview/analysis by
@Andrea__M
, P. Bartlett and S. Rakhlin. (analyzes linear models, but may relate to deep networks via NTK)
Great news for transfer learning researchers. AFAIK, there hasn't been any solid benchmark for transfer across classes with a natural hierarchical structure (adapt a model trained on a leaf class to another leaf class related via a common ancestor). A very much needed one has arrived.
Can our models classify Dalmatians as "dogs" even if they are only trained on Poodles? We release a suite of benchmarks for testing model generalization to unseen data subpopulations. w/
@tsiprasd
@ShibaniSan
Blog (+data):
Paper:
Have you ever wanted to play with self-distillation and better understand its regularization effect, but didn't have time to write code yourself? Today is your day.
@KennethBorup
has shared a neat and minimal code that might be a great starting point.
A bit of story. Once upon a time, I was a researcher and Phillip was a PhD student, in the same lab at MIT. Every student in the lab was incredible in some way. Phillip's way: wanted to and could go very very deep into seemingly simple research questions. His blog must be a gem.
I give a lot of talks but often only a few people see them. I’m going to try a new experiment where I write blog versions of some of the talks I give.
Here’s the first one:
A short, general-audience intro to "Generative Models of Images" -->
Springer has introduced a COVID-19 quarantine package offering about 500 textbooks *** free *** to download. Of particular interest to me (and perhaps you) are the engineering, math, and computer science books among them.
#NeurIPS2020
attendees: I will be 🤩 to see you Wed Dec 9th, 9:00am-11:00am (PST) to have a fun discussion on pseudo-labeling by self-distillation and its implicit regularization effects. Find us here👉 . With
@farajtabar
and Peter Bartlett at
@NeurIPSConf
.
Excited to be chairing an oral session with all ace presentations (including 3 best papers) at
@iclr_conf
. Looking forward to seeing you and hearing your questions in the
#ICLR2022
oral session 2 on Understanding Deep Learning [Apr 26 8am GMT (1am PST)].
At
#ICLR2021
spotlights next week, we will present Sharpness-Aware Minimization (SAM): a PRINCIPLED optimization method, that is EASY to implement, and achieves SOTA on several standard benchmarks (SOTA ref ). with
@Foret_p
Ariel Kleiner
@bneyshabur
.
Sharpness-Aware Minimization for Efficiently Improving Generalization (Spotlight at
#ICLR2021
)
with
@Foret_p
, Ariel Kleiner and
@TheGradient
Paper:
Code:
Video and Poster:
3/7
After 12 years of being on Twitter, this is my first non-technical post. Congratulations to Shervin Hajipour, Iranian Women, and all the brave people fighting for Freedom of Iran. This song brings me to tears every time I listen to it.
@bneyshabur
I see your point, but building and understanding are distinct goals, and which one to pursue is a matter of personal taste. Engineering may come up with cool working ideas, but not necessarily an explanation for why and how they work. For the latter, you need to build on foundations.
4/5 Such feedback loops across neural networks, as we see here or in self-distillation methods (figure below), are new and operate at a much larger scale than the micro feedbacks in RNNs/LSTMs. They have been shown to be a new and promising way to improve generalization.
🔥🔥Sharpness Aware Minimization (SAM) is now available in Keras🔥🔥
Would be great if people try it out and provide feedback to me or the Keras team about any potential issues. Big thanks to
@fchollet
and his team for making this happen.
#NeurIPS
program chairs take reviewers' performance into account: The highest-scoring reviewers will be awarded free NeurIPS registrations and will be highlighted on our website. The lowest-scoring reviewers may instead not be invited to review for future conferences.
It's said that deep learning is data hungry, but does it really use every piece of training data in common datasets? Turns out you can discard 10% of ImageNet training set without any drop in test accuracy.
Excited to see Sharpness-Aware Minimization (SAM optimizer) we have proposed recently (w/
@Foret_p
@bneyshabur
and Kleiner) is becoming a persistent component in recent state-of-the-art records 😇
Drawing Multiple Augmentation Samples Per Image
During Training Efficiently Decreases Test Error
pdf:
abs:
ImageNet SOTA of 86.8% top-1 accuracy after just 34 epochs of training with an NFNet-F5 using the SAM optimizer
@vipul_1011
Sorry to hear that! Here is my “personal” perspective on the matter (the last step in the intern selection process) as a person who regularly interviews intern candidates. There is a huge pile of great candidates and a limited number of openings. One thing that intern hosts hope
@GoogleAI
just opened applications for the Award for Inclusion Research program that supports academic research in computing and technology addressing the needs of historically marginalized groups and creates positive societal impact. Apply by July 13 at
For quite some time (NeurIPS18, ICLR19), we have empirically observed that margin at intermediate layers carries significant information about the generalization of a deep model. Delighted to see
@tengyu
has now proved this phenomenon, and provided a cleaner definition of all-layer margin.
My former PhD adviser
@YiMaTweets
joined twitter today. Looking forward to tweets about his (unique and rigorous style of) research on ML. Welcome to twitter Yi!
I see a rapidly growing success from "self-training" and "self-distillation" type methods recently. There is a lot of opportunity there for theoretical understanding and explanations with huge practical impact as these methods are now at the core to some SOTA models.
We researchers love pre-training. Our new paper shows that pre-training is unhelpful when we have a lot of labeled data. In contrast, self-training works well even when we have a lot of labeled data. SOTA on PASCAL segmentation & COCO detection.
Link:
Understanding and predicting generalization performance in deep neural networks with emphasis on "robustness" of predictions. Valuable findings on success and failure of different complexity measures when analyzed via the lens of robust predictions; with both good and bad news.
Can generalization gap in deep networks be "accurately" predicted from trained_weights plus training_data? New evidence says "Yes" via normalized margin distributions at all layers.
Summary of Reviewing Policy for
#NeurIPS
2021: (1) Review in OpenReview. (2) Rejected submissions won't be made public unless authors want so. (3) No desk rejection. (4) Rolling discussion between authors and reviewers after initial reviews and author responses are submitted.
🥇For the first time, there is now a competition for predicting generalization performance of deep learning
@NeurIPSConf
#NeurIPS2020
. With
@PGDL_NeurIPS
. Do not miss the opportunity 😀
Do you have a theory or a hypothesis that can predict generalization performance in deep learning? Now, for the first time, there is a competition for it & you can validate your idea
@NeurIPSConf
#NeurIPS2020
. We hope this will instigate progress in understanding generalization.
@farajtabar
2/5 Knowledge distillation by
@geoffreyhinton
@OriolVinyalsML
@JeffDean
was originally motivated by transferring knowledge from large to smaller networks. Self-distillation is the special case with identical architectures; the model's predictions are fed back to itself as new target values.
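The loop described above can be sketched in a few lines (my own illustrative NumPy version, not the paper's code; the RBF kernel, its bandwidth, and the ridge parameter are assumptions for a toy setting):

```python
import numpy as np

# Self-distillation on kernel ridge regression: fit the model, replace the
# labels with its own predictions, and fit again from scratch.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=40)  # noisy targets

K = np.exp(-((X - X.T) ** 2) / 0.1)  # RBF Gram matrix (bandwidth is arbitrary)
lam = 1e-2                           # ridge regularization strength

targets, norms = y.copy(), []
for _ in range(5):                   # five rounds of self-distillation
    alpha = np.linalg.solve(K + lam * np.eye(40), targets)
    preds = K @ alpha                # fitted values of this round
    norms.append(np.linalg.norm(preds))
    targets = preds                  # predictions become the next round's labels
print(norms)  # the solution shrinks round after round
```

Each round multiplies every spectral component of the solution by a factor below one, so the fitted function contracts progressively; this is the "progressively limits # of basis functions" effect in miniature.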
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (), we show this can be done using models’ agreement.
w/
@yidingjiang
, Aditi R.,
@zicokolter
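The core measurement is simple enough to sketch (a toy version of mine, not the paper's code; the class labels below are hypothetical):

```python
import numpy as np

def agreement_rate(preds_a, preds_b):
    """Fraction of unlabeled examples on which two models predict the same class."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

# Hypothetical predicted labels from two independently trained runs
# on the same unlabeled OOD batch (no ground-truth labels needed):
model_1 = [0, 1, 1, 2, 0, 1, 2, 2]
model_2 = [0, 1, 2, 2, 0, 1, 2, 0]
print(agreement_rate(model_1, model_2))  # → 0.75
```

The point of the work is that this agreement rate, computable without any labels, tracks the models' (unknown) accuracy on the OOD distribution.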
Applications are open for the
@Google
Carbon Removal Research Awards, a program looking to fund selected academic research efforts to better characterize and accelerate the development of new carbon removal approaches. Apply by April 28! Learn more:
First observed by DALL·E2 of
@OpenAI
, SAM optimizer tends to produce lower-rank representations compared to SGD. Yet an explanation for this phenomenon remained missing until today. But wait...
@maksym_andr
just flipped it around! Pleasure to work w/ him,
@dara_bahri
, N. Flammarion.
🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” !
❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM?
(with
@dara_bahri
,
@TheGradient
, N. Flammarion)
🧵1/n
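One common way to quantify such low-rank structure can be sketched as follows (my own illustration, not the authors' code; the 99% variance threshold and the synthetic features are arbitrary choices):

```python
import numpy as np

def effective_rank(features, threshold=0.99):
    """Number of principal directions needed to explain `threshold` of the
    variance of a batch of feature vectors."""
    features = features - features.mean(axis=0)    # center the batch
    s = np.linalg.svd(features, compute_uv=False)  # singular values
    var = s ** 2 / np.sum(s ** 2)                  # variance fractions
    return int(np.searchsorted(np.cumsum(var), threshold) + 1)

# 256 synthetic feature vectors lying (up to tiny noise) in a 3-D subspace of R^64:
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(64, 3)))  # orthonormal 3-D basis
Z = rng.normal(size=(256, 3)) @ basis.T
Z += 1e-6 * rng.normal(size=Z.shape)
print(effective_rank(Z))  # → 3
```

Applying such a measure to penultimate-layer activations of SAM- vs SGD-trained models is the kind of comparison the observation above is about.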
Excited to give an invited talk at
@ipam_ucla
's workshop "Partial Differential Equations & Inverse Problem Methods in Machine Learning" April 2020. And no, I won't talk about heat equation this time (global warming?), but Poisson's equation. Details soon 😎
The end of double-descent era and the beginning of multi-descent era for generalization curves. This work proves multi-descent curves exist. In fact, it tells you how to shape them up in the way you wish [for linear regression]! Very interesting work
@aminkarbasi
🤓
Can the existence of a multiple descent generalization curve be rigorously proven? Can an arbitrary number of descents occur? Can the generalization curve and the locations of descents be designed? We answer yes to all three of these questions.
ICML Workshop on Nonconvex Optimization was a great success thanks to 7 fantastic speakers (including
@HazanPrinceton
@svlevine
), 34 high quality (accepted) papers (online: ) and great audience.
@AnimaAnandkumar
New theory on SAM:
1. Provides Convergence Analysis: highly non-trivial for SAM due to 1/||∇L|| in the update rule (intricately couples coordinates).
2. ∇³L: We know that with two ∇'s, SAM efficiently approximates ∇² (curvature). This work shows SAM implicitly utilizes ∇³ as well!
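The update rule in question can be sketched as follows (a minimal NumPy version of mine; the step size, radius `rho`, and toy quadratic are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.05, rho=0.05):
    """One SAM step: perturb toward the (approximately) worst nearby point,
    then descend using the gradient taken there."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # the 1/||∇L|| factor that couples coordinates
    return w - lr * loss_grad(w + eps)           # sharpness-aware gradient step

# Toy quadratic L(w) = 0.5 * w^T diag(1, 10) w, minimized at the origin:
H = np.diag([1.0, 10.0])
w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, lambda v: H @ v)
print(w)  # ends up in a small neighborhood of the minimum (radius ~ rho)
```

The normalization by ||∇L|| is exactly why the convergence analysis is non-trivial: it makes each coordinate's update depend on the full gradient vector, not just its own partial derivative.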
New paper with Peter Bartlett and
@obousquet
called "The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima": .
Disappointed with fellow AC's decision
#icml2022
. AC instruction: "Only papers where all reviews are in, and all reviews recommend Phase 1 reject, are candidates for Phase 1 rejection." Yet, we got a paper rejected with one of the reviews being accept. cc/
@CsabaSzepesvari
10/10 For more details see , , . Also check papers by
@oberman_adam
. I decided to write this after reading
@docmilanfar
's recent post on bump functions (not Gaussian but related).
Amazing discussion (
@rsalakhu
, Liang, Bartlett, Bengio, Kakade) at NIPS-W "DL: Bridging Theory and Practice". A key direction according to
@prfsanjeevarora
is the connection between logic/reasoning and differentiable models. Well done
@lschmidt3
@maithra_raghu
.
Mikhail Belkin giving a super interesting talk at
#ICML
#ICML19
#ICML2019
workshop on generalization in deep learning on why over-parameterized models that achieve zero training loss can still generalize. 🔑 is regularization; implicit (e.g. SGD) or explicit (e.g. norm penalty).
#NeurIPS2020
has extended the deadline by 48 hours in support of those affected by recent events . The new deadline is Friday June 5, 2020 at 1pm PDT. Thank you
@NeurIPSConf
@YiMaTweets
joining twitter took me down memory lane to my PhD, and to the people to whom I dedicated my dissertation: my wife, child, and parents 🌺 This is a journey that depends on you, your adviser, and the people around you at so many levels.
.
@UMich
friends! I will be visiting on Oct 19th to discuss some mysteries of weight noise, gradient penalties, sharpness regularization, Hessian structure, and activation functions. Let's grab a☕! Work by
@ynd
, A. Agarwala and I. Thanks
@qing
for hosting.
2/2 SAM was motivated by improving generalization in classification tasks. Why training CLIP with SAM yields a representation subspace of lower dimension is an unexplained side effect; a great open problem [for y'all!] with great practical impact. cc/
@HochreiterSepp
@jeankaddour
2/5 Then he said this, which has been stuck in my mind since then: "I think in the next few years we'll see a lot of papers which will show feedback has something significant". Sure, we have RNNs and LSTMs, but those are very limited forms of feedback: short-range and at the micro level.
The growing evidence on the benefits of over-parameterization in optimization (attaining lower loss) and generalization (simpler solutions in some sense) has turned over-parameterization into a free lunch in our minds. But here is a twist!
Over-parametrization helps gradient descent find a global minimum. What about the convergence rate?
Over-parametrization can make convergence EXPONENTIALLY SLOWER: from exp(-T) to a surprising 1/T^3 rate!
Paper:
Joint work w. Weihang Xu
A great initiative! The author plans to blog a series of materials in a Colab environment about kernel methods and investigate their behavior in a hands-on fashion. Seems very neat and accessible. Looking forward to future posts by
@blake__bordelon
!
Ever wondered when/why kernel methods or infinite width neural networks generalize? I wrote a blog post about a recent theory developed with
@CPehlevan
and
@canatar_a
that predicts test risk for any kernel or data distribution
We just released document V1.0 about the
#NeurIPS2020
generalization competition to cover competition description (datasets, metrics, etc.), timeline, rules and protocols.
We just released the first document of the competition on our website . It covers competition description (datasets, metrics, etc.), timeline, rules and protocols. Hurry up... Phase 1 starts on July 15.
Interesting result by Kumar et al.🔥Many deep RL methods estimate value functions by bootstrapping. By connecting this to self-distillation in RKHS, they show the combo of bootstrapping & gradient descent can lead to an "implicit under-parameterization" (low-rank model) phenomenon.
We've been studying why deep RL is so hard, and we think we have another reason: implicit under-parameterization:
Iteratively training on your own targets is a kind of "self-distillation," and leads to loss of rank ->
w/ Aviral Kumar
@agarwl_
@its_dibya
A practical example of how self-distillation blocks overfitting to noise. Image-text pairs from the Internet are usually weakly correlated and noise is abundant. The authors show self-distillation w/ soft labels mitigates this noise. By
@berkeley_ai
(Cheng,
@mejoeyg
) and
@facebookai
.
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation
pdf:
abs:
model achieves strong performance with only 3M image text pairs, 133x smaller than CLIP
Submission ☠ (deadline) for
#ICML2019
workshop "Generalization in Deep Networks" extended to May 23. Contribute to understanding this mysterious and 🎁(surprising) area, and join us at 🏖️ (Long Beach) for ☕ (hearing great invited speakers and accepted posters).
@icmlconf
☟