Hossein Mobahi Profile Banner
Hossein Mobahi Profile
Hossein Mobahi

@TheGradient

5,909
Followers
709
Following
112
Media
1,300
Statuses

Senior Research Scientist @GoogleDeepMind . I ∈ Optimization ∩ Machine Learning. Fan of @IronMaiden 🤘. Here to discuss research 🤓

Mountain View, CA
Joined December 2010
Pinned Tweet
@TheGradient
Hossein Mobahi
8 months
(0/17) Grab your🍿 for a thread on some mysteries and explanations connecting flat minima, second order optimization, weight noise, gradient norm penalty, and activation functions😱 There is also a video presentation if you prefer:
5
74
471
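For context on how two of the items named in this thread connect, the standard first-order relation between the sharpness-aware (worst-case perturbation) objective and a gradient norm penalty is shown below. This is the textbook approximation, not necessarily the thread's exact derivation.

\max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon)
\;\approx\; L(w) \;+\; \rho\,\|\nabla L(w)\|_2,
\qquad
\epsilon^\star \;\approx\; \rho\,\frac{\nabla L(w)}{\|\nabla L(w)\|_2}.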
@TheGradient
Hossein Mobahi
3 years
Log(😅) = 💧Log(😄)
45
3K
17K
@TheGradient
Hossein Mobahi
2 years
Today @GoogleAI officially launched the Google Research YouTube channel 🚀 On the channel, we have a range of content including shows such as (1) Meet a Researcher and (2) ResearchBytes, along with (3) Spotlights.
12
228
1K
@TheGradient
Hossein Mobahi
4 years
1/5 Self-Distillation loop (feeding predictions as new target values & retraining) improves test accuracy. But why? We show it induces a regularization that progressively limits # of basis functions used to represent the solution. w/ @farajtabar P.Bartlett
Tweet media one
15
193
825
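Below is a minimal sketch of the self-distillation loop described in the tweet above, in a kernel ridge regression (Hilbert space) setting. The RBF kernel, toy data, regularization strength, and number of rounds are all illustrative assumptions, not the paper's actual experiments.

import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Pairwise squared distances, then the Gaussian (RBF) kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.normal(size=50)  # noisy targets

K = rbf_kernel(X, X)
lam = 1e-2                  # ridge (RKHS norm) regularization
targets = y.copy()

for round_idx in range(5):
    # Fit kernel ridge regression to the current targets.
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), targets)
    preds = K @ alpha
    print(f"round {round_idx}: train MSE vs. original labels = {np.mean((preds - y) ** 2):.4f}")
    # Self-distillation step: the model's own predictions become the next round's targets.
    targets = preds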
@TheGradient
Hossein Mobahi
4 years
My PhD adviser Yi Ma and academic brother John Wright just put their new & free (pre-production) book online "High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications" . Organized, easy read, w/ lots of visuals.
Tweet media one
7
129
743
@TheGradient
Hossein Mobahi
3 years
1/5 In July 2016, Jitendra Malik gave an inspiring talk at Google, with a slide showing a block diagram of the visual pathway in a primate. He said "there are a lot of feedback loops as you see". He then stressed that, in contrast, current deep neural architectures are mainly feedforward.
Tweet media one
9
84
473
@TheGradient
Hossein Mobahi
3 years
1/10 Heard about implicit regularization of SGD (i.e. bias toward certain solutions not explicitly stated in the objective function) and wondered why it happens? This thread provides some introductory analysis on why SGD prefers solutions with small norm.
Tweet media one
4
75
361
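As a companion to the thread above, here is a minimal illustration of that bias in the simplest setting: on an underdetermined least-squares problem, SGD initialized at zero stays in the row space of the data matrix and therefore converges to the minimum-norm interpolating solution. The toy dimensions, step size, and iteration count are arbitrary choices for the demo.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer examples than parameters (underdetermined)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                      # zero init keeps iterates in the row space of X
lr = 1e-2
for _ in range(20000):
    i = rng.integers(n)              # SGD: one randomly chosen example per step
    w -= lr * (X[i] @ w - y[i]) * X[i]

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # closed-form minimum-norm interpolator
print("||w_sgd - w_min_norm|| =", np.linalg.norm(w - w_min_norm))
print("||w_sgd|| =", np.linalg.norm(w), " ||w_min_norm|| =", np.linalg.norm(w_min_norm))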
@TheGradient
Hossein Mobahi
3 years
If scientists had logos! via @vardi
Tweet media one
1
64
355
@TheGradient
Hossein Mobahi
3 years
One of the most touching thesis dedications: "To all the students who had to discontinue their PhD because of toxic work environments, and to all the kind and humble researchers who are striving to make academia a better place." from @_vaishnavh 's PhD thesis @SCSatCMU Aug. 2021.
6
26
300
@TheGradient
Hossein Mobahi
4 years
Are neural networks learning or memorizing? It must be learning; otherwise, how would they generalize so well? But hey, maybe memorization is not at odds with generalization, and may even be necessary when there's little training data for some classes. by @vitalyFM & Zhang.
Tweet media one
4
46
292
@TheGradient
Hossein Mobahi
5 years
One of the most comprehensive studies of generalization to date; ≈40 complexity measures over ≈10K deep models. Surprising observations worthy of further investigations. Fantastic Generalization Measures: w @yidingjiang @bneyshabur @dilipkay S. Bengio
Tweet media one
3
60
264
@TheGradient
Hossein Mobahi
9 months
In case you missed the news: @GoogleAI 's Student Researcher Program (i.e. internship, read below though) for 2024 is now live and you can apply here: Note: Intern is the same as Student Researcher. The only difference is that the former relates to
6
54
254
@TheGradient
Hossein Mobahi
3 years
1/11 Earlier this year, I promised to write an introductory thread on calculus of variations and its uses in machine learning. The time has arrived! I am not going to give a rigorous treatment, but rather a minimal intuition here. In one line: it is a formalism for seeking optimal functions.
Tweet media one
4
41
240
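For readers who want the single formula behind that one-line summary, the standard textbook setup (not necessarily how the thread develops it) is: minimize a functional of an unknown function, whose stationary points satisfy the Euler-Lagrange equation.

J[f] \;=\; \int_a^b F\big(x,\, f(x),\, f'(x)\big)\, dx,
\qquad
\frac{\partial F}{\partial f} \;-\; \frac{d}{dx}\,\frac{\partial F}{\partial f'} \;=\; 0 .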
@TheGradient
Hossein Mobahi
2 months
After several wonderful years at @GoogleAI , I joined @GoogleDeepMind today. I look forward to continuing my work on foundational ML research with this exceptional team. Big thanks to @hugo_larochelle and @ZoubinGhahrama1 for invaluable support.
23
1
193
@TheGradient
Hossein Mobahi
5 years
1/3 DEMOGEN is a dataset of 756 CNN/ResNet-32 models trained on CIFAR-10/100 w/ various regularization and hyperparameters, leading to a wide range of generalization behaviors. Hope the dataset can help the community explore generalization in #deeplearning
1
40
173
@TheGradient
Hossein Mobahi
4 years
To my fellow #NeurIPS area chairs, author response is planned to open tomorrow. A LOT of effort goes into writing these responses even though they are short. Please DO read them CAREFULLY and engage in discussion with reviewers. Do not let R2 determine the fate of a good paper.
1
9
153
@TheGradient
Hossein Mobahi
3 years
1/10 Gaussian forms are prevalent in machine learning. Familiar: normal distribution (of stats), positive definite kernels (of RKHS in SVMs), etc. But I like to highlight them from an optimization viewpoint, in light of recent generalization gains from flat minima (e.g. the SAM optimizer).
4
40
145
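One standard way the Gaussian form enters from the optimization viewpoint (a textbook relation, offered here as context rather than the thread's exact argument) is smoothing the loss with Gaussian weight noise, which to second order penalizes curvature and hence favors flat minima:

L_\sigma(w) \;=\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\,\sigma^2 I)}\big[\, L(w + \epsilon) \,\big]
\;\approx\; L(w) \;+\; \frac{\sigma^2}{2}\,\mathrm{tr}\!\big(\nabla^2 L(w)\big).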
@TheGradient
Hossein Mobahi
3 months
Applications for the @GoogleAI PhD Fellowship Program are now open. The program will be accepting student applications through May 8. It supports graduate students doing innovative research in computer science and related fields as they pursue their PhD and also connects them to
0
30
132
@TheGradient
Hossein Mobahi
3 years
Thank you Shafi Goldwasser (Turing Award Winner) for your strong statement in support of the Iranian research community in computer science. It means a lot to our community in general, and to young students in particular, who look up to Turing Award Winners as their role models.
@nhaghtal
Nika Haghtalab
3 years
Tweet media one
1
16
145
2
11
134
@TheGradient
Hossein Mobahi
4 years
Among 20 #NeurIPS2020 submissions I am serving as the AC for, one review has been submitted, and it is quite detailed and high quality. Given that reviews are due by July 24th, I am impressed by the timing and quality of this review 😲🙏 Hope this post motivates other reviewers.
5
2
131
@TheGradient
Hossein Mobahi
2 years
Are you a strong PhD student interested in doing cutting edge research at @GoogleAI ? I have an opening for student researcher position to explore open problems and extensions of Sharpness-Aware Minimization (SAM) w/ @bneyshabur . Please refer to .
Tweet media one
4
26
122
@TheGradient
Hossein Mobahi
5 years
Some Papers: remove local minima of neural networks by adding a single unit. Turns Out: similar trick works for any loss function and without auxiliary units. The Catch: these tricks all move local minima to infinity. Nice note by @jaschasd
@jaschasd
Jascha Sohl-Dickstein
5 years
Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit It's less than one page. It may be deep. It may be trivial. It will definitely help you understand how some claims in recent theory papers could possibly be true.
Tweet media one
6
178
702
3
17
102
@TheGradient
Hossein Mobahi
4 years
Upcoming Open Workshop: The Analytical Foundations of Deep Learning: Interpretability and Performance Guarantees (Oct. 19-21 & 23), open to everyone to attend. Covering tutorials, presentations, and brainstorming.
1
29
101
@TheGradient
Hossein Mobahi
4 years
1/6 We propose a “unifying framework” for analyzing implicit bias of neural networks, investigate gradient flow on “linear tensor networks,” and extend existing theoretical results while relaxing some of their convergence assumptions.
2
20
99
@TheGradient
Hossein Mobahi
2 years
1/2 Intriguing observation [leading to a concrete research problem] by DALL·E 2 team: "the rank [of covariance matrix] of the CLIP representation space is drastically reduced when training CLIP with SAM while slightly improving evaluation metrics" 🧐
@OpenAI
OpenAI
2 years
Our newest system DALL·E 2 can create realistic images and art from a description in natural language. See it here:
590
3K
11K
2
21
98
@TheGradient
Hossein Mobahi
4 years
To honor the memory of those lost in the crash of Ukraine Int. Airlines flight #PS752 in Tehran, University of Toronto has established an Iranian Student Memorial Scholarship to support Iranian students or students from any background studying Iranian studies. Thank you @UofT !
3
9
91
@TheGradient
Hossein Mobahi
7 years
We just put online all the talks for the NIPS 2016 workshop on nonconvex optimization. Do not miss the amazing talks:
1
27
87
@TheGradient
Hossein Mobahi
3 years
Double column format is an implicit (over-)regularization against writing detailed mathematical formulas.
2
0
84
@TheGradient
Hossein Mobahi
4 years
Dear friends @CS_UCLA : I will be giving a talk at CS 201 seminar next week on the implicit regularization of 🧠⚗️(aka knowledge distillation). Let's grab a ☕ at our 🏠's and discuss research in this virtual meeting. (cc: @zacharylipton emojis.)
1
12
80
@TheGradient
Hossein Mobahi
4 years
🔥New state-of-the-art performance on ImageNet recorded by our "Sharpness-Aware Minimization" work; a very simple yet effective optimization method. Paper Performance . w/ @Foret_p @bneyshabur and Ariel Kleiner.
@Foret_p
Pierre Foret
4 years
Fine-tuning the EfficientNet l2-475 from the noisy student paper with SAM increases the ImageNet accuracy from 88.2% to 88.6%, giving a new state of the art! With Ariel Kleiner, @TheGradient , and @bneyshabur
0
8
45
1
11
76
@TheGradient
Hossein Mobahi
4 years
To appear in #NeurIPS2020 .
@TheGradient
Hossein Mobahi
4 years
1/5 Self-Distillation loop (feeding predictions as new target values & retraining) improves test accuracy. But why? We show it induces a regularization that progressively limits # of basis functions used to represent the solution. w/ @farajtabar P.Bartlett
Tweet media one
15
193
825
2
6
72
@TheGradient
Hossein Mobahi
4 years
@_arohan_ @geoffreyhinton The universe will start over from a different random seed.
1
0
64
@TheGradient
Hossein Mobahi
4 years
From #NeurIPS2020 program chairs: After screening for various issues, there are approximately 9300 papers ready for the next phase of the review process.
3
4
62
@TheGradient
Hossein Mobahi
3 years
3/5 Earlier today a new SOTA on ImageNet was reported by @quocleix and his team (90.2% on top-1 accuracy). Their approach relies on a feedback loop between a "pair of architectures" (teacher and student). See their tweet here:
Tweet media one
@quocleix
Quoc Le
3 years
Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-) This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2. More details here:
Tweet media one
16
304
1K
1
3
62
@TheGradient
Hossein Mobahi
3 years
5/5 Today, a few years after Jitendra's prediction, we witness that SOTA on ImageNet relies on rich and broad-range feedback loops. I think this is just the beginning, and we will see (and hopefully also understand) more about complex feedback loops in machine learning models.
3
2
59
@TheGradient
Hossein Mobahi
3 years
And so #NeurIPS2021 AC nightmare time begins. With reviews due this Friday, I have 80% of reviews still missing despite sending reminders 😱 Wondering what the situation is with fellow ACs. Is around 80% what you also have missing?
18
0
59
@TheGradient
Hossein Mobahi
6 years
We are organizing ICML18 Workshop on "Nonconvex Optimization for ML", covering a broad range of exciting topics in this area. Submission deadline: May 22 2018 (23:59PDT).
1
9
56
@TheGradient
Hossein Mobahi
2 years
Junior faculty... Applications for @GoogleAI 's Scholar Program will open in 10 days (Nov 3rd). It provides unrestricted gifts to support research conducted by early-career professors (who received their PhD within the past 7 years). Please spread the word
0
8
55
@TheGradient
Hossein Mobahi
4 years
The 2021 Google PhD Fellowship cycle in North America and Europe will open on September 1st, 2020 instead of November 1st. PhD students must be nominated by their university by September 30, 2020.
0
7
54
@TheGradient
Hossein Mobahi
11 months
Tweet media one
0
4
51
@TheGradient
Hossein Mobahi
4 years
@farajtabar @geoffreyhinton @OriolVinyalsML @JeffDean 3/5 Furlanello @zacharylipton @mtschannen Itti & @AnimaAnandkumar noticed that repeating self-distillation for a few iterations creates models w/ better test accuracy. Puzzling, as no external info is fed into the system; the feedback dynamics of the distillation loop somehow evolve toward better models!
Tweet media one
2
7
50
@TheGradient
Hossein Mobahi
3 years
The common pattern I find among many successful people is that they define the opportunity rather than letting opportunity define them. Here is an interesting example from @bneyshabur 's journey to graduate school.
@bneyshabur
Behnam Neyshabur
3 years
Totally agree! Anyone screening applications and any applicant thinking their CV is not representative of their skills/potentials, I think you might want to read the story of my own PhD application in this thread: 1/
30
170
1K
0
1
50
@TheGradient
Hossein Mobahi
2 years
Applications for the 2023 @Google PhD Fellowship awards in North America and Europe will officially open as of today Sep 1 through Oct 6. Apply here and find FAQ (for complete eligibility and application information) here
3
13
50
@TheGradient
Hossein Mobahi
3 years
Classic learning theory (figure) finds overfitting bad, but observations from deep learning contradict this! How can overfitting be benign? Answered in a recent overview/analysis by @Andrea__M , P. Bartlett and S. Rakhlin. (analyzes linear models, but may relate to deep networks via NTK)
Tweet media one
@Andrea__M
Andrea Montanari
3 years
With Peter Bartlett and Sasha Rakhlin, an overview of a slice of research in theoretical machine learning (and an effort to connect some dots):
2
74
379
2
7
49
@TheGradient
Hossein Mobahi
4 years
Great news for transfer learning researchers. AFAIK, there hasn't been any solid benchmark for transfer across classes with a natural hierarchical structure (adapt a model trained on a leaf class to another leaf class related via a common ancestor). Something very much needed has arrived.
@aleks_madry
Aleksander Madry
4 years
Can our models classify Dalmatians as "dogs" even if they are only trained on Poodles? We release a suite of benchmarks for testing model generalization to unseen data subpopulations. w/ @tsiprasd @ShibaniSan Blog (+data): Paper:
Tweet media one
Tweet media two
2
36
177
0
8
49
@TheGradient
Hossein Mobahi
3 years
Have you ever wanted to play with self-distillation and better understand its regularization effect, but didn't have time to write code yourself? Today is your day. @KennethBorup has shared a neat and minimal code that might be a great starting point.
0
9
46
@TheGradient
Hossein Mobahi
1 year
A bit of story. Once upon a time, I was a researcher and Phillip was a PhD student, in the same lab at MIT. Every student in the lab was incredible in some way. Phillip's way: wanted to and could go very very deep into seemingly simple research questions. His blog must be a gem.
@phillip_isola
Phillip Isola
1 year
I give a lot of talks but often only a few people see them. I’m going to try a new experiment where I write blog versions of some of the talks I give. Here’s the first one: A short, general-audience intro to "Generative Models of Images" -->
Tweet media one
5
52
389
2
0
47
@TheGradient
Hossein Mobahi
4 years
Springer has introduced a COVID-19 quarantine package offering about 500 textbooks *** free *** to download. Of particular interest to me (and perhaps you) are the engineering, math, and computer science books among them.
1
14
47
@TheGradient
Hossein Mobahi
4 years
#NeurIPS2020 attendees: I will be 🤩 to see you Wed Dec 9th, 9:00am-11:00am (PST) to have a fun discussion on pseudo-labeling by self-distillation and its implicit regularization effects. Find us here👉 . With @farajtabar and Peter Bartlett at @NeurIPSConf .
Tweet media one
1
5
47
@TheGradient
Hossein Mobahi
2 years
Excited to be chairing an oral session with all ace presentations (including 3 best papers) at @iclr_conf . Looking forward to seeing you and hearing your questions in the #ICLR2022 oral session 2 on Understanding Deep Learning [Apr 26 8am GMT (1am PST)].
0
7
47
@TheGradient
Hossein Mobahi
3 years
At #ICLR2021 spotlights next week, we will present Sharpness-Aware Minimization (SAM): a PRINCIPLED optimization method, that is EASY to implement, and achieves SOTA on several standard benchmarks (SOTA ref ). with @Foret_p Ariel Kleiner @bneyshabur .
Tweet media one
@bneyshabur
Behnam Neyshabur
3 years
Sharpness-Aware Minimization for Efficiently Improving Generalization (Spotlight at #ICLR2021 ) with @Foret_p , Ariel Kleiner and @TheGradient Paper: Code: Video and Poster: 3/7
Tweet media one
2
4
20
0
8
47
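Since the tweet above stresses that SAM is easy to implement, here is a minimal sketch of one SAM step: ascend to the worst-case nearby weights, then descend using the gradient taken there. The loss_grad callback, hyperparameters, and toy quadratic are illustrative assumptions, not the authors' reference code.

import numpy as np

def sam_step(w, loss_grad, rho=0.05, lr=0.01):
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # worst-case perturbation (normalized ascent step)
    g_adv = loss_grad(w + eps)                    # gradient evaluated at the perturbed weights
    return w - lr * g_adv                         # descent step using the SAM gradient

# Toy usage: a quadratic loss with one flat and one sharp direction.
A = np.diag([1.0, 50.0])
loss_grad = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, loss_grad)
print("weights after 200 SAM steps:", w)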
@TheGradient
Hossein Mobahi
1 year
After 12 years of being on Twitter, this is my first non-technical post. Congratulations to Shervin Hajipour, Iranian Women, and all the brave people fighting for Freedom of Iran. This song brings me to tears every time I listen to it.
@Omid_M
Omid Memarian
1 year
Incredible! #ShervinHajipour wins the best "Social Change" song at @RecordingAcad ( #GRAMMYs ) for “Baraye," a song that captured the hearts of millions in Iran and around the world. #MahsaAmini #مهسا_امینی #شروین_حاجی‌پور
88
2K
7K
1
2
44
@TheGradient
Hossein Mobahi
2 years
@bneyshabur I see your point, but building/understanding are distinct goals, and which one to pursue is a matter of personal taste. Engineering may come up with cool working ideas, but not necessarily an explanation for why and how they work. For the latter, you need to build on foundations.
2
0
44
@TheGradient
Hossein Mobahi
3 years
4/5 Such feedback loops across neural networks, as we see here or similarly in self-distillation methods (figure below), are new and operate at a much larger scale than the micro-level feedback in RNNs/LSTMs. They have been shown to be a new and promising way to improve generalization.
Tweet media one
1
3
44
@TheGradient
Hossein Mobahi
2 years
🔥🔥Sharpness Aware Minimization (SAM) is now available in Keras🔥🔥 Would be great if people try it out and provide feedback to me or the Keras team about any potential issues. Big thanks to @fchollet and his team for making this happen.
2
10
44
@TheGradient
Hossein Mobahi
4 years
#NeurIPS program chairs take reviewers' performance into account: The highest-scoring reviewers will be awarded free NeurIPS registrations and will be highlighted on the website. The lowest-scoring reviewers may instead not be invited to review for future conferences.
4
1
43
@TheGradient
Hossein Mobahi
5 years
It's said that deep learning is data hungry, but does it really use every piece of training data in common datasets? Turns out you can discard 10% of the ImageNet training set without any drop in test accuracy.
Tweet media one
2
6
42
@TheGradient
Hossein Mobahi
4 years
@farajtabar @geoffreyhinton @OriolVinyalsML @JeffDean @zacharylipton @mtschannen @AnimaAnandkumar 5/5 The regularization gets more aggressive (sparser representation) after each self-distill round. Thus, the model may initially benefit & avoid overfitting, but excessive regularization leads to underfitting. Consistent empirically on deep nets: test accuracy first goes up and then down.
Tweet media one
3
4
41
@TheGradient
Hossein Mobahi
3 years
Excited to see that Sharpness-Aware Minimization (the SAM optimizer), which we proposed recently (w/ @Foret_p @bneyshabur and Kleiner), is becoming a persistent component of recent state-of-the-art records 😇
@_akhaliq
AK
3 years
Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error pdf: abs: ImageNet SOTA of 86.8% top-1 accuracy after just 34 epochs of training with an NFNet-F5 using the SAM optimizer
Tweet media one
Tweet media two
1
23
85
0
7
41
@TheGradient
Hossein Mobahi
5 years
@roydanroy We have come a long way actually; from elegant ideas to grid search over hyperparameters 😂
0
0
42
@TheGradient
Hossein Mobahi
8 months
@vipul_1011 Sorry to hear that! Here is my “personal” perspective on the matter (the last step in the intern selection process) as a person who regularly interviews intern candidates. There is a huge pile of great candidates and a limited number of openings. One thing that intern hosts hope
4
1
42
@TheGradient
Hossein Mobahi
1 year
@GoogleAI just opened applications for the Award for Inclusion Research program that supports academic research in computing and technology addressing the needs of historically marginalized groups and creates positive societal impact. Apply by July 13 at
0
15
40
@TheGradient
Hossein Mobahi
5 years
For quite some time (NeurIPS18, ICLR19), we have empirically observed that the margin at intermediate layers carries significant information about the generalization of a deep model. Delighted to see @tengyu has now proved this phenomenon, and provided a cleaner definition of the all-layer margin.
@tengyuma
Tengyu Ma
5 years
A new paper on improving the generalization of deep models (w.r.t clean or robust accuracy) by theory-inspired explicit regularizers.
0
84
419
0
3
39
@TheGradient
Hossein Mobahi
2 years
My former PhD adviser @YiMaTweets joined twitter today. Looking forward to tweets about his (unique and rigorous style of) research on ML. Welcome to twitter Yi!
@TheGradient
Hossein Mobahi
4 years
My PhD adviser Yi Ma and academic brother John Wright just put their new & free (pre-production) book online "High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications" . Organized, easy read, w/ lots of visuals.
Tweet media one
7
129
743
0
2
40
@TheGradient
Hossein Mobahi
4 years
I see rapidly growing success from "self-training" and "self-distillation" type methods recently. There is a lot of opportunity there for theoretical understanding and explanations with huge practical impact, as these methods are now at the core of some SOTA models.
@quocleix
Quoc Le
4 years
We researchers love pre-training. Our new paper shows that pre-training is unhelpful when we have a lot of labeled data. In contrast, self-training works well even when we have a lot of labeled data. SOTA on PASCAL segmentation & COCO detection. Link:
Tweet media one
7
229
980
1
2
38
@TheGradient
Hossein Mobahi
4 years
Understanding and predicting generalization performance in deep neural networks with emphasis on "robustness" of predictions. Valuable findings on success and failure of different complexity measures when analyzed via the lens of robust predictions; with both good and bad news.
Tweet media one
@alexandredrouin
Alexandre Drouin
4 years
#NeurIPS2020 paper "In Search of Robust Measures of Generalization" evaluates robustness of generalization theories. None is robust, nevermind fantastic! 😭 w/ @KDziugaite @CasualBrady @nitarshan @ethancaballero Linbo Wang @bouzoukipunks @roydanroy
Tweet media one
1
29
103
0
7
39
@TheGradient
Hossein Mobahi
6 years
Can generalization gap in deep networks be "accurately" predicted from trained_weights plus training_data? New evidence says "Yes" via normalized margin distributions at all layers.
Tweet media one
0
9
39
@TheGradient
Hossein Mobahi
3 years
Summary of Reviewing Policy for #NeurIPS 2021: (1) Review in OpenReview. (2) Rejected submissions won't be made public unless the authors choose so. (3) No desk rejection. (4) Rolling discussion between authors and reviewers after initial reviews and author responses are submitted.
@NeurIPSConf
NeurIPS Conference
3 years
Check out the updates to the reviewing process for NeurIPS 2021!
2
49
152
0
9
39
@TheGradient
Hossein Mobahi
4 years
🥇For the first time, there is now a competition for predicting generalization performance of deep learning @NeurIPSConf #NeurIPS2020 . With @PGDL_NeurIPS . Do not miss the opportunity 😀
@PGDL_NeurIPS
Predicting Generalization in Deep Learning
4 years
Do you have a theory or a hypothesis that can predict generalization performance in deep learning? Now, for the first time, there is a competition for it & you can validate your idea @NeurIPSConf #NeurIPS2020 . We hope this will instigate progress in understanding generalization.
5
54
165
1
7
38
@TheGradient
Hossein Mobahi
4 years
@farajtabar 2/5 Knowledge distillation by @geoffreyhinton @OriolVinyalsML @JeffDean was originally motivated by transferring knowledge from large to smaller networks. Self-distillation is the special case with identical architectures; the predictions of the model are fed back to itself as new target values.
1
1
34
@TheGradient
Hossein Mobahi
4 years
Yay! @NeurIPSConf extended submission deadline for #NeurIPS2020 . Abstract deadline: May 27. Paper deadline: June 3.
0
5
35
@TheGradient
Hossein Mobahi
2 years
On an unrelated note, I have been fortunate to be hosting this smart person @_christinabaek this summer at @GoogleAI and her coauthor, the incredible @yidingjiang as an AI resident at @GoogleAI in the past. Please keep expanding your lab @zicokolter 😀
@_christinabaek
Christina Baek
2 years
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (), we show this can be done using models’ agreement. w/ @yidingjiang , Aditi R., @zicokolter
12
82
516
2
2
34
@TheGradient
Hossein Mobahi
1 year
Applications are open for the @Google Carbon Removal Research Awards, a program looking to fund selected academic research efforts to better characterize and accelerate the development of new carbon removal approaches. Apply by April 28! Learn more:
1
10
31
@TheGradient
Hossein Mobahi
1 year
First observed by the DALL·E 2 team of @OpenAI , the SAM optimizer tends to produce lower-rank representations compared to SGD. Yet an explanation for this phenomenon remained missing until today. But wait... @maksym_andr just flipped it around! Pleasure to work w/ him, @dara_bahri , N. Flammarion.
@maksym_andr
Maksym Andriushchenko
1 year
🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” ! ❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM? (with @dara_bahri , @TheGradient , N. Flammarion) 🧵1/n
Tweet media one
4
24
157
0
4
32
@TheGradient
Hossein Mobahi
5 years
Excited to give an invited talk at @ipam_ucla 's workshop "Partial Differential Equations & Inverse Problem Methods in Machine Learning" April 2020. And no, I won't talk about heat equation this time (global warming?), but Poisson's equation. Details soon 😎
1
5
32
@TheGradient
Hossein Mobahi
3 years
@deviparikh Ugh! They seem to be aggressively replacing everything.
2
0
31
@TheGradient
Hossein Mobahi
4 years
@farajtabar @geoffreyhinton @OriolVinyalsML @JeffDean @zacharylipton @mtschannen @AnimaAnandkumar 🔥🔥NEW🔥🔥"Self-Distillation Amplifies Regularization in Hilbert Space" updated on arXiv ! Changes: 🧹🧹cleaned up proofs🧹🧹 + 💾💾added code (in appendix) for the illustrative example and deep learning experiments💾💾.
Tweet media one
1
2
30
@TheGradient
Hossein Mobahi
4 years
The end of the double-descent era and the beginning of the multi-descent era for generalization curves. This work proves multi-descent curves exist. In fact, it tells you how to shape them the way you wish [for linear regression]! Very interesting work @aminkarbasi 🤓
Tweet media one
@aminkarbasi
Amin Karbasi
4 years
Can the existence of a multiple descent generalization curve be rigorously proven? Can an arbitrary number of descents occur? Can the generalization curve and the locations of descents be designed? We answer yes to all three of these questions.
Tweet media one
5
38
199
1
2
29
@TheGradient
Hossein Mobahi
6 years
ICML Workshop on Nonconvex Optimization was a great success thanks to 7 fantastic speakers (including @HazanPrinceton @svlevine ), 34 high quality (accepted) papers (online: ) and great audience. @AnimaAnandkumar
0
7
29
@TheGradient
Hossein Mobahi
3 years
Bonus/10: A (1.5-page) PDF version of this thread can also be found below.
4
1
29
@TheGradient
Hossein Mobahi
2 years
New theory on SAM: 1. Provides Convergence Analysis: highly non-trivial for SAM due to 1/||∇L|| in the update rule (intricately couples coordinates). 2. ∇³L: We know that with two ∇'s, SAM efficiently approximates ∇² (curvature). This work shows SAM implicitly utilizes ∇³ as well!
@philipmlong
Phil Long
2 years
New paper with Peter Bartlett and @obousquet called "The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima": .
1
9
45
0
0
28
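Written out, the update rule referenced in the tweet above (standard SAM form) is the following; the 1/||∇L(w_t)|| factor inside the perturbation is what couples the coordinates and complicates the convergence analysis.

\epsilon_t \;=\; \rho\,\frac{\nabla L(w_t)}{\|\nabla L(w_t)\|},
\qquad
w_{t+1} \;=\; w_t \;-\; \eta\,\nabla L\big(w_t + \epsilon_t\big).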
@TheGradient
Hossein Mobahi
2 years
Disappointed with fellow AC's decision #icml2022 . AC instruction: "Only papers where all reviews are in, and all reviews recommend Phase 1 reject, are candidates for Phase 1 rejection." Yet, we got a paper rejected with one of the reviews being accept. cc/ @CsabaSzepesvari
2
0
28
@TheGradient
Hossein Mobahi
7 years
Amazing discussion ( @rsalakhu , Liang, Bartlett, Bengio, Kakade) at NIPS-W "DL: Bridging Theory and Practice". A key direction to @prfsanjeevarora is connection between logic/reasoning and differentiable models. Well done @lschmidt3 @maithra_raghu .
Tweet media one
1
7
27
@TheGradient
Hossein Mobahi
5 years
Mikhail Belkin giving a super interesting talk at #ICML #ICML19 #ICML2019 workshop on generalization in deep learning on why over-parameterized models that achieve zero training loss can still generalize. 🔑 is regularization; implicit (e.g. SGD) or explicit (e.g. norm penalty).
Tweet media one
2
4
26
@TheGradient
Hossein Mobahi
4 years
#NeurIPS2020 has extended deadline by 48 hours in support of those affected by recent events . The new deadline is Friday June 5, 2020 at 1pm PDT. Thank you @NeurIPSConf
0
10
27
@TheGradient
Hossein Mobahi
2 years
@YiMaTweets joining twitter took me down memory lane to my PhD, and the people to whom I dedicated my dissertation: my wife, child, and parents 🌺 This is a journey that depends on you, your adviser, and the people around you at so many levels.
Tweet media one
0
3
27
@TheGradient
Hossein Mobahi
4 years
@roydanroy The third grandchild.
1
0
27
@TheGradient
Hossein Mobahi
9 months
. @UMich friends! I will be visiting on Oct 19th to discuss some mysteries of weight noise, gradient penalties, sharpness regularization, Hessian structure, and activation functions. Let's grab a ☕! Work by @ynd , A. Agarwala, and me. Thanks @qing for hosting.
2
2
26
@TheGradient
Hossein Mobahi
2 years
2/2 SAM was motivated for improving generalization in classification tasks. Why training CLIP with SAM yields a representation subspace of lower dimension is an unexplained side effect; great open problem [for y'all!] with great practical impact. cc/ @HochreiterSepp @jeankaddour
2
4
26
@TheGradient
Hossein Mobahi
3 years
2/5 Then he said this, which has been stuck in my mind since then: "I think in the next few years we'll see a lot of papers which will show feedback has something significant". Sure, we do have RNNs and LSTMs, but those are very limited forms of feedback: short-range and at the micro-level.
1
0
26
@TheGradient
Hossein Mobahi
1 year
The growing evidence on the benefits of over-parameterization in optimization (attaining lower loss) and generalization (simpler solutions in some sense) has turned over-parameterization into a free lunch in people's minds. But here is a twist!
@SimonShaoleiDu
Simon Shaolei Du
1 year
Over-parametrization helps gradient descent find a global minimum. What about the convergence rate? Over-parametrization can make convergence EXPONENTIALLY SLOWER: from exp(-T) to a surprising 1/T^3 rate! Paper: Joint work w. Weihang Xu
Tweet media one
Tweet media two
7
32
312
1
1
26
@TheGradient
Hossein Mobahi
4 years
A great initiative! The author plans to blog a series of materials in a Colab environment about kernel methods and investigate their behavior in a hands-on fashion. Seems very neat and accessible. Looking forward to future posts by @blake__bordelon !
@blake__bordelon
Blake Bordelon ☕️🧪👨‍💻
4 years
Ever wondered when/why kernel methods or infinite width neural networks generalize? I wrote a blog post about a recent theory developed with @CPehlevan and @canatar_a that predicts test risk for any kernel or data distribution
0
22
57
0
3
24
@TheGradient
Hossein Mobahi
4 years
We just released document V1.0 about the #NeurIPS2020 generalization competition to cover competition description (datasets, metrics, etc.), timeline, rules and protocols.
@PGDL_NeurIPS
Predicting Generalization in Deep Learning
4 years
We just released the first document of the competition on our website . It covers competition description (datasets, metrics, etc.), timeline, rules and protocols. Hurry up... Phase 1 starts on July 15.
0
9
34
1
7
25
@TheGradient
Hossein Mobahi
5 years
@geoffreyhinton We should build more X staircases everywhere then.
1
1
23
@TheGradient
Hossein Mobahi
4 years
Interesting result by Kumar et al.🔥Many deep RL methods estimate value functions by bootstrapping. By connecting this to self-distillation in RKHS, they show the combo of bootstrapping & gradient descent can lead to an "implicit under-parameterization" (low-rank model) phenomenon.
Tweet media one
@svlevine
Sergey Levine
4 years
We've been studying why deep RL is so hard, and we think we have another reason: implicit under-parameterization: Iteratively training on your own targets is a kind of "self-distillation," and leads to loss of rank -> w/ Aviral Kumar @agarwl_ @its_dibya
Tweet media one
3
53
290
0
2
23
@TheGradient
Hossein Mobahi
3 years
A practical example of how self-distillation blocks overfitting to noise. Image-text pairs from the Internet are usually weakly correlated and noise is abundant. The authors show self-distillation w/ soft labels mitigates this noise. By @berkeley_ai (Cheng, @mejoeyg ) and @facebookai .
@_akhaliq
AK
3 years
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation pdf: abs: model achieves strong performance with only 3M image text pairs, 133x smaller than CLIP
Tweet media one
Tweet media two
0
8
61
1
2
23
@TheGradient
Hossein Mobahi
5 years
Submission ☠┊(deadline) for #ICML2019 workshop "Generalization in Deep Networks" extended to May 23. Contribute to understanding this mysterious and 🎁(surprising) area, and join us at 🏖️ (Long Beach) for ☕ (hearing great invited speakers and accepted posters). @icmlconf
@TheGradient
Hossein Mobahi
5 years
A mystery of our era: why/when/how deep networks generalize? Submit a paper or join discussion @icmlconf workshop. Speakers M. Belkin @chelseabfinn @ShamKakade6 @jasondeanlee @aleks_madry @roydanroy . w/ Bartlett @dawnsongtweets Srebro @bneyshabur @dilipkay
0
27
96
1
12
22