(0/17) Grab your🍿 for a thread on some mysteries and explanations connecting flat minima, second order optimization, weight noise, gradient norm penalty, and activation functions😱 There is also a video presentation if you prefer:
Today
@GoogleAI
officially launched the Google Research YouTube channel 🚀 On the channel, we have a range of content including shows such as (1) Meet a Researcher and (2) ResearchBytes, along with (3) Spotlights.
1/5 Self-Distillation loop (feeding predictions as new target values & retraining) improves test accuracy. But why? We show it induces a regularization that progressively limits # of basis functions used to represent the solution. w/
@farajtabar
P.Bartlett
My PhD adviser Yi Ma and academic brother John Wright just put their new & free (pre-production) book online "High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications" . Organized, easy read, w/ lots of visuals.
1/5 In July 2016, Jitendra Malik gave an inspiring talk at Google, with a slide showing a block diagram of the visual pathway in a primate. He said "there are a lot of feedback loops as you see". He then stressed that, in contrast, the current deep neural architectures are mainly feedforward.
1/10 Heard about implicit regularization of SGD (i.e. bias toward certain solutions not explicitly stated in the objective function) and wondered why it happens? This thread provides some introductory analysis on why SGD prefers solutions with small norm.
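A minimal demonstration of that bias (my own toy sketch, not from the thread): gradient descent on an under-determined least-squares problem, initialized at zero, converges to the minimum-norm interpolating solution.

```python
import numpy as np

# Under-determined system: 5 equations, 20 unknowns -> infinitely many solutions.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))
b = rng.normal(size=5)

w = np.zeros(20)  # starting at zero keeps iterates in the row space of A
for _ in range(20000):
    w -= 0.01 * A.T @ (A @ w - b)  # gradient of 0.5*||A w - b||^2

w_min_norm = np.linalg.pinv(A) @ b  # explicit minimum-norm solution
print(np.linalg.norm(A @ w - b))        # ~0: w interpolates the data
print(np.linalg.norm(w - w_min_norm))   # ~0: and it is the min-norm interpolant
```

Among the infinitely many zero-loss solutions, nothing in the objective asks for small norm; the bias comes entirely from the dynamics (zero initialization plus gradient updates staying in the row space of A).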
One of the most touching thesis dedications: "To all the students who had to discontinue their PhD because of toxic work environments, and to all the kind and humble researchers who are striving to make academia a better place." from
@_vaishnavh
's PhD thesis
@SCSatCMU
Aug. 2021.
Are neural networks learning or memorizing? It must be learning, otherwise how do they generalize so well? But hey, maybe memorization is not against generalization, and maybe it is even necessary when there's little training data for some classes. by
@vitalyFM
& Zhang.
One of the most comprehensive studies of generalization to date; ≈40 complexity measures over ≈10K deep models. Surprising observations worthy of further investigation. Fantastic Generalization Measures: w/
@yidingjiang
@bneyshabur
@dilipkay
S. Bengio
In case you missed the news:
@GoogleAI
's Student Researcher Program (i.e. internship, read below though) for 2024 is now live and you can apply here:
Note: Intern is the same as Student Researcher. The only difference is that the former relates to
1/11 Earlier this year, I promised to write an introductory thread on calculus of variations and its uses in machine learning. The time has arrived! I am not going to give a rigorous treatment, but a minimal intuition here. In one line: it is a formalism for seeking optimal functions.
1/3 DEMOGEN is a dataset of 756 CNN/ResNet-32 models trained on CIFAR-10/100 w/ various regularization and hyperparameters, leading to a wide range of generalization behaviors. Hope the dataset can help the community w/ exploring generalization in
#deeplearning
To my fellow
#NeurIPS
area chairs, author response is planned to open tomorrow. A LOT of effort goes into writing these responses even though they are short. Please DO read them CAREFULLY and engage in discussion with reviewers. Do not let R2 determine the fate of a good paper.
1/10 Gaussian forms are prevalent in machine learning. Familiar: normal distribution (of stats), positive definite kernels (of RKHS in SVMs), etc. But I like to highlight it from optimization viewpoint, in light of recent generalization gains by flat minima (e.g. SAM optimizer).
Applications for the
@GoogleAI
PhD Fellowship Program are now open. The program will be accepting student applications through May 8. It supports graduate students doing innovative research in computer science and related fields as they pursue their PhD and also connects them to
Thank you Shafi Goldwasser (Turing Award Winner) for your strong statement in support of the Iranian research community in computer science. It means a lot to our community in general, and to young students in particular, who look up to Turing Award Winners as their role models.
Among 20
#NeurIPS2020
submissions I am serving as the AC for, one review has been submitted, and it is quite detailed and high quality. Given that reviews are due by July 24th, I am impressed by the timing and quality of this review 😲🙏 Hope this post motivates other reviewers.
Are you a strong PhD student interested in doing cutting edge research at
@GoogleAI
? I have an opening for student researcher position to explore open problems and extensions of Sharpness-Aware Minimization (SAM) w/
@bneyshabur
. Please refer to .
Some Papers: remove local minima of neural networks by adding a single unit. Turns Out: similar trick works for any loss function and without auxiliary units. The Catch: these tricks all move local minima to infinity. Nice note by
@jaschasd
Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit
It's less than one page. It may be deep. It may be trivial. It will definitely help you understand how some claims in recent theory papers could possibly be true.
Upcoming Open Workshop: The Analytical Foundations of Deep Learning: Interpretability and Performance Guarantees (Oct. 19-21&23) open to everyone to attend. Covering tutorials, presentations, and brainstorming.
1/6 We propose a “unifying framework” for analyzing implicit bias of neural networks, investigate gradient flow on “linear tensor networks,” and extend existing theoretical results while relaxing some of their convergence assumptions.
1/2 Intriguing observation [leading to a concrete research problem] by DALL·E 2 team: "the rank [of covariance matrix] of the CLIP representation space is drastically reduced when training CLIP with SAM while slightly improving evaluation metrics" 🧐
To honor the memory of those lost in the crash of Ukraine Int. Airlines flight
#PS752
in Tehran, University of Toronto has established an Iranian Student Memorial Scholarship to support Iranian students or students from any background studying Iranian studies. Thank you
@UofT
!
Dear friends
@CS_UCLA
: I will be giving a talk at the CS 201 seminar next week on the implicit regularization of 🧠⚗️ (aka knowledge distillation). Let's grab a ☕ at our 🏠's and discuss research in this virtual meeting.
(cc:
@zacharylipton
emojis.)
🔥New state-of-the-art performance on ImageNet recorded by our "Sharpness-Aware Minimization" work; a very simple yet effective optimization method. Paper Performance . w/
@Foret_p
@bneyshabur
and Ariel Kleiner.
Fine-tuning the EfficientNet l2-475 from the noisy student paper with SAM increases the ImageNet accuracy from 88.2% to 88.6%, giving a new state of the art!
With Ariel Kleiner,
@TheGradient
, and
@bneyshabur
From
#NeurIPS2020
program chairs: After screening for various issues, there are approximately 9300 papers ready for the next phase of the review process.
3/5 Earlier today a new SOTA on ImageNet was reported by
@quocleix
and his team (90.2% on top-1 accuracy). Their approach relies on a feedback loop between a "pair of architectures" (teacher and student). See their tweet here:
Some nice improvement on ImageNet: 90% top-1 accuracy has been achieved :-)
This result is possible by using Meta Pseudo Labels, a semi-supervised learning method, to train EfficientNet-L2.
More details here:
5/5 Today, a few years after Jitendra's prediction, we witness that SOTA on ImageNet relies on rich and broad-range feedback loops. I think this is just the beginning, and we will see (and hopefully also understand) more about complex feedback loops in machine learning models.
And so
#NeurIPS2021
AC nightmare time begins. With reviews due this Friday, I have 80% of reviews still missing despite sending reminders 😱 Wondering what the situation is with fellow ACs. Is around 80% missing for you too?
We are organizing ICML18 Workshop on "Nonconvex Optimization for ML", covering a broad range of exciting topics in this area. Submission deadline: May 22 2018 (23:59PDT).
Junior faculty... Applications for
@GoogleAI
's Scholar Program will be open in 10 days (Nov 3rd). It provides unrestricted gifts to support research conducted by early-career professors (who received their PhD within the past 7 years). Please spread the word
The 2021 Google PhD Fellowship cycle in North America and Europe will open on September 1st, 2020 instead of November 1st. PhD students must be nominated by their university by September 30, 2020.
The common pattern I find among many successful people is that they define the opportunity rather than letting opportunity define them. Here is an interesting example from
@bneyshabur
's journey to graduate school.
Totally agree!
Anyone screening applications and any applicant thinking their CV is not representative of their skills/potential, I think you might want to read the story of my own PhD application in this thread:
1/
Applications for the 2023
@Google
PhD Fellowship awards in North America and Europe will officially open as of today Sep 1 through Oct 6. Apply here and find FAQ (for complete eligibility and application information) here
Classic learning theory (figure) finds overfitting bad, but observations from deep learning contradict it! How can overfitting be benign? Answered in a recent overview/analysis by
@Andrea__M
, P. Bartlett and S. Rakhlin. (analyzes linear models, but may relate to deep networks via NTK)
Great news for transfer learning researchers. AFAIK, there hasn't been any solid benchmark for transfer across classes with a natural hierarchical structure (adapt a model trained on a leaf class to another leaf class related via a common ancestor). A very much needed one has arrived.
Can our models classify Dalmatians as "dogs" even if they are only trained on Poodles? We release a suite of benchmarks for testing model generalization to unseen data subpopulations. w/
@tsiprasd
@ShibaniSan
Blog (+data):
Paper:
Have you ever wanted to play with self-distillation and better understand its regularization effect, but didn't have time to write code yourself? Today is your day.
@KennethBorup
has shared a neat and minimal code that might be a great starting point.
A bit of story. Once upon a time, I was a researcher and Phillip was a PhD student, in the same lab at MIT. Every student in the lab was incredible in some way. Phillip's way: wanted to and could go very very deep into seemingly simple research questions. His blog must be a gem.
I give a lot of talks but often only a few people see them. I’m going to try a new experiment where I write blog versions of some of the talks I give.
Here’s the first one:
A short, general-audience intro to "Generative Models of Images" -->
Springer has introduced a COVID-19 quarantine package offering about 500 textbooks *** free *** to download. Of particular interest to me (and perhaps you) are the engineering, math, and computer science books among them.
#NeurIPS2020
attendees: I will be 🤩 to see you Wed Dec 9th, 9:00am-11:00am (PST) to have a fun discussion on pseudo-labeling by self-distillation and its implicit regularization effects. Find us here👉 . With
@farajtabar
and Peter Bartlett at
@NeurIPSConf
.
Excited to be chairing an oral session with all ace presentations (including 3 best papers) at
@iclr_conf
. Looking forward to seeing you and hearing your questions in the
#ICLR2022
oral session 2 on Understanding Deep Learning [Apr 26 8am GMT (1am PST)].
At
#ICLR2021
spotlights next week, we will present Sharpness-Aware Minimization (SAM): a PRINCIPLED optimization method, that is EASY to implement, and achieves SOTA on several standard benchmarks (SOTA ref ). with
@Foret_p
Ariel Kleiner
@bneyshabur
.
Sharpness-Aware Minimization for Efficiently Improving Generalization (Spotlight at
#ICLR2021
)
with
@Foret_p
, Ariel Kleiner and
@TheGradient
Paper:
Code:
Video and Poster:
3/7
After 12 years of being on Twitter, this is my first non-technical post. Congratulations to Shervin Hajipour, Iranian Women, and all the brave people fighting for Freedom of Iran. This song brings me to tears every time I listen to it.
@bneyshabur
I see your point, but building and understanding are distinct goals, and which one to pursue is a matter of personal taste. Engineering may come up with cool working ideas, but not necessarily an explanation for why and how they work. For the latter, you need to build on foundations.
4/5 Such feedback loops across neural networks, as we see here or in self-distillation methods (figure below), are new and operate at a much larger scale than the micro feedbacks in RNNs/LSTMs. They have been shown to be a new and promising way to improve generalization.
🔥🔥Sharpness Aware Minimization (SAM) is now available in Keras🔥🔥
Would be great if people try it out and provide feedback to me or the Keras team about any potential issues. Big thanks to
@fchollet
and his team for making this happen.
#NeurIPS
program chairs take reviewers' performance into account: The highest-scoring reviewers will be awarded free NeurIPS registrations and will be highlighted on our website. The lowest-scoring reviewers may instead not be invited to review for future conferences.
It's said that deep learning is data hungry, but does it really use every piece of training data in common datasets? Turns out you can discard 10% of ImageNet training set without any drop in test accuracy.
Excited to see Sharpness-Aware Minimization (SAM optimizer) we have proposed recently (w/
@Foret_p
@bneyshabur
and Kleiner) is becoming a persistent component in recent state-of-the-art records 😇
Drawing Multiple Augmentation Samples Per Image
During Training Efficiently Decreases Test Error
pdf:
abs:
ImageNet SOTA of 86.8% top-1 accuracy after just 34 epochs of training with an NFNet-F5 using the SAM optimizer
@vipul_1011
Sorry to hear that! Here is my “personal” perspective on the matter (the last step in the intern selection process) as a person who regularly interviews intern candidates. There is a huge pile of great candidates and a limited number of openings. One thing that intern hosts hope
@GoogleAI
just opened applications for the Award for Inclusion Research program that supports academic research in computing and technology addressing the needs of historically marginalized groups and creates positive societal impact. Apply by July 13 at
For quite some time (NeurIPS18, ICLR19), we have empirically observed that margin at intermediate layers carries significant information about the generalization of a deep model. Delighted to see
@tengyu
has now proved this phenomenon, and provided a cleaner definition of all-layer margin.
My former PhD adviser
@YiMaTweets
joined twitter today. Looking forward to tweets about his (unique and rigorous style of) research on ML. Welcome to twitter Yi!
I see a rapidly growing success from "self-training" and "self-distillation" type methods recently. There is a lot of opportunity there for theoretical understanding and explanations with huge practical impact as these methods are now at the core to some SOTA models.
We researchers love pre-training. Our new paper shows that pre-training is unhelpful when we have a lot of labeled data. In contrast, self-training works well even when we have a lot of labeled data. SOTA on PASCAL segmentation & COCO detection.
Link:
Understanding and predicting generalization performance in deep neural networks with emphasis on "robustness" of predictions. Valuable findings on success and failure of different complexity measures when analyzed via the lens of robust predictions; with both good and bad news.
Can generalization gap in deep networks be "accurately" predicted from trained_weights plus training_data? New evidence says "Yes" via normalized margin distributions at all layers.
Summary of Reviewing Policy for
#NeurIPS
2021: (1) Review in OpenReview. (2) Rejected submissions won't be made public unless authors want so. (3) No desk rejection. (4) Rolling discussion between authors and reviewers after initial reviews and author responses are submitted.
🥇For the first time, there is now a competition for predicting generalization performance of deep learning
@NeurIPSConf
#NeurIPS2020
. With
@PGDL_NeurIPS
. Do not miss the opportunity 😀
Do you have a theory or a hypothesis that can predict generalization performance in deep learning? Now, for the first time, there is a competition for it & you can validate your idea
@NeurIPSConf
#NeurIPS2020
. We hope this will instigate progress in understanding generalization.
@farajtabar
2/5 Knowledge distillation by
@geoffreyhinton
@OriolVinyalsML
@JeffDean
was originally motivated by transferring knowledge from large to smaller networks. Self-distillation is the special case with identical architectures; the model's predictions are fed back to itself as new target values.
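The loop described above can be sketched in a few lines (my own illustrative NumPy version, not the paper's code; the RBF kernel, its bandwidth, and the ridge parameter are assumptions for a toy setting):

```python
import numpy as np

# Self-distillation on kernel ridge regression: fit the model, replace the
# labels with its own predictions, and fit again from scratch.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=40)  # noisy targets

K = np.exp(-((X - X.T) ** 2) / 0.1)  # RBF Gram matrix (bandwidth is arbitrary)
lam = 1e-2                           # ridge regularization strength

targets, norms = y.copy(), []
for _ in range(5):                   # five rounds of self-distillation
    alpha = np.linalg.solve(K + lam * np.eye(40), targets)
    preds = K @ alpha                # fitted values of this round
    norms.append(np.linalg.norm(preds))
    targets = preds                  # predictions become the next round's labels
print(norms)  # the solution shrinks round after round
```

Each round multiplies every spectral component of the solution by a factor below one, so the fitted function contracts progressively; this is the "progressively limits # of basis functions" effect in miniature.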
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (), we show this can be done using models’ agreement.
w/
@yidingjiang
, Aditi R.,
@zicokolter
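The core measurement is simple enough to sketch (a toy version of mine, not the paper's code; the class labels below are hypothetical):

```python
import numpy as np

def agreement_rate(preds_a, preds_b):
    """Fraction of unlabeled examples on which two models predict the same class."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

# Hypothetical predicted labels from two independently trained runs
# on the same unlabeled OOD batch (no ground-truth labels needed):
model_1 = [0, 1, 1, 2, 0, 1, 2, 2]
model_2 = [0, 1, 2, 2, 0, 1, 2, 0]
print(agreement_rate(model_1, model_2))  # → 0.75
```

The point of the work is that this agreement rate, computable without any labels, tracks the models' (unknown) accuracy on the OOD distribution.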
Applications are open for the
@Google
Carbon Removal Research Awards, a program looking to fund selected academic research efforts to better characterize and accelerate the development of new carbon removal approaches. Apply by April 28! Learn more:
First observed by DALL·E2 of
@OpenAI
, SAM optimizer tends to produce lower-rank representations compared to SGD. Yet an explanation for this phenomenon remained missing until today. But wait...
@maksym_andr
just flipped it around! Pleasure to work w/ him,
@dara_bahri
, N. Flammarion.
🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” !
❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM?
(with
@dara_bahri
,
@TheGradient
, N. Flammarion)
🧵1/n
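One common way to quantify such low-rank structure can be sketched as follows (my own illustration, not the authors' code; the 99% variance threshold and the synthetic features are arbitrary choices):

```python
import numpy as np

def effective_rank(features, threshold=0.99):
    """Number of principal directions needed to explain `threshold` of the
    variance of a batch of feature vectors."""
    features = features - features.mean(axis=0)    # center the batch
    s = np.linalg.svd(features, compute_uv=False)  # singular values
    var = s ** 2 / np.sum(s ** 2)                  # variance fractions
    return int(np.searchsorted(np.cumsum(var), threshold) + 1)

# 256 synthetic feature vectors lying (up to tiny noise) in a 3-D subspace of R^64:
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(64, 3)))  # orthonormal 3-D basis
Z = rng.normal(size=(256, 3)) @ basis.T
Z += 1e-6 * rng.normal(size=Z.shape)
print(effective_rank(Z))  # → 3
```

Applying such a measure to penultimate-layer activations of SAM- vs SGD-trained models is the kind of comparison the observation above is about.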
Excited to give an invited talk at
@ipam_ucla
's workshop "Partial Differential Equations & Inverse Problem Methods in Machine Learning" April 2020. And no, I won't talk about heat equation this time (global warming?), but Poisson's equation. Details soon 😎
The end of double-descent era and the beginning of multi-descent era for generalization curves. This work proves multi-descent curves exist. In fact, it tells you how to shape them up in the way you wish [for linear regression]! Very interesting work
@aminkarbasi
🤓
Can the existence of a multiple descent generalization curve be rigorously proven? Can an arbitrary number of descents occur? Can the generalization curve and the locations of descents be designed? We answer yes to all three of these questions.
ICML Workshop on Nonconvex Optimization was a great success thanks to 7 fantastic speakers (including
@HazanPrinceton
@svlevine
), 34 high quality (accepted) papers (online: ) and great audience.
@AnimaAnandkumar
New theory on SAM:
1. Provides Convergence Analysis: highly non-trivial for SAM due to 1/||∇L|| in the update rule (intricately couples coordinates).
2. ∇³L: We know that with two ∇'s, SAM efficiently approximates ∇² (curvature). This work shows SAM implicitly utilizes ∇³ as well!
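The update rule in question can be sketched as follows (a minimal NumPy version of mine; the step size, radius `rho`, and toy quadratic are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.05, rho=0.05):
    """One SAM step: perturb toward the (approximately) worst nearby point,
    then descend using the gradient taken there."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # the 1/||∇L|| factor that couples coordinates
    return w - lr * loss_grad(w + eps)           # sharpness-aware gradient step

# Toy quadratic L(w) = 0.5 * w^T diag(1, 10) w, minimized at the origin:
H = np.diag([1.0, 10.0])
w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, lambda v: H @ v)
print(w)  # ends up in a small neighborhood of the minimum (radius ~ rho)
```

The normalization by ||∇L|| is exactly why the convergence analysis is non-trivial: it makes each coordinate's update depend on the full gradient vector, not just its own partial derivative.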
New paper with Peter Bartlett and
@obousquet
called "The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima": .
Disappointed with fellow AC's decision
#icml2022
. AC instruction: "Only papers where all reviews are in, and all reviews recommend Phase 1 reject, are candidates for Phase 1 rejection." Yet, we got a paper rejected with one of the reviews being accept. cc/
@CsabaSzepesvari
10/10 For more details see , , . Also check papers by
@oberman_adam
. I decided to write this after reading
@docmilanfar
's recent post on bump functions (not Gaussian but related).
Amazing discussion (
@rsalakhu
, Liang, Bartlett, Bengio, Kakade) at NIPS-W "DL: Bridging Theory and Practice". A key direction according to
@prfsanjeevarora
is the connection between logic/reasoning and differentiable models. Well done
@lschmidt3
@maithra_raghu
.
Mikhail Belkin giving a super interesting talk at
#ICML
#ICML19
#ICML2019
workshop on generalization in deep learning on why over-parameterized models that achieve zero training loss can still generalize. 🔑 is regularization; implicit (e.g. SGD) or explicit (e.g. norm penalty).
#NeurIPS2020
has extended the deadline by 48 hours in support of those affected by recent events . The new deadline is Friday June 5, 2020 at 1pm PDT. Thank you
@NeurIPSConf
@YiMaTweets
joining twitter took me down memory lane to my PhD, and to the people to whom I dedicated my dissertation: my wife, child, and parents 🌺 This is a journey that depends on you, your adviser, and the people around you at so many levels.
.
@UMich
friends! I will be visiting on Oct 19th to discuss some mysteries of weight noise, gradient penalties, sharpness regularization, Hessian structure, and activation functions. Let's grab a☕! Work by
@ynd
, A. Agarwala and I. Thanks
@qing
for hosting.
2/2 SAM was motivated by improving generalization in classification tasks. Why training CLIP with SAM yields a representation subspace of lower dimension is an unexplained side effect; a great open problem [for y'all!] with great practical impact. cc/
@HochreiterSepp
@jeankaddour
2/5 Then he said this, which has been stuck in my mind since then: "I think in the next few years we'll see a lot of papers which will show feedback has something significant". Sure, we have RNNs and LSTMs, but those are very limited forms of feedback: short-range and at the micro level.
The growing evidence on the benefits of over-parameterization in optimization (attaining lower loss) and generalization (simpler solutions in some sense) has turned over-parameterization into a free lunch in our minds. But here is a twist!
Over-parametrization helps gradient descent find a global minimum. What about the convergence rate?
Over-parametrization can make convergence EXPONENTIALLY SLOWER: from exp(-T) to a surprising 1/T^3 rate!
Paper:
Joint work w. Weihang Xu
A great initiative! The author plans to blog a series of materials in a Colab environment about kernel methods and investigate their behavior in a hands-on fashion. Seems very neat and accessible. Looking forward to future posts by
@blake__bordelon
!
Ever wondered when/why kernel methods or infinite width neural networks generalize? I wrote a blog post about a recent theory developed with
@CPehlevan
and
@canatar_a
that predicts test risk for any kernel or data distribution
We just released document V1.0 about the
#NeurIPS2020
generalization competition to cover competition description (datasets, metrics, etc.), timeline, rules and protocols.
We just released the first document of the competition on our website . It covers competition description (datasets, metrics, etc.), timeline, rules and protocols. Hurry up... Phase 1 starts on July 15.
Interesting result by Kumar et al.🔥Many deep RL methods estimate value functions by bootstrapping. By connecting this to self-distillation in RKHS, they show the combo of bootstrapping & gradient descent can lead to an "implicit under-parameterization" (low-rank model) phenomenon.
We've been studying why deep RL is so hard, and we think we have another reason: implicit under-parameterization:
Iteratively training on your own targets is a kind of "self-distillation," and leads to loss of rank ->
w/ Aviral Kumar
@agarwl_
@its_dibya
A practical example of how self-distillation blocks overfitting to noise. Image-text pairs from the Internet are usually weakly correlated and noise is abundant. The authors show self-distillation w/ soft labels mitigates this noise. By
@berkeley_ai
(Cheng,
@mejoeyg
) and
@facebookai
.
Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation
pdf:
abs:
model achieves strong performance with only 3M image text pairs, 133x smaller than CLIP
Submission ☠ (deadline) for
#ICML2019
workshop "Generalization in Deep Networks" extended to May 23. Contribute to understanding this mysterious and 🎁(surprising) area, and join us at 🏖️ (Long Beach) for ☕ (hearing great invited speakers and accepted posters).
@icmlconf
☟