Sr Research Scientist at Google DeepMind, Toronto. Member, Mila. Adjunct, McGill CS. PhD Machine Learning & MASt Applied Math (Cambridge), BSc Math (Warwick).
1/ Welcome, Twitter, to my 1st tweet (!) a 🧵 on new work "Deep Learning on a Data Diet: Finding Important Examples Early in Training". We find that, at initialization, you can identify and prune a large % of the DATA with NO effect on accuracy. w/
@mansiege
@SuryaGanguli
Excited to be co-organizing this upcoming Deep Learning Summer School in my hometown. I am hopeful we can have in-person sessions, even if it means doing them outside in the sun! Details on the application process coming in January.
Mark your calendars! 🎇 The 2022 Eastern European Machine Learning summer school will be in Vilnius, Lithuania, July 6-14! We are tentatively planning a hybrid format. More details coming soon. Stay tuned!
Deep learning may be hard, but deep un-learning is even harder. 💪
How do we efficiently remove the influence of specific training examples while maintaining good performance on the remainder?
Announcing NeurIPS Unlearning Competition 📢 Submit your best ideas!🏆
📢The NeurIPS 2023 Unlearning Competition is now open for submissions
@kaggle
! 📢 The goal is to develop algorithms that can unlearn a subset of the training data. The competition is open to all, and there are prizes for the top performers!
Our NeurIPS 2021 spotlight presents more evidence that CMI is a universal framework for generalization. Joint work led by
@HaghifamMahdi
, in collaboration with
@roydanroy
and Shay Moran.
🔥 New paper 🔥on
@shortstein
and
@zakynthinou
's CMI framework, demonstrating its unifying nature for obtaining optimal or near-optimal bounds for the expected excess risk in the realizable setting.
Will be a spotlight at NeurIPS’21!
Having recently taken a new position, I wanted to take the opportunity to thank everyone at Element AI and ServiceNow for three incredible years. Since day one, I have had unwavering support from my manager, advisors, and colleagues. 1/3
🔥New on ArXiv 🔥 What training data suffices for good pre-training? The answer can differ vastly depending on what pre-training is for. For finding lottery tickets, a small fraction of easy or randomly selected data works better than pre-training on all data!
1/ What about the dataset is important for networks to learn early in training?
Our new work finds that pre-training on a small set of "easy" examples is sufficient to discover initializations with sparse trainable networks, in half as many steps as pre-training on the full dataset.
📜:
When does pruning succeed❓
Our new paper reveals some important connections between the loss landscape of a sparse subnetwork and its dense counterpart, and the role of linear mode connectivity.
1/ Early in training NNs, we can find very sparse subnetworks (lottery tickets) that match full model performance. But why does this work, and when does it break down?
Our new paper shows we succeed primarily by using info about the dense solution.
📜:
Interested in sparse neural networks? Generalization? Pruning algorithms? Come to our NeurIPS poster this afternoon where we present our empirical study on how pruning affects generalization.
I'll present our work about pruning's effect on generalization
@NeurIPS
this Tuesday at 4pm (located at Hall J
#715
)!
Pruning removes unimportant weights in a neural network. Practitioners have long noticed that pruning improves generalization. How does this happen? 1/n
In deep nets, we observe good generalization together with memorization. In this new work, we show that, in stochastic convex optimization, memorization of most of the training data is a necessary feature of optimal learning.
Classical ML approaches suggest memorization can harm generalization. However, the success of overparameterized DNNs challenges this belief.
What is the role of memorization in learning? 🤔🤔🤔
We studied this question in the context of stochastic convex optimization.
How are LLM capabilities affected by pruning? Check out our
@iclr_conf
paper showing that ICL is preserved until high levels of sparsity, in contrast to fact recall which quickly deteriorates. Our analysis reveals which part of the network is more prunable for a given capability.
When we down-scale LLMs (e.g. via pruning), what happens to their capabilities? We studied the complementary skills of memory recall and in-context learning, and consistently found that memory recall deteriorates much more quickly than ICL when down-scaling.
See us
@iclr_conf
Session 1
#133
1/8
Check out a new pruning library, JaxPruner! Excited to see how it will impact those already working in network pruning and quantization, and attract new people interested in trying out / applying these methods in new domains 🚀
Hyped to share JaxPruner: a concise library for sparsity research.
JaxPruner includes 10+ easy-to-modify baseline algorithms and provides integration with popular libraries like t5x, scenic, dopamine and fedjax. 1/7
Code:
Paper:
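As a toy illustration of the kind of baseline such a library ships, here is global magnitude pruning in plain NumPy. This is a sketch of the general technique, not JaxPruner's actual API; the function name and tie-breaking behavior are my own choices.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Global magnitude pruning: zero out the `sparsity` fraction of
    weights with the smallest absolute value. Returns the pruned
    weights and the binary mask."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    # k-th smallest magnitude becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    # Strict inequality may prune slightly more than k weights on ties.
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask
```

For example, pruning a 2x2 weight matrix at 50% sparsity removes the two smallest-magnitude entries and keeps the rest untouched.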
Looking forward to participating and talking at this
#ICML2023
workshop on PAC-Bayes and interactive learning. Working on related topics? Consider submitting! The deadline is May 31st!
This marks our 5th paper studying generalization using information theory 🎆 For interpolating algorithms, we show LOO-CMI vanishes iff risk vanishes (and at the same rate for polynomial decay). The 1st such connection in this literature 🔥 Congrats to
@HaghifamMahdi
on a superb line of work.
In new work, we propose Leave-One-Out Conditional Mutual Information, a variant of CMI, and show that it bounds E[generalization error].
What's new? A (mutual) information-based bound yields minimax rates for learning general VC classes in the realizable setting.
How? Read on!
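For context (my gloss, not part of the thread): the original Steinke–Zakynthinou CMI bound, which LOO-CMI refines, controls the expected generalization gap of an algorithm $A$ trained on $n$ samples roughly as

```latex
\mathbb{E}\!\left[ L_{\mathcal{D}}(A(S)) - L_{S}(A(S)) \right]
  \;\le\; \sqrt{\frac{2\,\mathrm{CMI}_{\mathcal{D}}(A)}{n}}
```

This is a sketch from memory of the bound's general shape; see the papers for the precise statement and constants.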
How does data affect pre-training for lottery tickets? 🤔
How much data do we actually need? 🤔🤔
Come to our NeurIPS poster 🕚 TODAY at 11am 🕚 to find out!
Putting Lottery Tickets on a Data Diet! Come to our
#NeurIPS2022
poster today (Dec 1) at 11 am, Hall J
#407
! Find out how just a tiny fraction of easy data is enough to find initializations with sparse trainable networks and speed up training! Check out our 🧵for a summary!
Excited to share our work "Scaling Laws for Sparsely-Connected Foundation Models" () where we develop the first scaling laws for (fine-grained) parameter-sparsity in the context of modern Transformers trained on massive datasets. 1/10
7/ We find that the mode SGD converges to is determined earlier in training for “prunable” examples (up to linear mode connectivity). In contrast, the optimization landscape evaluated on important-for-training examples is sensitive to SGD noise throughout training.
2/ We observe that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations (the "GraND" score), can be used to identify a subset of training data that suffices to train to high accuracy. But...
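As a toy illustration of the GraND idea (not the paper's exact recipe, which uses per-example parameter gradients of a deep net), here is a NumPy sketch for a linear softmax model, where the per-example gradient norm has a closed form. The function name, `num_inits`, and the initialization scale are illustrative choices:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grand_scores(X, y_onehot, num_inits=10, seed=0):
    """Per-example gradient-norm (GraND-style) scores for a linear
    softmax classifier, averaged over several random initializations.

    For a linear model z = x @ W, the cross-entropy gradient w.r.t. W
    for one example is the outer product (p - y) x^T, whose Frobenius
    norm is ||p - y||_2 * ||x||_2.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    k = y_onehot.shape[1]
    scores = np.zeros(n)
    for _ in range(num_inits):
        W = rng.normal(scale=0.01, size=(X.shape[1], k))  # fresh init
        p = softmax(X @ W)                # predicted probabilities
        err = np.linalg.norm(p - y_onehot, axis=1)
        scores += err * np.linalg.norm(X, axis=1)
    return scores / num_inits             # average over initializations
```

Examples with higher scores would then be kept when pruning the dataset.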
I want to highlight a fantastic initiative by a good friend of mine -- a mentoring program teaching students advanced physics way beyond high-school level in an innovative way. If you know any 15-16 year olds with a passion for physics, please share this opportunity with them!
3/ After only a few epochs of training, the information in our GraND score is reflected in the normed error (the "EL2N" score: the L2 distance between the predicted probabilities and the one-hot labels), which can be used to efficiently prune a significant % of the data without sacrificing test accuracy.
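The normed-error score described above is simple to compute; a minimal NumPy sketch (function names are illustrative, and "keep the hardest fraction" is one natural pruning rule):

```python
import numpy as np

def el2n_scores(probs, labels_onehot):
    """EL2N score per example: L2 distance between the model's
    predicted class probabilities and the one-hot label vector."""
    return np.linalg.norm(probs - labels_onehot, axis=1)

def keep_hardest(scores, keep_frac):
    """Prune the dataset by keeping only the highest-score fraction."""
    k = int(len(scores) * keep_frac)
    return np.argsort(scores)[-k:]  # indices of the k hardest examples
```

A confidently correct prediction gets a low score, while an uncertain or wrong one gets a high score, so low-score examples are the first candidates for pruning.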
6/ Our work sheds light on how the underlying data distribution shapes training dynamics: our scores rank examples based on importance for generalization, detect noisy examples, and identify subspaces of the model's data representation that are relatively stable over training.
5/ Toneva et al. find that some examples are rarely forgotten, while others are forgotten repeatedly. They show one can prune rarely forgotten examples. Their "forget" score is usually computed after training: we observe that it works after a few epochs, but not at init.
I also had the privilege to attend multiple programs at the Simons Institute and a special year at the IAS. I am grateful for the opportunities I got to think beyond my own research and lead a team with incredible colleagues. 2/3
4/ Based on these findings, we propose data pruning methods that use only local information early in training, and connect them to work by Toneva et al. (2018), which tracks the number of times through training an example transitions from being correctly classified to misclassified.
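The forgetting statistic from Toneva et al. can be computed from a per-epoch correctness log; a minimal sketch (the array layout is my assumption, not the paper's implementation):

```python
import numpy as np

def forgetting_counts(correct_history):
    """Count forgetting events per example, in the spirit of
    Toneva et al. (2018).

    correct_history: boolean array of shape (num_epochs, num_examples),
    True where the example was classified correctly at that epoch.
    A forgetting event is a transition from correct to misclassified
    between consecutive epochs.
    """
    h = np.asarray(correct_history, dtype=bool)
    forgets = h[:-1] & ~h[1:]   # correct at epoch t, wrong at t+1
    return forgets.sum(axis=0)  # events per example
```

Examples that are rarely forgotten (count near zero) are the ones Toneva et al. show can be pruned.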
@davelewisdotir
@mansiege
@SuryaGanguli
Thanks for the reference, David. I was not aware of this work. The approach and details seem similar to coreset approaches, which we discuss, but even that literature doesn't cite this work, interestingly. We'll make sure readers know about it.
It was a great experience to give an in-person talk at
#MLinPL
to such an engaged audience with lots of thoughtful questions! Thanks again to the fantastic hosts who made me feel really welcome and made sure my visit was really well planned.
We would like to extend a big thank you to our sponsor
@GoogleDeepMind
and the affiliated invited speakers:
@VladMnih
and
@gkdziugaite
. We were thrilled to host you at the ML in PL Conference 2023!
Sharing this fantastic opportunity for high school students interested in STEM! It’s a unique mentoring program enabling young people to engage in research and advanced topics in math. Applications are due in September 🗓️
🌟 Exciting news! Applications for the
#BeyondResearch
programme, an innovative, dedicated two-year mentoring and teaching programme for gifted and talented students in STEM, are open again! 🚀🔭 Discover more at
👉
#opportunity
#education
#Maths
@botian_
@mansiege
@SuryaGanguli
I am curious too :) In our experiments, going from CIFAR10 to CIFAR100, the percentage of data we need doubled.
@HsseinMzannar
@mansiege
@SuryaGanguli
Great question.
Here we select examples from the given training set, and do not consider creating new inputs that would summarize the training data. My intuition is that one could do much better using the latter approach, but at what computational cost? Curious to read more.
@MyNameIsTooLon
@mansiege
@SuryaGanguli
The EL2N score is associated with an individual example. The Brier score, in this setting, would average over training examples. So the average EL2N score over a dataset could be called a Brier score. It's a nice connection; thanks for pointing it out.
@nsaphra
@kchonyc
We confirmed this finding on several standard architectures trained on vision datasets, for all of which rewinding to initialization does *not* work (one needs to rewind to some point early in training).