Gintare Karolina Dziugaite

@gkdziugaite

3,765 Followers · 112 Following · 3 Media · 80 Statuses

Sr Research Scientist at Google DeepMind, Toronto. Member, Mila. Adjunct, McGill CS. PhD Machine Learning & MASt Applied Math (Cambridge), BSc Math (Warwick).

Toronto, Ontario
Joined January 2018
@gkdziugaite
Gintare Karolina Dziugaite
3 years
1/ Welcome, Twitter, to my 1st tweet (!): a 🧵 on new work, "Deep Learning on a Data Diet: Finding Important Examples Early in Training". We find that, at init, you can identify and prune a large % of DATA with NO effect on accuracy. w/ @mansiege @SuryaGanguli
@gkdziugaite
Gintare Karolina Dziugaite
3 years
Excited to be co-organizing this upcoming Deep Learning Summer School in my hometown. I am hopeful we can have in-person sessions, even if it means doing them outside in the sun! Details on the application process coming in January.
@EEMLcommunity
EEML
3 years
Mark your calendars! 🎇 The 2022 Eastern European Machine Learning (EEML) summer school will be in Vilnius, Lithuania, July 6-14! We are tentatively planning a hybrid format. More details coming soon. Stay tuned!
@gkdziugaite
Gintare Karolina Dziugaite
1 year
Deep learning may be hard, but deep un-learning is even harder. 💪 How do we efficiently remove the influence of specific training examples while maintaining good performance on the remainder? Announcing the NeurIPS Unlearning Competition 📢 Submit your best ideas! 🏆
@unlearning_2023
unlearning-challenge
1 year
📢 The NeurIPS 2023 Unlearning Competition is now open for submissions on @kaggle! 📢 The goal is to develop algorithms that can unlearn a subset of the training data. The competition is open to all, and there are prizes for the top performers!
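To make the task concrete, here is a minimal sketch of one naive baseline for this kind of unlearning problem: continue fine-tuning the trained model on the retain set only, so the forget set no longer contributes gradients. This is purely illustrative, not an official competition baseline; `model` and `retain_loader` are hypothetical placeholders.

```python
# Naive unlearning baseline (illustrative sketch, not an official baseline):
# fine-tune on the retain set only, excluding the forget set entirely.
import torch
import torch.nn.functional as F

def finetune_on_retain(model, retain_loader, epochs=1, lr=1e-4, device="cpu"):
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in retain_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return model
```

Competitive entries generally go well beyond this, e.g. by also perturbing or penalizing predictions on the forget set.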
@gkdziugaite
Gintare Karolina Dziugaite
3 years
Our NeurIPS 2021 spotlight presents more evidence that CMI is a universal framework for generalization. Joint work led by @HaghifamMahdi , in collaboration with @roydanroy and Shay Moran.
@HaghifamMahdi
Mahdi Haghifam
3 years
🔥 New paper 🔥 on @shortstein and @zakynthinou's CMI framework, demonstrating its unifying nature for obtaining optimal or near-optimal bounds for the expected excess risk in the realizable setting. Will be a spotlight at NeurIPS’21!
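For readers new to the framework: the basic generalization guarantee in the Steinke-Zakynthinou CMI setting, for losses bounded in [0,1], has roughly the following form (constants aside; the works referenced above prove sharper, near-optimal variants):

$$
\Big|\,\mathbb{E}\big[L_{\mathcal{D}}(A(S)) - L_{S}(A(S))\big]\,\Big| \;\le\; \sqrt{\frac{2\,\mathrm{CMI}_{\mathcal{D}}(A)}{n}},
$$

where $n$ is the sample size and $\mathrm{CMI}_{\mathcal{D}}(A)$ is the conditional mutual information between the algorithm's output and the sample-selection variables, given the supersample.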
@gkdziugaite
Gintare Karolina Dziugaite
3 years
Having recently taken a new position, I wanted to take the opportunity to thank everyone at Element AI and ServiceNow for three incredible years. Since day one, I had the unwavering support of my manager, advisors, and colleagues. 1/3
@gkdziugaite
Gintare Karolina Dziugaite
2 years
🔥 New on arXiv 🔥 What training data suffices for good pre-training? The answer can differ vastly depending on what pre-training is for. For finding lottery tickets, a small fraction of easy or randomly selected data works better than pre-training on all data!
@_BrettLarsen
Brett Larsen
2 years
1/ What about the dataset is important for networks to learn early in training? Our new work finds pre-training on a small set of “easy” examples is sufficient to discover inits w/ sparse trainable networks, in half the steps needed with the full dataset. 📜:
@gkdziugaite
Gintare Karolina Dziugaite
3 years
Poster session happening now!
(Quoting her "Deep Learning on a Data Diet" thread above.)
@gkdziugaite
Gintare Karolina Dziugaite
2 years
When does pruning succeed❓ Our new paper reveals some important connections between the loss landscape of a sparse subnetwork and its dense counterpart, and the role of linear mode connectivity.
@_BrettLarsen
Brett Larsen
2 years
1/ Early in training NNs, we can find very sparse subnetworks (lottery tickets) that match full-model performance. But why does this work and when does it break down? Our new paper shows we succeed primarily by using info about the dense solution. 📜:
@gkdziugaite
Gintare Karolina Dziugaite
2 years
Interested in sparse neural networks? Generalization? Pruning algorithms? Come to our NeurIPS poster this afternoon where we present our empirical study on how pruning affects generalization.
@tjingrant
Tian Jin
2 years
I'll present our work on pruning's effect on generalization @NeurIPS this Tuesday at 4pm (Hall J #715)! Pruning removes unimportant weights in a neural network. Practitioners have long noticed that pruning improves generalization; how does this happen? 1/n
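As background for the pruning discussion above, here is a hedged sketch of the simplest common baseline, one-shot global magnitude pruning, which zeroes the smallest-magnitude weights across the network. It is a generic illustration, not the method studied in the paper; `model` is a placeholder PyTorch module.

```python
# Generic one-shot global magnitude pruning (illustrative, not the paper's method).
import torch

@torch.no_grad()
def global_magnitude_prune(model, sparsity=0.9):
    # Collect weight matrices/filters; skip biases and normalization parameters.
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([w.abs().flatten() for w in weights])
    k = int(sparsity * all_mags.numel())
    if k == 0:
        return model
    threshold = all_mags.kthvalue(k).values  # k-th smallest magnitude overall
    for w in weights:
        w.mul_((w.abs() > threshold).float())  # zero out weights at or below the threshold
    return model
```

In practice one would re-train (or rewind and re-train) after pruning, and often prune iteratively rather than in one shot.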
@gkdziugaite
Gintare Karolina Dziugaite
8 months
In deep nets, we observe good generalization together with memorization. In this new work, we show that, in stochastic convex optimization, memorization of most of the training data is a necessary feature of optimal learning.
@HaghifamMahdi
Mahdi Haghifam
8 months
Classical ML approaches suggest memorization can harm generalization. However, the success of overparameterized DNNs challenges this belief. What is the role of memorization in learning? 🤔🤔🤔 We studied this question in the context of stochastic convex optimization.
@gkdziugaite
Gintare Karolina Dziugaite
6 months
How are LLM capabilities affected by pruning? Check out our @iclr_conf paper showing that ICL is preserved up to high levels of sparsity, in contrast to fact recall, which deteriorates quickly. Our analysis reveals which parts of the network are more prunable for a given capability.
@tjingrant
Tian Jin
6 months
When we down-scale LLMs (e.g. pruning), what happens to their capabilities? We studied the complementary skills of memory recall & in-context learning and consistently found that memory recall deteriorates much more quickly than ICL when down-scaling. See us @iclr_conf Session 1 #133 1/8
@gkdziugaite
Gintare Karolina Dziugaite
1 year
Check out a new pruning library, JaxPruner! Excited to see how it will impact those already working in network pruning and quantization, and attract new people interested in trying out / applying these methods in new domains 🚀
@utkuevci
utku
1 year
Hyped to share JaxPruner: a concise library for sparsity research. JaxPruner includes 10+ easy-to-modify baseline algorithms and provides integration with popular libraries like t5x, scenic, dopamine and fedjax. 1/7 Code: Paper:
@gkdziugaite
Gintare Karolina Dziugaite
1 year
Looking forward to participating in and speaking at this #ICML2023 workshop on PAC-Bayes and interactive learning. Working on related topics? Consider submitting! The deadline is May 31st!
@audurand
Audrey Durand
1 year
Thrilled to announce the speakers at the #ICML2023 workshop "PAC-Bayes Meets Interactive Learning": @Majumdar_Ani @gkdziugaite @MLpager @jonasro_ and Aaditya Ramdas! CFP: @icmlconf @IID_ULaval @Mila_Quebec @bguedj @TheBayesist @max_heuillet @flynn_hamish
@gkdziugaite
Gintare Karolina Dziugaite
2 years
This marks our 5th paper studying generalization using information theory 🎆 For interpolating algorithms, we show LOO-CMI vanishes iff risk vanishes (and at the same rate for polynomial decay). The 1st such connection in this literature 🔥 Congrats to @HaghifamMahdi on a superb line of work.
@HaghifamMahdi
Mahdi Haghifam
2 years
In new work, we propose Leave-One-Out Conditional Mutual Information, a variant of CMI, and show that it bounds E[generalization error]. What's new? A (mutual) information-based bound yields minimax rates for learning general VC classes in the realizable setting. How? Read on!
@gkdziugaite
Gintare Karolina Dziugaite
2 years
How does data affect pre-training for lottery tickets? 🤔 How much data do we actually need? 🤔🤔 Come to our NeurIPS poster 🕚 TODAY at 11am 🕚 to find out!
@mansiege
Mansheej Paul
2 years
Putting Lottery Tickets on a Data Diet! Come to our #NeurIPS2022 poster today (Dec 1) at 11 am, Hall J #407 ! Find out how just a tiny fraction of easy data is enough to find initializations with sparse trainable networks and speed up training! Check out our 🧵for a summary!
@gkdziugaite
Gintare Karolina Dziugaite
1 year
Check out this new exciting paper by my colleagues bringing together two hot topics: sparsity and scaling laws! 🔥
@elias_frantar
Elias Frantar
1 year
Excited to share our work "Scaling Laws for Sparsely-Connected Foundation Models" () where we develop the first scaling laws for (fine-grained) parameter-sparsity in the context of modern Transformers trained on massive datasets. 1/10
@gkdziugaite
Gintare Karolina Dziugaite
3 years
7/ We find that the mode SGD converges to is determined earlier in training for “prunable” examples (up to linear mode connectivity). In contrast, the optimization landscape evaluated on important-for-training examples is sensitive to SGD noise throughout training.
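For context on the linear mode connectivity check used here, the usual procedure is to linearly interpolate between two sets of weights obtained by training copies of the same checkpoint under different SGD noise (data order, augmentation) and look for a loss barrier along the path. The sketch below is my own illustration under those assumptions; `model_a`, `model_b`, `loss_fn`, and `loader` are placeholders, not the authors' code.

```python
# Linear mode connectivity check (illustrative sketch).
import copy
import torch

@torch.no_grad()
def interpolation_losses(model_a, model_b, loss_fn, loader, n_points=11):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        model = copy.deepcopy(model_a)
        # Interpolate floating-point tensors; copy integer buffers unchanged.
        mixed = {k: ((1 - alpha) * sd_a[k] + alpha * sd_b[k])
                 if sd_a[k].is_floating_point() else sd_a[k]
                 for k in sd_a}
        model.load_state_dict(mixed)
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(y)
            count += len(y)
        losses.append(total / count)
    return losses  # a roughly flat curve (no barrier) indicates linear mode connectivity
```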
@gkdziugaite
Gintare Karolina Dziugaite
3 years
2/ We observe that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations (the "GraNd" score), can be used to identify a subset of training data that suffices to train to high accuracy. But...
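A minimal sketch of how a GraNd-style score might be computed, assuming a PyTorch classifier; `model_fn` (returns a freshly initialized model) and `dataset` (a sequence of (x, y) pairs) are hypothetical placeholders, and the per-example loop is written for clarity rather than speed. This is not the paper's implementation.

```python
# GraNd-style score: per-example loss gradient norm at initialization,
# averaged over several random inits (illustrative sketch).
import torch
import torch.nn.functional as F

def grand_scores(model_fn, dataset, n_inits=5, device="cpu"):
    scores = torch.zeros(len(dataset))
    for seed in range(n_inits):
        torch.manual_seed(seed)
        model = model_fn().to(device)          # fresh random initialization
        for i, (x, y) in enumerate(dataset):   # per-example gradient, one example at a time
            model.zero_grad()
            logits = model(x.unsqueeze(0).to(device))
            loss = F.cross_entropy(logits, torch.tensor([y], device=device))
            loss.backward()
            sq_norm = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
            scores[i] += float(sq_norm) ** 0.5
    return scores / n_inits  # larger score: example looks more important at init
```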
@gkdziugaite
Gintare Karolina Dziugaite
2 years
I want to highlight a fantastic initiative by a good friend of mine -- a mentoring program teaching students advanced physics, well beyond the high-school level, in an innovative way. If you know any 15-16-year-olds with a passion for physics, please share this opportunity with them!
@physics_beyond
PhysicsBeyond
2 years
Do you aspire to a career in #STEM #research? We are awarding #scholarships worth $9,500 per year for the innovative and dedicated #mentoring programme #BeyondResearch for #students aged 15-16:
@gkdziugaite
Gintare Karolina Dziugaite
3 years
3/ After only a few epochs of training, the information in our GraNd score is reflected in the normed error (the L2 distance between the predicted probabilities and the one-hot labels), which can be used to efficiently prune a significant % of data without sacrificing test accuracy.
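A minimal sketch of the normed-error (EL2N-style) score just described, evaluated a few epochs into training; `model`, `loader` (un-shuffled, over the training set), and `num_classes` are hypothetical placeholders rather than the authors' code.

```python
# EL2N-style score: L2 distance between softmax predictions and one-hot labels
# (illustrative sketch).
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, loader, num_classes, device="cpu"):
    model.eval()
    scores = []
    for x, y in loader:
        probs = F.softmax(model(x.to(device)), dim=1)
        one_hot = F.one_hot(y.to(device), num_classes).float()
        scores.append(torch.linalg.vector_norm(probs - one_hot, dim=1).cpu())
    return torch.cat(scores)  # low-score examples are candidates for data pruning
```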
@gkdziugaite
Gintare Karolina Dziugaite
3 years
6/ Our work sheds light on how the underlying data distribution shapes training dynamics: our scores rank examples based on importance for generalization, detect noisy examples, and identify subspaces of the model's data representation that are relatively stable over training.
@gkdziugaite
Gintare Karolina Dziugaite
3 years
5/ Toneva et al. find that some examples are rarely forgotten, while others are forgotten repeatedly. They show one can prune rarely forgotten examples. Their "forget" score is usually computed after training: we observe that it works after a few epochs, but not at init.
@gkdziugaite
Gintare Karolina Dziugaite
3 years
Alas, the time came for me to take on a new challenge, but I will not forget the great community where I started my professional career. 3/3
@gkdziugaite
Gintare Karolina Dziugaite
3 years
I also had the privilege to attend multiple programs at the Simons Institute and a special year at the IAS. I am grateful for the opportunities I got to think beyond my own research and lead a team with incredible colleagues. 2/3
@gkdziugaite
Gintare Karolina Dziugaite
3 years
4/ Based on these findings, we propose data pruning methods that use only local information early in training, and connect them to work by Toneva et al. (2018) that tracks the # of times an example transitions, over the course of training, from being correctly classified to misclassified.
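An illustrative sketch of the Toneva et al. forgetting statistic referenced above: count, for each example, how often it flips from correctly classified to misclassified between consecutive checkpoints. The `correct_history` array is a hypothetical input (checkpoints × examples, boolean), not part of the original code.

```python
# Forgetting events per example (illustrative sketch of the Toneva et al. statistic).
import numpy as np

def forgetting_counts(correct_history: np.ndarray) -> np.ndarray:
    # correct_history[t, i]: was example i classified correctly at checkpoint t?
    prev, curr = correct_history[:-1], correct_history[1:]
    # a forgetting event is a transition from correct to incorrect
    return np.sum(prev & ~curr, axis=0)

# Examples with a count of 0 are the "rarely forgotten" ones that can often be pruned.
```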
@gkdziugaite
Gintare Karolina Dziugaite
3 years
8/ There are many more questions to be answered! Feel free to reach out after you've read the paper!
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@davelewisdotir @mansiege @SuryaGanguli Thanks for the reference, David. I was not aware of this work. The approach and details seem similar to coreset approaches, which we discuss, but even that literature doesn't cite this work, interestingly. We'll make sure readers know about it.
@gkdziugaite
Gintare Karolina Dziugaite
11 months
It was a great experience to give an in-person talk at #MLinPL to such an active audience with lots of thoughtful questions! Thanks again to the fantastic hosts who made me feel really welcome and made sure my visit was well planned.
@MLinPL
ML in PL
11 months
We would like to extend a big thank you to our sponsor @GoogleDeepMind and the affiliated invited speakers: @VladMnih and @gkdziugaite . We were thrilled to host you at the ML in PL Conference 2023!
@gkdziugaite
Gintare Karolina Dziugaite
1 year
Sharing this fantastic opportunity for high school students interested in STEM! It’s a unique mentoring program enabling young people to engage in research and advanced topics in math. Applications are due in September 🗓️
@physics_beyond
PhysicsBeyond
1 year
🌟 Exciting news! Applications for the #BeyondResearch programme, an innovative, dedicated two-year mentoring and teaching programme for gifted and talented students in STEM, are open again! 🚀🔭 Discover more at 👉 #opportunity #education #Maths
@gkdziugaite
Gintare Karolina Dziugaite
2 years
@Robert_Baldock @irinarish @JonasAndrulis @jefrankle @nandofioretto @sarahookr Noticed this a bit too late. Can you share a link to the paper?
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@jamtou @mansiege @SuryaGanguli @LipizzanerGAN How did I not find this paper given the title!? :) thanks for sharing, looks very interesting.
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@botian_ @mansiege @SuryaGanguli I am curious too :) In our experiments, going from CIFAR10 to CIFAR100, the percentage of data we need doubled.
@gkdziugaite
Gintare Karolina Dziugaite
2 years
@jefrankle @roydanroy @MITIBMLab @GoogleAI Congratulations! I look forward to future collaborations with Dr. Frankle 🎓
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@HsseinMzannar @mansiege @SuryaGanguli Great question. Here we select examples from the given training set, and do not consider creating new inputs that would summarize the training data. My intuition is that one could do much better using the latter approach, but at what computational cost? Curious to read more.
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@MyNameIsTooLon @mansiege @SuryaGanguli The EL2N score is associated with an individual example. The Brier score, in this setting, would average over training examples. So the average EL2N score over a dataset could be called a Brier score. It's a nice connection; thanks for pointing it out.
@gkdziugaite
Gintare Karolina Dziugaite
10 months
@Abhinav98M I don't think so, but this talk was a longer version of the one I gave at ICML'23.
@gkdziugaite
Gintare Karolina Dziugaite
3 years
@nsaphra @kchonyc We confirmed this finding on several standard architectures trained on vision datasets, for all of which rewinding to initialization does *not* work (one needs to rewind to some point early in training).
@gkdziugaite
Gintare Karolina Dziugaite
2 years
@OmarRivasplata @stats_UCL Thanks for the invite and a great discussion after the talk!