Christina Baek @ ICML

@_christinabaek

959 Followers · 277 Following · 13 Media · 50 Statuses

PhD student @mldcmu | Past: intern @GoogleAI

Joined June 2021
Pinned Tweet
@_christinabaek
Christina Baek @ ICML
3 months
Did you know that the optimizer Sharpness-Aware Minimization (SAM) is remarkably robust to heavy label noise, with gains of tens of percent over SGD? In our new work, we take a deep dive into how SAM achieves these gains. As it turns out, it’s not at all about sharpness at convergence!
@_christinabaek
Christina Baek @ ICML
2 years
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (), we show this can be done using models’ agreement. w/ @yidingjiang, Aditi R., @zicokolter
@_christinabaek
Christina Baek @ ICML
2 years
Come visit Hall J #1037 on Tue Nov 29th, at 4 pm CST to hear about Agreement-on-the-Line (). Attending NeurIPS all week, please feel free to reach out to talk about the work, generalization, etc. 🙂
@_christinabaek
Christina Baek @ ICML
4 months
Check out this really cool work w/ @yidingjiang ! We built a simple system (PCA + Clustering) for quantifying how "features" are distributed across models and data. Using this tool, we can mathematically understand the Generalization Disagreement Equality. 🤝
@yidingjiang
Yiding Jiang
4 months
Models with different randomness make different predictions at test time even if they are trained on the same data. In our latest ICLR paper (oral), we investigate how models learn different features, and the effect this has on agreement and (potentially) calibration. 1/
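A rough sketch of what such a PCA + clustering pipeline could look like (illustrative only; the array shapes and the use of scikit-learn are my assumptions, not the paper’s code):

```python
# Illustrative sketch: reduce per-example features with PCA, then cluster
# to get a coarse map of how "features" are distributed across the data.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def feature_clusters(activations, n_components=10, n_clusters=5):
    # activations: (n_examples, feature_dim), e.g. penultimate-layer
    # outputs concatenated across an ensemble of models (assumed setup)
    reduced = PCA(n_components=n_components).fit_transform(activations)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)
```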
@_christinabaek
Christina Baek @ ICML
2 years
The maximal coding rate reduction (MCR2) objective has many nice properties, but it's too expensive to feasibly train on large datasets: the # of log-det terms scales with the # of classes. Our reformulation is not only scalable, but also seems to better control for noise.
@YiMaTweets
Yi Ma
2 years
A paper from recently graduated students. Two key messages: 1. the rate reduction objective can be computed more efficiently via its Variational Form; 2. the Variational Form can be naturally interpreted as Dictionary Learning. All things seem connected...
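For context, a minimal numpy sketch of the original MCR2 objective (following Yu et al. 2020 up to constants; shapes are assumptions). It makes the scaling problem concrete: the compression term contributes one log-det per class.

```python
import numpy as np

def mcr2(Z, labels, eps=0.5):
    # Z: (d, n) feature matrix; labels: (n,) integer class ids
    d, n = Z.shape
    I = np.eye(d)
    expand = 0.5 * np.linalg.slogdet(I + (d / (n * eps**2)) * Z @ Z.T)[1]
    compress = 0.0
    for k in np.unique(labels):         # one log-det term per class:
        Zk = Z[:, labels == k]          # this is the count that scales
        nk = Zk.shape[1]                # with the number of classes
        compress += (nk / (2 * n)) * np.linalg.slogdet(
            I + (d / (nk * eps**2)) * Zk @ Zk.T)[1]
    return expand - compress            # rate reduction
```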
@_christinabaek
Christina Baek @ ICML
3 months
10/ Paper: While challenging, I learned a lot from this work w/ @zicokolter and @AdtRaghunathan. I began studying SAM w/ my Google mentors @bneyshabur, @TheGradient, @dara_bahri, and Shankar, whom I want to deeply thank for their guidance and discussion 😊
@_christinabaek
Christina Baek @ ICML
2 years
2/ Miller et al. () showed that a model’s in-distribution (ID) accuracy has a strong linear correlation with its OOD accuracy on a wide variety of datasets. Surprisingly, this is also true for agreement between different models.
@_christinabaek
Christina Baek @ ICML
3 months
4/ Instead, we argue that, to an extent, SAM “merely” acts by learning clean examples faster. In fact, fitting more clean examples before memorizing noisy ones, as measured by the gap between their training accuracies, correlates strongly with test accuracy.
@_christinabaek
Christina Baek @ ICML
3 months
1/ As shown in Foret et al. 2020, SAM’s gains over SGD are particularly notable under heavy label noise. We see a 20% boost on ResNet18 + CIFAR10 w/ 30% label noise. In fact, SAM achieves test accuracy comparable to other state-of-the-art label-noise robustness methods!
@_christinabaek
Christina Baek @ ICML
3 months
5/ We find a mechanistic explanation for this behavior by analyzing the role of different components of SAM’s update. By the chain rule, we decompose the gradient of the logistic loss into “logit scale” and “Jacobian” components, and see how SAM’s weight perturbation affects each.
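For a binary logistic loss, the decomposition looks roughly like this (a hedged sketch; the scalar-logit model `f` and the helper names are my assumptions, the paper treats the general case):

```python
import torch

def decomposed_grad(f, w, x, y):
    # per-example gradient factored as grad = logit_scale * jacobian
    w = w.clone().requires_grad_(True)
    z = f(w, x)                                # scalar logit
    logit_scale = -y * torch.sigmoid(-y * z)   # d/dz of log(1 + exp(-y z))
    jacobian = torch.autograd.grad(z, w)[0]    # dz/dw
    return logit_scale.detach(), jacobian
```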
@_christinabaek
Christina Baek @ ICML
3 months
6/ In linear models, SAM and SGD updates differ only in their logit scale. Here, SAM simply adds a constant to the logit scale, which has the effect of up-weighting the gradients of low-loss examples. Low-loss points tend to correspond to clean points early in training.
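A toy numeric check of this up-weighting effect in a linear model (assumed setup, not the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
w, rho = rng.normal(size=5), 0.1

def grad(w, x, y):
    # logistic-loss gradient: sigma(-y w.x) * (-y x)
    return -y * x / (1 + np.exp(y * (w @ x)))

for _ in range(3):
    x, y = rng.normal(size=5), rng.choice([-1.0, 1.0])
    g = grad(w, x, y)
    w_adv = w + rho * g / np.linalg.norm(g)   # 1-SAM ascent step
    # the sigmoid factor grows under the perturbation, relatively more so
    # for low-loss (well-fit) examples -- SAM up-weights their gradients
    print(np.linalg.norm(grad(w_adv, x, y)) / np.linalg.norm(g))
```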
@_christinabaek
Christina Baek @ ICML
3 months
2/ SAM is an optimizer designed to find solutions in flatter regions of the loss landscape. To do this, SAM takes a gradient step evaluated at some _perturbed_ weight. We study 1-SAM in particular, which generally achieves the highest performance boosts.
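For reference, the per-example 1-SAM update in a few lines of numpy (following Foret et al. 2020; `loss_grad` is an assumed helper returning dL/dw):

```python
import numpy as np

def one_sam_step(w, x, y, loss_grad, rho=0.05, lr=0.1):
    g = loss_grad(w, x, y)                             # gradient at current weights
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the perturbed point
    return w - lr * loss_grad(w_adv, x, y)             # descend with the gradient there
```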
@_christinabaek
Christina Baek @ ICML
2 years
3/ Agreement measures how often two models make the same prediction. We see that ID vs OOD agreement is also strongly linearly correlated when ID vs OOD accuracy is. Also, these linear correlations have almost _the same slope and bias_. We call this “agreement-on-the-line”.
@_christinabaek
Christina Baek @ ICML
3 months
11/ I’m presenting this work in Vienna this week. Come visit our #ICLR2024 poster on Wednesday morning (Session 3, Poster No. 3)!
@_christinabaek
Christina Baek @ ICML
2 years
4/ Unlike accuracy, agreement can be computed without labels. Using agreement-on-the-line, we develop a straightforward method to predict OOD performance using unlabeled OOD test data. Given a set of models, we compute the agreement between every pair for ID and OOD test data.
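Computing these agreements needs nothing but model predictions, e.g. (sketch; treating `preds` as an (n_models, n_examples) array of predicted labels is my assumption):

```python
import itertools

def pairwise_agreement(preds):
    # fraction of points on which each pair of models predicts the same label
    return {(i, j): float((preds[i] == preds[j]).mean())
            for i, j in itertools.combinations(range(len(preds)), 2)}
```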
@_christinabaek
Christina Baek @ ICML
2 years
10/ Interestingly, the agreement-on-the-line phenomenon (specifically, the fact that the ID vs OOD accuracy and agreement slopes/biases are the same) appears to be empirically specific to neural networks.
@_christinabaek
Christina Baek @ ICML
2 years
5/ In practice, we compute agreement between a collection of pre-trained and from-scratch models w/ different architecture types, initializations, etc. Empirically, our approach is robust to the choice of models in the collection. Our results hold for both CNNs and transformers.
@_christinabaek
Christina Baek @ ICML
2 years
6/ Then we compute the slope and bias of ID vs OOD agreement. If the linear correlation is strong, by agreement-on-the-line, we know that the slope a and bias b match those of accuracy. Given ID accuracy, we may simply predict OOD_accuracy = a * ID_accuracy + b.
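The estimator is then just a line fit plus one evaluation, along these lines (a sketch in the thread’s notation; the paper fits after a probit transform, omitted here for brevity):

```python
import numpy as np

def predict_ood_accuracy(id_agreements, ood_agreements, id_accuracy):
    # fit slope a and bias b on ID vs OOD agreement (no labels needed),
    # then reuse them on ID accuracy, per agreement-on-the-line
    a, b = np.polyfit(id_agreements, ood_agreements, deg=1)
    return a * id_accuracy + b
```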
@_christinabaek
Christina Baek @ ICML
2 years
11/ This raises a number of interesting and open questions as to the significance of agreement between classifiers, the nature of deep network generalization, and beyond.
@_christinabaek
Christina Baek @ ICML
2 years
There's also a panel talk about the paper on Dec 6th at 8 pm EST.
@_christinabaek
Christina Baek @ ICML
3 months
8/ We prove that in 2-layer deep linear nets, SAM’s Jacobian regularizes the norm of the last-layer weights and features. Empirically, this also happens in deeper networks. Also, if we explicitly do SGD + regularize the feature/weight norm, we see a big boost in performance!
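The “SGD + explicit norm regularization” baseline could look roughly like this (hedged sketch; `feats`, `last_layer`, and `lam` are assumed names, not the paper’s code):

```python
import torch.nn.functional as F

def regularized_loss(logits, targets, feats, last_layer, lam=1e-3):
    # cross-entropy plus explicit penalties on feature and last-layer weight norms
    ce = F.cross_entropy(logits, targets)
    reg = feats.norm(dim=1).pow(2).mean() + last_layer.weight.norm().pow(2)
    return ce + lam * reg
```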
@_christinabaek
Christina Baek @ ICML
2 years
9/ Do we need to train a lot of models to get a good linear fit? We save checkpoints of one model every 5 epochs, and see that even the agreement between these pairs satisfies agreement-on-the-line. We can use this to estimate the model’s accuracy along its training trajectory.
@_christinabaek
Christina Baek @ ICML
2 years
8/ Our method even comes with a built-in “sanity check” to assess whether its OOD accuracy estimate is reasonable. Since our approach works best when agreement-on-the-line holds, we can check this condition (w/ only unlabeled data) to see if the estimate will be accurate.
@_christinabaek
Christina Baek @ ICML
3 months
7/ Is logit scale the only important component of SAM under label noise? It turns out no, for deep networks: while logit scale displays a similar up-weighting effect, applying SAM’s perturbation to the Jacobian term alone can recover most of SAM’s gains.
@_christinabaek
Christina Baek @ ICML
3 months
9/ We suspect that the Jacobian regularization has connections to label noise robustness methods that clip or regularize the magnitude of the network output as a means to balance the gradients of clean and noisy examples.
@_christinabaek
Christina Baek @ ICML
3 months
3/ While the tie between sharp minima and generalization is contentious, it’s definitely not the right lens for understanding SAM under heavy label noise. As shown above, the test accuracy increases and then drops back down, so we don’t actually care about what happens at convergence.
@_christinabaek
Christina Baek @ ICML
2 years
Learned a lot about sparse coding from this project during my master's with Professor Ma. 🍊
@_christinabaek
Christina Baek @ ICML
2 years
7/ We call this approach the ALine-S method, and we can improve upon it slightly by estimating all OOD accuracies for every model jointly, which we call the ALine-D method. See paper for details.
@_christinabaek
Christina Baek @ ICML
2 years
@TheGradient @GoogleAI @yidingjiang @zicokolter Thank you for hosting! 😊 Excited to learn a lot
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Interesting, do you happen to have any particular papers you can refer me to?
@_christinabaek
Christina Baek @ ICML
2 years
@acherm @yidingjiang @zicokolter That's a great question. Our experiments were mostly on shifts where the input dimension stayed the same (e.g. shifts on CIFAR10). For datasets where the images are of different input dimensions (e.g. ImageNet), we center crop the images to the same sizes.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Thank you for the references. Our method utilizes information from a set of models in a way that is very distant from the way deep ensembles are used in uncertainty estimation such as in Lakshminarayanan, 2017. We're not simply looking at the average of the softmax outputs.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter From our results, it does look like confidence based methods also perform better when accuracy is on the line. With better understanding of our method, perhaps more concrete connections to previous works can be made in the future.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Though connections can be made between mutual information and agreement, the way we relate ID vs OOD agreement to ID vs OOD accuracy is something quite different.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Why our method works is definitely a mystery. I wrote a short note in the appendix on why this phenomenon seems to go beyond the requirement of the ensemble being calibrated.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter We focused our literature review on works that estimate accuracy in particular (thus generally no mention of detection other than Hendrycks). At present, I think there’s only a very loose connection in that we have an ensemble and use agreement (MI).
@_christinabaek
Christina Baek @ ICML
2 years
@savvyRL @kchonyc @OriSenbazuru @ml_collective absolutely! thank you for the invite