Christina Baek @ ICML

@_christinabaek

959 Followers · 277 Following · 13 Media · 50 Statuses

PhD student @mldcmu | Past: intern @GoogleAI

Joined June 2021
Pinned Tweet
@_christinabaek
Christina Baek @ ICML
3 months
Did you know that the optimizer Sharpness-Aware Minimization (SAM) is remarkably robust to heavy label noise, with gains of tens of percent over SGD? In our new work, we take a deep dive into how SAM achieves these gains. As it turns out, it’s not at all about sharpness at convergence!
@_christinabaek
Christina Baek @ ICML
2 years
Estimating out-of-distribution (OOD) performance is hard because labeled data is expensive. Can we predict OOD performance w/ only _unlabeled data_? In our work (), we show this can be done using models’ agreement. w/ @yidingjiang, Aditi R., @zicokolter
@_christinabaek
Christina Baek @ ICML
2 years
Come visit Hall J #1037 on Tue Nov 29th, at 4 pm CST to hear about Agreement-on-the-Line (). Attending NeurIPS all week, please feel free to reach out to talk about the work, generalization, etc. 🙂
@_christinabaek
Christina Baek @ ICML
4 months
Check out this really cool work w/ @yidingjiang ! We built a simple system (PCA + Clustering) for quantifying how "features" are distributed across models and data. Using this tool, we can mathematically understand the Generalization Disagreement Equality. 🤝
@yidingjiang
Yiding Jiang
4 months
Models with different randomness make different predictions at test time even if they are trained on the same data. In our latest ICLR paper (oral), we investigate how models learn different features, and the effect this has on agreement and (potentially) calibration. 1/
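A rough sketch of what such a PCA + clustering pipeline could look like (illustrative only; the array shapes and the use of scikit-learn are my assumptions, not the paper’s code):

```python
# Illustrative sketch: reduce per-example features with PCA, then cluster
# to get a coarse map of how "features" are distributed across the data.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def feature_clusters(activations, n_components=10, n_clusters=5):
    # activations: (n_examples, feature_dim), e.g. penultimate-layer
    # outputs concatenated across an ensemble of models (assumed setup)
    reduced = PCA(n_components=n_components).fit_transform(activations)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)
```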
@_christinabaek
Christina Baek @ ICML
2 years
The maximal coding rate reduction (MCR2) objective has many nice properties, but it's too expensive to feasibly train on large datasets: the # of log-det terms scales with the # of classes. Our reformulation is not only scalable, but also seems to better control for noise.
@YiMaTweets
Yi Ma
2 years
A paper from recently graduated students. Two key messages: 1. the rate reduction objective can be computed more efficiently via its Variational Form; 2. the Variational Form can be naturally interpreted as Dictionary Learning. All things seem connected...
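For context, a minimal numpy sketch of the original MCR2 objective (following Yu et al. 2020 up to constants; shapes are assumptions). It makes the scaling problem concrete: the compression term contributes one log-det per class.

```python
import numpy as np

def mcr2(Z, labels, eps=0.5):
    # Z: (d, n) feature matrix; labels: (n,) integer class ids
    d, n = Z.shape
    I = np.eye(d)
    expand = 0.5 * np.linalg.slogdet(I + (d / (n * eps**2)) * Z @ Z.T)[1]
    compress = 0.0
    for k in np.unique(labels):         # one log-det term per class:
        Zk = Z[:, labels == k]          # this is the count that scales
        nk = Zk.shape[1]                # with the number of classes
        compress += (nk / (2 * n)) * np.linalg.slogdet(
            I + (d / (nk * eps**2)) * Zk @ Zk.T)[1]
    return expand - compress            # rate reduction
```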
@_christinabaek
Christina Baek @ ICML
3 months
10/ Paper: While challenging, I learned a lot from this work w/ @zicokolter and @AdtRaghunathan. I began studying SAM w/ my Google mentors @bneyshabur, @TheGradient, @dara_bahri, and Shankar, whom I want to deeply thank for their guidance and discussion 😊
@_christinabaek
Christina Baek @ ICML
2 years
2/ Miller et al. () showed that a model’s in-distribution (ID) accuracy has a strong linear correlation with its OOD accuracy on a wide variety of datasets. Surprisingly, this is also true for agreement between different models.
@_christinabaek
Christina Baek @ ICML
3 months
4/ Instead, we argue that, to an extent, SAM “merely” acts by learning clean examples faster. In fact, fitting more clean examples before memorizing noisy ones, as measured by the gap between their training accuracies, correlates strongly with test accuracy.
@_christinabaek
Christina Baek @ ICML
3 months
1/ As shown in Foret et al. 2020, SAM’s gains over SGD are particularly notable under heavy label noise. We see a 20% boost on ResNet18 + CIFAR10 w/ 30% label noise. In fact, SAM achieves test accuracy comparable to other state-of-the-art label-noise robustness methods!
@_christinabaek
Christina Baek @ ICML
3 months
5/ We find a mechanistic explanation for this behavior by analyzing the role of different components of SAM’s update. By the chain rule, we decompose the gradient of the logistic loss into “logit scale” and “Jacobian” components, and see how SAM’s weight perturbation affects each.
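For a binary logistic loss, the decomposition looks roughly like this (a hedged sketch; the scalar-logit model `f` and the helper names are my assumptions, the paper treats the general case):

```python
import torch

def decomposed_grad(f, w, x, y):
    # per-example gradient factored as grad = logit_scale * jacobian
    w = w.clone().requires_grad_(True)
    z = f(w, x)                                # scalar logit
    logit_scale = -y * torch.sigmoid(-y * z)   # d/dz of log(1 + exp(-y z))
    jacobian = torch.autograd.grad(z, w)[0]    # dz/dw
    return logit_scale.detach(), jacobian
```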
@_christinabaek
Christina Baek @ ICML
3 months
6/ In linear models, SAM and SGD updates differ only in their logit scale. Here, SAM simply adds a constant to the logit scale, which has the effect of up-weighting the gradients of low-loss examples. Low-loss points tend to correspond to clean points early in training.
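A toy numeric check of this up-weighting effect in a linear model (assumed setup, not the paper’s experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
w, rho = rng.normal(size=5), 0.1

def grad(w, x, y):
    # logistic-loss gradient: sigma(-y w.x) * (-y x)
    return -y * x / (1 + np.exp(y * (w @ x)))

for _ in range(3):
    x, y = rng.normal(size=5), rng.choice([-1.0, 1.0])
    g = grad(w, x, y)
    w_adv = w + rho * g / np.linalg.norm(g)   # 1-SAM ascent step
    # the sigmoid factor grows under the perturbation, relatively more so
    # for low-loss (well-fit) examples -- SAM up-weights their gradients
    print(np.linalg.norm(grad(w_adv, x, y)) / np.linalg.norm(g))
```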
@_christinabaek
Christina Baek @ ICML
3 months
2/ SAM is an optimizer designed to find solutions in flatter regions of the loss landscape. To do this, SAM takes a gradient step evaluated at some _perturbed_ weight. We study 1-SAM in particular, which generally achieves the highest performance boosts.
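For reference, the per-example 1-SAM update in a few lines of numpy (following Foret et al. 2020; `loss_grad` is an assumed helper returning dL/dw):

```python
import numpy as np

def one_sam_step(w, x, y, loss_grad, rho=0.05, lr=0.1):
    g = loss_grad(w, x, y)                             # gradient at current weights
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the perturbed point
    return w - lr * loss_grad(w_adv, x, y)             # descend with the gradient there
```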
@_christinabaek
Christina Baek @ ICML
2 years
3/ Agreement measures how often two models make the same prediction. We see that ID vs OOD agreement is also strongly linearly correlated when ID vs OOD accuracy is. Also, these linear correlations have almost _the same slope and bias_. We call this “agreement-on-the-line”.
@_christinabaek
Christina Baek @ ICML
3 months
11/ I’m presenting this work in Vienna this week. Come visit our #ICLR2024 poster on Wednesday morning (Session 3, Poster No. 3)!
@_christinabaek
Christina Baek @ ICML
2 years
4/ Unlike accuracy, agreement can be computed without labels. Using agreement-on-the-line, we develop a straightforward method to predict OOD performance using unlabeled OOD test data. Given a set of models, we compute the agreement between every pair for ID and OOD test data.
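Computing these agreements needs nothing but model predictions, e.g. (sketch; treating `preds` as an (n_models, n_examples) array of predicted labels is my assumption):

```python
import itertools

def pairwise_agreement(preds):
    # fraction of points on which each pair of models predicts the same label
    return {(i, j): float((preds[i] == preds[j]).mean())
            for i, j in itertools.combinations(range(len(preds)), 2)}
```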
@_christinabaek
Christina Baek @ ICML
2 years
10/ Interestingly, the agreement-on-the-line phenomenon (specifically, the fact that the ID vs OOD accuracy and agreement slopes/biases are the same) appears to be empirically specific to neural networks.
@_christinabaek
Christina Baek @ ICML
2 years
5/ In practice, we compute agreement between a collection of pre-trained and from-scratch models w/ different architecture types, initializations, etc. Empirically, our approach is robust to the choice of models in the collection. Our results hold for both CNNs and transformers.
@_christinabaek
Christina Baek @ ICML
2 years
6/ Then we compute the slope and bias of ID vs OOD agreement. If the linear correlation is strong, by agreement-on-the-line, we know that the slope a and bias b match those of accuracy. Given ID accuracy, we may simply predict OOD_accuracy = a * ID_accuracy + b.
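The estimator is then just a line fit plus one evaluation, along these lines (a sketch in the thread’s notation; the paper fits after a probit transform, omitted here for brevity):

```python
import numpy as np

def predict_ood_accuracy(id_agreements, ood_agreements, id_accuracy):
    # fit slope a and bias b on ID vs OOD agreement (no labels needed),
    # then reuse them on ID accuracy, per agreement-on-the-line
    a, b = np.polyfit(id_agreements, ood_agreements, deg=1)
    return a * id_accuracy + b
```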
@_christinabaek
Christina Baek @ ICML
2 years
11/ This raises a number of interesting and open questions as to the significance of agreement between classifiers, the nature of deep network generalization, and beyond.
@_christinabaek
Christina Baek @ ICML
2 years
There's also a panel talk about the paper on Dec 6th at 8 pm EST.
@_christinabaek
Christina Baek @ ICML
3 months
8/ We prove that in 2-layer deep linear nets, SAM’s Jacobian regularizes the norm of the last-layer weights and features. Empirically, this also happens in deeper networks. Also, if we explicitly do SGD + regularize the feature/weight norm, we see a big boost in performance!
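The “SGD + explicit norm regularization” baseline could look roughly like this (hedged sketch; `feats`, `last_layer`, and `lam` are assumed names, not the paper’s code):

```python
import torch.nn.functional as F

def regularized_loss(logits, targets, feats, last_layer, lam=1e-3):
    # cross-entropy plus explicit penalties on feature and last-layer weight norms
    ce = F.cross_entropy(logits, targets)
    reg = feats.norm(dim=1).pow(2).mean() + last_layer.weight.norm().pow(2)
    return ce + lam * reg
```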
@_christinabaek
Christina Baek @ ICML
2 years
9/ Do we need to train a lot of models to get a good linear fit? We save checkpoints of one model every 5 epochs, and see that even the agreement between these pairs satisfies agreement-on-the-line. We can use this to estimate the model’s accuracy along its training trajectory.
@_christinabaek
Christina Baek @ ICML
2 years
8/ Our method even comes with a built-in “sanity check” to assess whether its OOD accuracy estimate is reasonable. Since our approach works best when agreement-on-the-line holds, we can check this condition (w/ only unlabeled data) to see if the estimate will be accurate.
@_christinabaek
Christina Baek @ ICML
3 months
7/ Is logit scale the only important component of SAM under label noise? It turns out no, for deep networks: while logit scale displays a similar up-weighting effect, applying SAM’s perturbation to the Jacobian term alone can recover most of SAM’s gains.
@_christinabaek
Christina Baek @ ICML
3 months
9/ We suspect that the Jacobian regularization has connections to label noise robustness methods that clip or regularize the magnitude of the network output as a means to balance the gradients of clean and noisy examples.
@_christinabaek
Christina Baek @ ICML
3 months
3/ While the tie between sharp minima and generalization is contentious, it’s definitely not the right lens for understanding SAM under heavy label noise. As shown above, the test accuracy increases and then drops back down, so we don’t actually care about what happens at convergence.
@_christinabaek
Christina Baek @ ICML
2 years
Learned a lot about sparse coding from this project during my master's with Professor Ma. 🍊
@_christinabaek
Christina Baek @ ICML
2 years
7/ We call this approach the ALine-S method, and we can improve upon it slightly by estimating all OOD accuracies for every model jointly, which we call the ALine-D method. See paper for details.
@_christinabaek
Christina Baek @ ICML
2 years
@TheGradient @GoogleAI @yidingjiang @zicokolter Thank you for hosting! 😊 Excited to learn a lot
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Interesting, do you happen to have any particular papers you can refer me to?
@_christinabaek
Christina Baek @ ICML
2 years
@acherm @yidingjiang @zicokolter That's a great question. Our experiments were mostly on shifts where the input dimension stayed the same (e.g. shifts on CIFAR10). For datasets where the images are of different input dimensions (e.g. ImageNet), we center crop the images to the same sizes.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Thank you for the references. Our method utilizes information from a set of models in a way that is very distant from the way deep ensembles are used in uncertainty estimation such as in Lakshminarayanan, 2017. We're not simply looking at the average of the softmax outputs.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter From our results, it does look like confidence based methods also perform better when accuracy is on the line. With better understanding of our method, perhaps more concrete connections to previous works can be made in the future.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Though connections can be made between mutual information and agreement, the way we relate ID vs OOD agreement to ID vs OOD accuracy is something quite different.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter Why our method works is definitely a mystery. I wrote a short note in the appendix on why this phenomenon seems to go beyond the requirement of the ensemble being calibrated.
@_christinabaek
Christina Baek @ ICML
2 years
@BlackHC @yidingjiang @zicokolter We focused our literature review on works that estimate accuracy in particular (thus generally no mention of detection other than Hendrycks). At present, I think there’s only a very loose connection in that we have an ensemble and use agreement (MI).
@_christinabaek
Christina Baek @ ICML
2 years
@savvyRL @kchonyc @OriSenbazuru @ml_collective absolutely! thank you for the invite