It’s been a blast working with the team, some of the best researchers and engineers in the world!
Soooo proud of what we’ve done so far, and look forward to more future releases.
We’re hiring. Join us!
Announcing Grok!
Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy, so intended to answer almost anything and, far harder, even suggest what questions to ask!
Grok is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use
Always amazed by what a small team can achieve.
At xAI, our pretraining team works with the infra team to debug hardware and bring up clusters; we design and build next-gen training frameworks to go beyond the limits of the current software stack; we design architectures, algorithms and
Excited to arrive at NeurIPS later today alongside some of my colleagues.
@xai
/
@grok
crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!
A comprehensive study on Bayesian inference in DNNs. I guess only within Google can you conduct such careful experiments. Interesting read!
Take-away: the Bayesian posterior is rather poor, and the prior seems to be a big problem (it doesn't scale to large nets).
In the pretraining team, we’re hiring ppl for performance optimization, especially for frontier model inference/serving with
@MalekiSaeed
. Please apply if you are interested!
In particular, you would be a very good fit if you can write performant cuda kernels and optimize
How to train very deep NNs without shortcuts, but still achieve competitive results on ImageNet?
Our ICLR paper gives a simple solution derived from kernel approx theory. We hope this could enable further research into deep models.
Will arrive in New Orleans for NeurIPS next Wednesday night and stay until Friday.
Excited to see old friends and meet new friends. Let me know if you want to meet! Particularly if you’re interested in opportunities at
@xai
One fun fact, unlike most other companies and labs, we move so fast that we never got time to write formal technical reports for all our model releases. 😄
The camera-ready version of Neural Kernel Network (NKN):
If you've ever worried about how to choose the kernel function for Gaussian processes, this is the paper for you. Let the data make the call!
@ssydasheng
will present it
@icmlconf
!!
New paper on studying how the critical batch size changes based on properties of the optimization algorithm (including momentum and preconditioning), through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model.
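For intuition, here's a minimal noisy-quadratic sketch in the spirit of the paper (my own toy setup: a diagonal Hessian `h` and gradient noise whose variance scales as `h / batch_size`; the paper's actual model and experiments differ). It shows the basic batch-size effect: at a fixed learning rate, larger batches reach a lower loss floor.

```python
import numpy as np

def run_nqm(lr, batch_size, steps=500, dim=100, seed=0):
    rng = np.random.default_rng(seed)
    h = 1.0 / np.arange(1, dim + 1)        # diagonal Hessian eigenvalues
    theta = np.ones(dim)                   # initial parameters
    for _ in range(steps):
        grad = h * theta                   # exact gradient of 0.5 * theta^T H theta
        noise = rng.normal(0.0, np.sqrt(h / batch_size))  # noise variance ~ h / B
        theta -= lr * (grad + noise)
    return 0.5 * np.sum(h * theta ** 2)    # final loss

small_batch = run_nqm(lr=0.5, batch_size=8)
large_batch = run_nqm(lr=0.5, batch_size=512)
# at a fixed learning rate, the larger batch settles at a lower loss
```

Each run takes a fraction of a second, which is the whole point: you can sweep learning rates and batch sizes exhaustively in a way that is impossible with real networks.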
Paper got rejected by NeurIPS, but I decided to "celebrate" it anyway.
It’s known that alternating updates avoid divergence in bilinear games, but they do not converge unless you average all the iterates.
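This is easy to check numerically on the simplest bilinear game f(x, y) = x*y (my own toy sketch, not from any particular paper): simultaneous updates blow up, alternating updates stay bounded but keep cycling, and the running average of the alternating iterates heads to the equilibrium (0, 0).

```python
def bilinear_game(eta=0.1, steps=2000, alternating=True):
    # min_x max_y f(x, y) = x * y; gradients: df/dx = y, df/dy = x
    x, y = 1.0, 1.0
    avg_x = avg_y = 0.0
    for t in range(1, steps + 1):
        x_new = x - eta * y
        y = y + eta * (x_new if alternating else x)  # alternating uses the fresh x
        x = x_new
        avg_x += (x - avg_x) / t       # running average of the iterates
        avg_y += (y - avg_y) / t
    return (x, y), (avg_x, avg_y)

last, avg = bilinear_game(alternating=True)    # last iterate cycles, average -> (0, 0)
div, _ = bilinear_game(alternating=False)      # simultaneous updates diverge
```

The alternating map has eigenvalues exactly on the unit circle, so the last iterate rotates forever; averaging cancels the rotation, which is why iterate averaging is needed for convergence.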
Why do people keep complaining about how simple the model or the setting is in a theory paper (while many important advances in ML are resulted from analyses in **simple** models/settings)?
Isn't simplicity actually a virtue?
It's now officially accepted by JMLR. This was my first JMLR submission. Very good experience with high-quality reviews!
Will consider submitting more to JMLR (rather than NeurIPS/ICML/...)
"A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints", by Guodong Zhang, Xuchan Bao, Laurent Lessard, Roger Grosse.
Just saw a new arXiv paper that did exactly the same study (with the same results) as what I've been working on.
But I'm actually happy, since I had decided to give up last week, thinking it wasn't interesting enough, even though I'd written a 6-page draft.
New work on solving minimax optimization locally. With
@YuanhaoWang3
Jimmy Ba.
We propose a novel algorithm which converges to and only converges to local minimax. The main innovation is a correction term on top of gradient descent-ascent.
Paper link:
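For a feel of the method, here's a minimal sketch of a Follow-the-Ridge-style update on a hypothetical toy quadratic f(x, y) = x^2 + 2xy - y^2 (my own example; see the paper for the actual algorithm and conditions). The y-step is gradient ascent plus the correction term H_yy^{-1} H_yx grad_x f:

```python
# Toy quadratic minimax problem: f(x, y) = x^2 + 2xy - y^2 (hypothetical example).
# Its local minimax (Stackelberg) point is (0, 0).
def follow_the_ridge(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx = 2 * x + 2 * y          # df/dx
        gy = 2 * x - 2 * y          # df/dy
        h_yy, h_yx = -2.0, 2.0      # d2f/dy2 and d2f/dydx (constant here)
        x = x - lr * gx                             # descent step on x
        y = y + lr * gy + lr * (h_yx / h_yy) * gx   # ascent step + ridge correction
    return x, y

x, y = follow_the_ridge(1.0, -1.0)
# both coordinates converge toward the local minimax point (0, 0)
```

The correction keeps y tracking the ridge argmax_y f(x, y) as x moves, which is what rules out convergence to spurious stationary points that plain gradient descent-ascent can get stuck at.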
Finally ...... our paper on "foresight pruning" just got accepted by
@iclr_conf
. We introduced a simple yet effective criterion for pruning networks before training and related it to recent NTK analysis.
#ICLR2020
@jeremyphoward
As long as you can scale the learning rate up (before it hits the limit), increasing batch size gives you perfect linear scaling without hurting generalization.
See our NQM paper () and another ICLR submission () for more details.
Finally got some time to code up the PyTorch version of noisy natural gradient. It is fun to reimplement an "old" paper. Currently, it reproduces some of the CIFAR results. Working on extending it to ImageNet, stay tuned.
K-FAC for large-batch training. To my knowledge, it's the first paper to use a second-order optimizer for large-batch ImageNet training.
Very impressive! With 1024 GPUs.
#money
For those who are interested in VOGN, you might also like to read my noisy natural gradient () paper, which derived the same connection between optimization and variational inference as VOGN (we also discussed the K-FAC approximation in addition to the diagonal one).
Great
#NeurIPS2019
tutorial kick-off by
@EmtiyazKhan
! Showing the unifying Bayesian Principle bridging Human & Deep Learning. Variational Online Gauss-Newton (VOGN; Osawa et al., '19) = A Bayesian Love Story ❤️
New paper alert:
We provide a unified and automated method to analyze first-order methods for smooth & strongly-monotone games. The convergence rate for any first-order method can be obtained via a mechanical procedure of deriving and solving an SDP.
Excited to share a new paper with Chaoqi,
@RogerGrosse
and
@FidlerSanja
: EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis, appearing at
@icmlconf
.
Paper:
Code:
To publish a paper at a top conference, you should fairly discuss existing works, or compare against them in the experiments section if they are related. It's your responsibility to tell readers how your work compares to other methods.
I feel the stipends at Canadian schools are much lower. I got ~2,000 CAD/month even with TAing three courses in my first year. The cost of living in Toronto is higher than in many US cities. Most apartments around campus would cost you 1,000+ even when sharing with others.
To increase transparency around grad school stipends, retweet this tweet with your department, university, and annual stipend. I'll go first: I'm a PhD student in the Computing and Mathematical Sciences (CMS) department at Caltech, and I'm paid $36k/year.
#StipendTransparency
In the paper, we show that a simple noisy quadratic model (NQM) is remarkably consistent with the batch size effects observed in real neural networks, while allowing us to run experiments in seconds.
Roger is a really knowledgeable person with a very deep understanding of deep learning and machine learning in general. I learned a lot from him over the past two years, and I believe you will too.
Registration Now Open: Introduction to Deep Learning 1: Neural Networks & Supervised Learning, created by Vector Faculty member and Canada CIFAR Research Chair,
@RogerGrosse
and taught by Vector Faculty. Learn more and register: .
@lm_zheng
and
@MalekiSaeed
pushed all the way to make inference so fast. And it was done in just a few days. In addition, none of this would have been possible without our 1000x engineer
@makro_ai
for building backends/infra for us. So happy to be the cheerleader.
Grok-2-mini just got a speed upgrade. Over the past few days, we have substantially improved our inference stack. These gains come from using custom algorithms for computation and communication kernels, along with more efficient batch scheduling and quantization.
Our inference
Interesting 🧵, but I think both theory and empirical discoveries are important for science. A healthy community should embrace both.
Most ML researchers are biased towards either pure theory or pure empiricism. IMHO, we need to put aside the distinction and do whatever is useful.
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
Will arrive in Vancouver for
#NeurIPS2019
a bit late (Monday night) due to a final exam.
I will present two posters (see below) in the main conference and one poster in SGO workshop (
@YuanhaoWang3
will give a 30-min contributed talk on that).
Reach out if you'd like to chat.
The
#NeurIPS2019
camera-ready version of our NQM paper () is out! We added a new section analyzing exponential moving average (EMA). EMA accelerates training a lot with little computation overhead. REALLY surprised that EMA hasn't been widely used so far!
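For anyone who hasn't used it: parameter EMA is just a shadow copy of the weights updated as a running average (a generic sketch with an assumed decay of 0.99, not the paper's exact setup). The shadow trajectory is far less noisy than the raw iterates:

```python
import numpy as np

def ema_update(shadow, params, decay=0.99):
    # shadow weights track an exponential moving average of the raw weights
    return decay * shadow + (1 - decay) * params

rng = np.random.default_rng(0)
shadow = 1.0
raw_trace, ema_trace = [], []
for _ in range(2000):
    params = 1.0 + rng.normal(0.0, 0.5)   # noisy iterates around the optimum 1.0
    shadow = ema_update(shadow, params)
    raw_trace.append(params)
    ema_trace.append(shadow)

raw = np.array(raw_trace)
avg = np.array(ema_trace)
# the EMA trajectory hugs the optimum far more tightly than the raw iterates
```

The overhead is one extra copy of the parameters and one multiply-add per step; you evaluate with the shadow weights while training continues on the raw ones.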
We've been emphasizing the positive side of our papers too much. To get a paper accepted, we typically hide some important parts of our research.
Now, it's time to talk about the limitations and flaws WITHOUT worrying about being rejected!
That's a really cool workshop.
Now the camera-ready version is available on arXiv (). With our criterion, you can find the lottery ticket before training!
Code is also available online! Do check it out.
We know well-tuned *positive* momentum can significantly speed up convergence in cooperative games (i.e., minimization problems), whereas negative momentum is preferred in simple bilinear games (i.e., purely adversarial games).
That's exactly my first research idea as a graduate student. I was trying to derive a cycle-VAE with an implicit prior, which turned out to be equivalent to CycleGAN. 🤷♂️
One main takeaway from both the DKS () and our paper is that the network at initialization should really adapt to the depth in order to succeed.
In retrospect, this seems quite obvious. We adapt the init to the width but often overlook the depth.
I had a well-planned 2020: 1. visit IAS in the spring; 2. start my DeepMind internship in the summer; 3. fly to China in October to attend the wedding of one of my best friends.
In the end, I only went to IAS for a week, my internship was postponed and I missed the wedding.
After my blog post on getting a Spotlight award at ICLR, as an Independent Researcher, I got nearly 2000 emails. Many have asked what happened after. – It's a painful story about losing the Google AI Residency due to COVID-19, and more. Here is that story!
Have been hearing about the generalization gap between SGD and adaptive gradient methods from many people ... However, I have never reproduced the gap in classification ... Surprisingly, I found that K-FAC is able to generalize as well as SGD, and it even performs better sometimes.
New paper with
@Guodzh
and James Martens analyzing theoretical convergence rates of natural gradient for wide networks. Under certain conditions, it behaves like gradient descent in output space, where everything's nice, smooth, and convex.
We are excited to launch a pilot Graduate Application Assistance Program
@UofTCompSci
!
Current graduate students help review pre-submission application materials with the focus on guiding underrepresented applicants.
Details:
#2
(Thu afternoon poster
#198
): We prove the global convergence of natural gradient descent for overparameterized neural networks. The intuition is that natural gradient descent is approximately output-space gradient descent.
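The output-space intuition can be written out in two lines (a standard Gauss-Newton sketch under squared-error loss; my notation, with the exact conditions in the paper):

```latex
L(\theta) = \tfrac{1}{2}\,\lVert f(\theta) - y \rVert^2,
\qquad \nabla_\theta L = J^\top (f - y),
\qquad F = J^\top J, \quad J = \partial f / \partial \theta

% natural gradient step and its first-order effect on the network outputs
\Delta\theta = -\eta\, F^{+} \nabla_\theta L
\;\Longrightarrow\;
\Delta f \approx J\, \Delta\theta
         = -\eta\, J (J^\top J)^{+} J^\top (f - y)
         = -\eta\,(f - y)
```

Here J(JᵀJ)⁺Jᵀ is the projection onto the range of J, so when J has full row rank the outputs follow plain gradient descent on the convex output-space objective.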
Just watched the first 5 mins of the video on reviewer
#2
for the LR grafting paper.
I'm speechless. I really hope empirical understanding work can be treated on par with theory papers.
I've always been a big fan of energy-based generative models, like Boltzmann Machines. But now most people think generative models are all about VAEs and GANs. 😶
#icml2018
Yoshua Bengio's thoughts:
(1) Explore different generative models, not just GANs.
(2) Consider working on Boltzmann Machines.
(3) DNNs learn patterns before memorizing noise.
(4) Regularization hinders memorization.
(5) Large noise favours large-volume minima over deep ones...
It's nice to see reviving interest in EBMs. My first project in machine learning was actually training a hybrid model combining a classifier and an EBM, back in 2016. Unfortunately, I didn't get it to work well at that time. Shout out to
@wgrathwohl
for the success and new insights.
@ben_golub
This is a graph I showed admitted grad students at the last visit day. The point was that they will never have as much confidence as they do right now, but with time they will regain ~75% of it.
Have seen a lot of recent multi-agent RL theory papers motivated by the successes of self-play. However, I think self-play is more of a single-agent algorithm, since you play against yourself (to collect the trajectories).
Any thoughts?
@SimonShaoleiDu
@yubai01
Through large scale experiments, we confirm that, as predicted by the NQM, preconditioning extends perfect batch size scaling to larger batch sizes than are possible with momentum SGD. Furthermore, unlike momentum, preconditioning can help at small batch sizes as well.
@yaroslavvb
We lack good benchmarks for NN optimization.
Also, we should compare whole scaling curves for different optimizers. Looking at a single batch-size point can be so misleading. I've seen so many papers claiming to improve the curvature approximation but running experiments with small batch sizes.
It has been shown that ResNets with batch norms are effectively shallow. I suspect the improved performance of deeper ResNets comes from ensembling.
In this sense, we haven't really figured out how to train very deep NNs.
Perhaps I'd make an even stronger argument: there is no evidence that GD (not just SGD) plays an irreplaceable role in neural network learning.
Here you are: "(Stochastic) Gradient Descent is not Necessary for Deep Learning" 😀
There's no evidence that SGD plays a fundamental role in generalization. With totally deterministic full-batch gradient descent, Resnet18 still gets >95% accuracy on CIFAR10. With data augmentation, full-batch Resnet152 gets 96.76%.
#3
(Sat 2:30 pm for contributed talk): We propose Follow-the-Ridge, a novel algorithm/dynamic that provably converges to and only converges to local minimax (Stackelberg equilibrium) in sequential games.
New paper on training with Differential Privacy (DP):
We make substantial progress in improving the accuracy of image classifiers under DP, and remarkably can almost match standard training performance when fine-tuning, even on ImageNet!
1/7
Then how about something in between (games are neither pure adversarial nor cooperative)? We show that negative momentum still accelerates the convergence locally, but with a suboptimal rate!
New work with
@YuanhaoWang3
.
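A quick numerical illustration of the bilinear case (my own toy game and step sizes, not the exact setting of the paper): alternating gradient descent-ascent with negative heavy-ball momentum converges on f(x, y) = x*y, while the same updates without momentum just cycle.

```python
# Alternating gradient descent-ascent with heavy-ball momentum on the
# bilinear game min_x max_y f(x, y) = x * y (toy setting, hypothetical step sizes).
def bilinear_momentum(beta, eta=0.5, steps=2000):
    x, y = 1.0, 1.0
    x_prev, y_prev = x, y
    for _ in range(steps):
        x_new = x - eta * y + beta * (x - x_prev)       # descent + momentum on x
        y_new = y + eta * x_new + beta * (y - y_prev)   # ascent + momentum on y
        x_prev, y_prev = x, y
        x, y = x_new, y_new
    return (x * x + y * y) ** 0.5                       # distance to equilibrium

negative = bilinear_momentum(beta=-0.5)  # shrinks toward the equilibrium (0, 0)
vanilla = bilinear_momentum(beta=0.0)    # keeps cycling around (0, 0)
```

Momentum with beta < 0 damps the rotational dynamics that plain alternating updates leave untouched, which is where the last-iterate convergence comes from.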
#1
(Thu morning poster
#174
): We show that a simple toy model captures the essential behavior of real neural networks while allowing us to run experiments in seconds, making it easy to test new ideas for practitioners and derive new, testable theoretical results for theorists.
Please share: looking for 2 grad students (fully funded) to join my group (
@TheSalonML
) at
@UWCheritonCS
! Deadline Dec 15. More deets: . Privacy & robustness are main topics of interest. Group is inclusive & I encourage folks from all backgrounds to apply!
Blog by Fabian on acceleration without momentum. A great read for the day!!
I was amazed by that when I first read the paper by Agarwal et al. I thought acceleration had to be achieved through some sort of momentum mechanism. Turns out well-chosen step sizes are enough.
New blog post: Acceleration without Momentum. After two blog posts on momentum, now one on how to get the same effect without it, just through some well-chosen step-sizes (🤯).
Proud to announce our paper "Functional Variational BNNs" . Here we introduce functional variational inference, which enables us to specify structured priors and perform inference in function space. The GIF shows BNN predictions under a periodic prior.
As an international student, I was fortunate to get some extra support from
@RogerGrosse
in my first year.
Later, I did a few internships, which helped a lot.
How time flies!
I took this ML course back in early 2015. I was so excited when I finished the first homework in Matlab. Right after this course, I decided to work on machine learning and started my first project.
@fhuszar
I did a paper analyzing batch-size scaling with different optimizers.
My experience is that generalization starts to suffer once the lr hits its limit. Any trick that lets you use a larger lr helps generalization in large-batch training, like label smoothing.
@aaron_defazio
@ReyhaneAskari
In my experience, the benefits of momentum in neural network training are mainly due to mini-batching (you won't see any benefit at all for very small batch sizes).
See both the paper by Shallue et al. () and my own ().
Well, at least I'm safe and all my family members are safe during the pandemic. I hope 2021 will be better, and that everyone stays safe and strong through these unusual times!
Our ICML paper "Differentiable Compositional Kernel Learning for Gaussian Processes" is now open-sourced in , along with GPflow-Slim , our customized GPflow with TensorFlow-style usage.
That could really change the game. Together with Vector funding, that's ~3k/month (roughly matching US stipends, I guess).
Really happy to see such a move!
@Guodzh
Starting with the coming year,
@UofTCompSci
has raised post-tuition take-home to ~30kCAD for MSc students and ~32kCAD for PhD students (not counting additional support from Vector, or your advisor).
Natural gradient isn't just for optimization; it can also improve uncertainty modeling in variational Bayesian neural nets. Train a matrix-variate Gaussian posterior using noisy K-FAC! New paper by
@jelly__zhang
and Shengyang Sun at the BDL Workshop.
Google has a significant fraction of the world's top AI talent, and yet Gmail has recently been marking as spam nearly every email from the undergraduates in my ML course.
It sometimes even spam-filters emails from my grad students, or replies to messages I sent.
@thegautamkamath
I bet if you could get some Chinese media (e.g. Synced
@SyncedTech
) to post your course on WeChat, you'd get way more fans and views.