It’s been a blast working with the team, some of the best researchers and engineers in the world!
Soooo proud of what we’ve done so far, and look forward to more future releases.
We’re hiring. Join us!
Announcing Grok!
Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy, so intended to answer almost anything and, far harder, even suggest what questions to ask!
Grok is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use
Always amazed by what a small team can achieve.
At xAI, our pretraining team works with the infra team to debug hardware and bring up clusters; we design and build next-gen training frameworks to go beyond the limits of the current software stack; we design architectures, algorithms and
Excited to arrive at NeurIPS later today alongside some of my colleagues.
@xai
/
@grok
crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!
A comprehensive study on Bayesian inference in DNNs. I guess only within Google can you conduct such careful experiments. Interesting read!
Take-away: the Bayesian posterior is rather poor, and the prior seems to be a big problem (it doesn't scale to large nets).
In the pretraining team, we’re hiring ppl for performance optimization, especially for frontier model inference/serving with
@MalekiSaeed
. Please apply if you are interested!
In particular, you would be a very good fit if you can write performant cuda kernels and optimize
How to train very deep NNs without shortcuts, but still achieve competitive results on ImageNet?
Our ICLR paper gives a simple solution derived from kernel approx theory. We hope this could enable further research into deep models.
Will arrive in New Orleans for NeurIPS next Wednesday night and stay until Friday.
Excited to see old friends and meet new friends. Let me know if you want to meet! Particularly if you’re interested in opportunities at
@xai
One fun fact, unlike most other companies and labs, we move so fast that we never got time to write formal technical reports for all our model releases. 😄
The camera-ready version of Neural Kernel Network (NKN):
If you've ever worried about how to choose the kernel function for Gaussian processes, this is the paper for you. Let the data make the call!
@ssydasheng
will present it
@icmlconf
!!
New paper on studying how the critical batch size changes based on properties of the optimization algorithm (including momentum and preconditioning), through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model.
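For intuition, here's a minimal noisy-quadratic sketch in the spirit of the paper (my own toy setup: a diagonal Hessian `h` and gradient noise whose variance scales as `h / batch_size`; the paper's actual model and experiments differ). It shows the basic batch-size effect: at a fixed learning rate, larger batches reach a lower loss floor.

```python
import numpy as np

def run_nqm(lr, batch_size, steps=500, dim=100, seed=0):
    rng = np.random.default_rng(seed)
    h = 1.0 / np.arange(1, dim + 1)        # diagonal Hessian eigenvalues
    theta = np.ones(dim)                   # initial parameters
    for _ in range(steps):
        grad = h * theta                   # exact gradient of 0.5 * theta^T H theta
        noise = rng.normal(0.0, np.sqrt(h / batch_size))  # noise variance ~ h / B
        theta -= lr * (grad + noise)
    return 0.5 * np.sum(h * theta ** 2)    # final loss

small_batch = run_nqm(lr=0.5, batch_size=8)
large_batch = run_nqm(lr=0.5, batch_size=512)
# at a fixed learning rate, the larger batch settles at a lower loss
```

Each run takes a fraction of a second, which is the whole point: you can sweep learning rates and batch sizes exhaustively in a way that is impossible with real networks.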
Paper got rejected by NeurIPS, but I decided to "celebrate" it anyway.
It’s known that alternating updates avoid divergence in bilinear games, but they do not converge unless you average all the iterates.
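This is easy to check numerically on the simplest bilinear game f(x, y) = x*y (my own toy sketch, not from any particular paper): simultaneous updates blow up, alternating updates stay bounded but keep cycling, and the running average of the alternating iterates heads to the equilibrium (0, 0).

```python
def bilinear_game(eta=0.1, steps=2000, alternating=True):
    # min_x max_y f(x, y) = x * y; gradients: df/dx = y, df/dy = x
    x, y = 1.0, 1.0
    avg_x = avg_y = 0.0
    for t in range(1, steps + 1):
        x_new = x - eta * y
        y = y + eta * (x_new if alternating else x)  # alternating uses the fresh x
        x = x_new
        avg_x += (x - avg_x) / t       # running average of the iterates
        avg_y += (y - avg_y) / t
    return (x, y), (avg_x, avg_y)

last, avg = bilinear_game(alternating=True)    # last iterate cycles, average -> (0, 0)
div, _ = bilinear_game(alternating=False)      # simultaneous updates diverge
```

The alternating map has eigenvalues exactly on the unit circle, so the last iterate rotates forever; averaging cancels the rotation, which is why iterate averaging is needed for convergence.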
Why do people keep complaining about how simple the model or the setting is in a theory paper (while many important advances in ML are resulted from analyses in **simple** models/settings)?
Isn't simplicity actually a virtue?
It's now officially accepted by JMLR. This was my first JMLR submission. Very good experience with high-quality reviews!
Will consider submitting more to JMLR (rather than NeurIPS/ICML/...)
"A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints", by Guodong Zhang, Xuchan Bao, Laurent Lessard, Roger Grosse.
Just saw a new arXiv paper that did exactly the same study (with the same results) as what I've been working on.
But I'm actually happy, since I had decided to give up last week, thinking it wasn't interesting enough, even though I'd written a 6-page draft.
New work on solving minimax optimization locally. With
@YuanhaoWang3
Jimmy Ba.
We propose a novel algorithm which converges to and only converges to local minimax. The main innovation is a correction term on top of gradient descent-ascent.
Paper link:
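For a feel of the method, here's a minimal sketch of a Follow-the-Ridge-style update on a hypothetical toy quadratic f(x, y) = x^2 + 2xy - y^2 (my own example; see the paper for the actual algorithm and conditions). The y-step is gradient ascent plus the correction term H_yy^{-1} H_yx grad_x f:

```python
# Toy quadratic minimax problem: f(x, y) = x^2 + 2xy - y^2 (hypothetical example).
# Its local minimax (Stackelberg) point is (0, 0).
def follow_the_ridge(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx = 2 * x + 2 * y          # df/dx
        gy = 2 * x - 2 * y          # df/dy
        h_yy, h_yx = -2.0, 2.0      # d2f/dy2 and d2f/dydx (constant here)
        x = x - lr * gx                             # descent step on x
        y = y + lr * gy + lr * (h_yx / h_yy) * gx   # ascent step + ridge correction
    return x, y

x, y = follow_the_ridge(1.0, -1.0)
# both coordinates converge toward the local minimax point (0, 0)
```

The correction keeps y tracking the ridge argmax_y f(x, y) as x moves, which is what rules out convergence to spurious stationary points that plain gradient descent-ascent can get stuck at.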
Finally ...... our paper on "foresight pruning" just got accepted by
@iclr_conf
. We introduced a simple yet effective criterion for pruning networks before training and related it to recent NTK analysis.
#ICLR2020
@jeremyphoward
As long as you can scale the learning rate up (before it hits the limit), increasing batch size gives you perfect linear scaling without hurting generalization.
See our NQM paper () and another ICLR submission () for more details.
Finally got some time to code up the PyTorch version of noisy natural gradient. It is fun to reimplement an "old" paper. Currently, it reproduces some of the CIFAR results. Working on extending it to ImageNet, stay tuned.
K-FAC for large-batch training. To my knowledge, it's the first paper to use a second-order optimizer for large-batch ImageNet training.
Very impressive! With 1024 GPUs.
#money
For those who are interested in VOGN, you might also like to read my noisy natural gradient () paper, which derived the same connection between optimization and variational inference as VOGN (we also discussed the K-FAC approximation in addition to the diagonal one).
Great
#NeurIPS2019
tutorial kick-off by
@EmtiyazKhan
! Showing the unifying Bayesian Principle bridging Human & Deep Learning. Variational Online Gauss-Newton (VOGN; Osawa et al., '19) = A Bayesian Love Story ❤️
New paper alert:
We provide a unified and automated method to analyze first-order methods for smooth & strongly-monotone games. The convergence rate for any first-order method can be obtained via a mechanical procedure of deriving and solving an SDP.
Excited to share a new paper with Chaoqi,
@RogerGrosse
and
@FidlerSanja
: EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis, appearing at
@icmlconf
.
Paper:
Code:
To publish a paper at a top conference, you should fairly discuss existing works, or compare against them in the experiments section if they are related. It's your responsibility to tell readers how your work compares to other methods.
I feel the stipends at Canadian schools are much lower. I got ~2,000 CAD/month even with TAing three courses in my first year. The cost of living in Toronto is higher than in many US cities. Most apartments around campus would cost you 1,000+ even when sharing with others.
To increase transparency around grad school stipends, retweet this tweet with your department, university, and annual stipend. I'll go first: I'm a PhD student in the Computing and Mathematical Sciences (CMS) department at Caltech, and I'm paid $36k/year.
#StipendTransparency
In the paper, we show that a simple noisy quadratic model (NQM) is remarkably consistent with the batch size effects observed in real neural networks, while allowing us to run experiments in seconds.
Roger is a really knowledgeable person with a very deep understanding of deep learning and machine learning in general. I learned a lot from him over the past two years, and I believe you will too.
Registration Now Open: Introduction to Deep Learning 1: Neural Networks & Supervised Learning, created by Vector Faculty member and Canada CIFAR Research Chair,
@RogerGrosse
and taught by Vector Faculty. Learn more and register: .
@lm_zheng
and
@MalekiSaeed
pushed all the way to make inference so fast. And it was done in just a few days. In addition, none of this would have been possible without our 1000x engineer
@makro_ai
for building backends/infra for us. So happy to be the cheerleader.
Grok-2-mini just got a speed upgrade. Over the past few days, we have substantially improved our inference stack. These gains come from using custom algorithms for computation and communication kernels, along with more efficient batch scheduling and quantization.
Our inference
Interesting 🧵, but I think both theory and empirical discoveries are important for science. A healthy community should embrace both.
Most ML researchers are biased towards either pure theory or pure empiricism. IMHO, we need to put aside the distinction and do whatever is useful.
My recent talk at the NSF town hall focused on the history of the AI winters, how the ML community became "anti-science," and whether the rejection of science will cause a winter for ML theory. I'll summarize these issues below...🧵
Will arrive in Vancouver for
#NeurIPS2019
a bit late (Monday night) due to a final exam.
I will present two posters (see below) in the main conference and one poster in SGO workshop (
@YuanhaoWang3
will give a 30-min contributed talk on that).
Reach out if you'd like to chat.
The
#NeurIPS2019
camera-ready version of our NQM paper () is out! We added a new section analyzing exponential moving average (EMA). EMA accelerates training a lot with little computation overhead. REALLY surprised that EMA hasn't been widely used so far!
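For anyone who hasn't used it: parameter EMA is just a shadow copy of the weights updated as a running average (a generic sketch with an assumed decay of 0.99, not the paper's exact setup). The shadow trajectory is far less noisy than the raw iterates:

```python
import numpy as np

def ema_update(shadow, params, decay=0.99):
    # shadow weights track an exponential moving average of the raw weights
    return decay * shadow + (1 - decay) * params

rng = np.random.default_rng(0)
shadow = 1.0
raw_trace, ema_trace = [], []
for _ in range(2000):
    params = 1.0 + rng.normal(0.0, 0.5)   # noisy iterates around the optimum 1.0
    shadow = ema_update(shadow, params)
    raw_trace.append(params)
    ema_trace.append(shadow)

raw = np.array(raw_trace)
avg = np.array(ema_trace)
# the EMA trajectory hugs the optimum far more tightly than the raw iterates
```

The overhead is one extra copy of the parameters and one multiply-add per step; you evaluate with the shadow weights while training continues on the raw ones.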
We've been emphasizing the positive side of our papers too much. To get a paper accepted, we typically hide some important parts of our research.
Now, it's time to talk about the limitations and flaws WITHOUT worrying about being rejected!
That's a really cool workshop.
Now the camera-ready version is available on arXiv (). With our criterion, you can find the lottery ticket before training!
Code is also available online! Do check it out.
We know well-tuned *positive* momentum can significantly speed up convergence in cooperative games (i.e., minimization problems), whereas negative momentum is preferred in simple bilinear games (i.e., purely adversarial games).
That's exactly my first research idea as a graduate student. I was trying to derive a cycle-VAE with an implicit prior, which turned out to be equivalent to CycleGAN. 🤷♂️
One main takeaway from both the DKS () and our paper is that the network at initialization should really adapt to the depth in order to succeed.
In retrospect, this seems quite obvious. We adapt the init to the width but often overlook the depth.
I had a well-planned 2020: 1. visit IAS in the spring; 2. start my DeepMind internship in the summer; 3. fly to China in October to attend the wedding of one of my best friends.
In the end, I only went to IAS for a week, my internship was postponed and I missed the wedding.
After my blog post on getting a Spotlight award at ICLR, as an Independent Researcher, I got nearly 2000 emails. Many have asked what happened after. – It's a painful story about losing the Google AI Residency due to COVID-19, and more. Here is that story!
Have been hearing about the generalization gap between SGD and adaptive gradient methods from many people ... However, I have never reproduced the gap in classification ... Surprisingly, I found that K-FAC is able to generalize as well as SGD, and it even performs better sometimes.
New paper with
@Guodzh
and James Martens analyzing theoretical convergence rates of natural gradient for wide networks. Under certain conditions, it behaves like gradient descent in output space, where everything's nice, smooth, and convex.
We are excited to launch a pilot Graduate Application Assistance Program
@UofTCompSci
!
Current graduate students help review pre-submission application materials with the focus on guiding underrepresented applicants.
Details:
#2
(Thu afternoon poster
#198
): We prove the global convergence of natural gradient descent for overparameterized neural networks. The intuition is that natural gradient descent is approximately output-space gradient descent.
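The output-space intuition can be written out in two lines (a standard Gauss-Newton sketch under squared-error loss; my notation, with the exact conditions in the paper):

```latex
L(\theta) = \tfrac{1}{2}\,\lVert f(\theta) - y \rVert^2,
\qquad \nabla_\theta L = J^\top (f - y),
\qquad F = J^\top J, \quad J = \partial f / \partial \theta

% natural gradient step and its first-order effect on the network outputs
\Delta\theta = -\eta\, F^{+} \nabla_\theta L
\;\Longrightarrow\;
\Delta f \approx J\, \Delta\theta
         = -\eta\, J (J^\top J)^{+} J^\top (f - y)
         = -\eta\,(f - y)
```

Here J(JᵀJ)⁺Jᵀ is the projection onto the range of J, so when J has full row rank the outputs follow plain gradient descent on the convex output-space objective.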
Just watched the first 5 mins of the video on reviewer
#2
for the LR grafting paper.
I'm speechless. I really hope empirical understanding work can be treated on par with theory papers.
I've always been a big fan of energy-based generative models, like Boltzmann Machines. But now most people think generative models are all about VAEs and GANs. 😶
#icml2018
Yoshua Bengio's thoughts:
(1) Explore different generative models, not just GANs.
(2) Consider working on Boltzmann Machines.
(3) DNNs learn patterns before memorizing noise.
(4) Regularization hinders memorization.
(5) Large noise favours large-volume minima over deep ones...
It's nice to see reviving interest in EBMs. My first project in machine learning was actually training a hybrid model combining a classifier and an EBM, back in 2016. Unfortunately, I didn't get it to work well at that time. Shout out to
@wgrathwohl
for the success and new insights.
@ben_golub
This is a graph I showed admitted grad students at the last visit day. The point was that they will never have as much confidence as they do right now, but with time they will regain ~75% of it.
Have seen a lot of recent multi-agent RL theory papers motivated by the successes of self-play. However, I think self-play is more of a single-agent algorithm, since you play against yourself (to collect the trajectories).
Any thoughts?
@SimonShaoleiDu
@yubai01
Through large scale experiments, we confirm that, as predicted by the NQM, preconditioning extends perfect batch size scaling to larger batch sizes than are possible with momentum SGD. Furthermore, unlike momentum, preconditioning can help at small batch sizes as well.
@yaroslavvb
We lack good benchmarks for NN optimization.
Also, we should compare whole scaling curves for different optimizers. Looking at a single batch-size point can be so misleading. I've seen so many papers claiming to improve the curvature approximation but running experiments with small batch sizes.
It has been shown that ResNets with batch norms are effectively shallow. I suspect the improved performance of deeper ResNets comes from ensembling.
In this sense, we haven't really figured out how to train very deep NNs.
Perhaps I'd make an even stronger argument: there is no evidence that GD (not just SGD) plays an irreplaceable role in neural network learning.
Here you are: "(Stochastic) Gradient Descent is not Necessary for Deep Learning" 😀
There's no evidence that SGD plays a fundamental role in generalization. With totally deterministic full-batch gradient descent, Resnet18 still gets >95% accuracy on CIFAR10. With data augmentation, full-batch Resnet152 gets 96.76%.
#3
(Sat 2:30 pm for contributed talk): We propose Follow-the-Ridge, a novel algorithm/dynamic that provably converges to and only converges to local minimax (Stackelberg equilibrium) in sequential games.
New paper on training with Differential Privacy (DP):
We make substantial progress in improving the accuracy of image classifiers under DP, and remarkably can almost match standard training performance when fine-tuning, even on ImageNet!
1/7
Then how about something in between (games are neither pure adversarial nor cooperative)? We show that negative momentum still accelerates the convergence locally, but with a suboptimal rate!
New work with
@YuanhaoWang3
.
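A quick numerical illustration of the bilinear case (my own toy game and step sizes, not the exact setting of the paper): alternating gradient descent-ascent with negative heavy-ball momentum converges on f(x, y) = x*y, while the same updates without momentum just cycle.

```python
# Alternating gradient descent-ascent with heavy-ball momentum on the
# bilinear game min_x max_y f(x, y) = x * y (toy setting, hypothetical step sizes).
def bilinear_momentum(beta, eta=0.5, steps=2000):
    x, y = 1.0, 1.0
    x_prev, y_prev = x, y
    for _ in range(steps):
        x_new = x - eta * y + beta * (x - x_prev)       # descent + momentum on x
        y_new = y + eta * x_new + beta * (y - y_prev)   # ascent + momentum on y
        x_prev, y_prev = x, y
        x, y = x_new, y_new
    return (x * x + y * y) ** 0.5                       # distance to equilibrium

negative = bilinear_momentum(beta=-0.5)  # shrinks toward the equilibrium (0, 0)
vanilla = bilinear_momentum(beta=0.0)    # keeps cycling around (0, 0)
```

Momentum with beta < 0 damps the rotational dynamics that plain alternating updates leave untouched, which is where the last-iterate convergence comes from.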
#1
(Thu morning poster
#174
): We show that a simple toy model captures the essential behavior of real neural networks while allowing us to run experiments in seconds, making it easy to test new ideas for practitioners and derive new, testable theoretical results for theorists.
Please share: looking for 2 grad students (fully funded) to join my group (
@TheSalonML
) at
@UWCheritonCS
! Deadline Dec 15. More deets: . Privacy & robustness are main topics of interest. Group is inclusive & I encourage folks from all backgrounds to apply!
Blog by Fabian on acceleration without momentum. A great read for the day!!
I was amazed by that when I first read the paper by Agarwal et al. I thought acceleration had to be achieved through some sort of momentum mechanism. Turns out well-chosen step sizes are enough.
New blog post: Acceleration without Momentum. After two blog posts on momentum, now one on how to get the same effect without it, just through some well-chosen step-sizes (🤯).
Proud to announce our paper "Functional Variational BNNs" . Here we introduce functional variational inference, which enables us to specify structured priors and perform inference in function space. The GIF shows BNN predictions under a periodic prior.
As an international student, I was fortunate to get some extra support from
@RogerGrosse
in my first year.
Later, I did a few internships, which helped a lot.
How time flies!
I took this ML course back in early 2015. I was so excited when I finished the first homework in Matlab. Right after this course, I decided to work on machine learning and started my first project.
@fhuszar
I did a paper analyzing batch-size scaling with different optimizers.
My experience is that generalization starts to suffer once the lr hits its limit. Any trick that lets you use a larger lr helps generalization in large-batch training, like label smoothing.
@aaron_defazio
@ReyhaneAskari
In my experience, the benefits of momentum in neural network training are mainly due to mini-batching (you won't see any benefit at all for very small batch sizes).
See both the paper by Shallue et al. () and my own ().
Well, at least I'm safe and all my family members are safe during the pandemic. I hope 2021 will be better, and that everyone stays safe and strong through these unusual times!
Our ICML paper "Differentiable Compositional Kernel Learning for Gaussian Processes" is now open-sourced in , along with GPflow-Slim , our customized GPflow with TensorFlow-style usage.
That could really change the game. Together with Vector funding, that's ~3k/month (roughly matching US stipends, I guess).
Really happy to see such a move!
@Guodzh
Starting with the coming year,
@UofTCompSci
has raised post-tuition take-home to ~30kCAD for MSc students and ~32kCAD for PhD students (not counting additional support from Vector, or your advisor).
Natural gradient isn't just for optimization; it can also improve uncertainty modeling in variational Bayesian neural nets. Train a matrix-variate Gaussian posterior using noisy K-FAC! New paper by
@jelly__zhang
and Shengyang Sun at the BDL Workshop.
Google has a significant fraction of the world's top AI talent, and yet Gmail has recently been marking as spam nearly every email from the undergraduates in my ML course.
It sometimes even spam-filters emails from my grad students, or replies to messages I sent.
@thegautamkamath
I bet if you could get some Chinese media (e.g. Synced
@SyncedTech
) to post your course on WeChat, you'd get way more fans and views.