Finally launched!
The mathematics of deep learning is profound, beautiful, and unreasonably effective. Developing the "theory of everything" for large neural networks will be central to taking AI to the next level. Conversely, this AI will enable everyone
Since folks are asking:
The books I mentioned on @xai spaces are "Linear Algebra Done Right" by Axler and "Naive Set Theory" by Halmos. Other math books that I really enjoyed over the years:
"Introduction to Algorithms" by Thomas H. Cormen & Charles E. Leiserson & Ronald L.
Grok LFG🚀🚀🚀
Last few weeks been some of the best time of my life, fr fr
When a small, motivated group of world class people all push in the same direction, they punch way above their weight. I really did not appreciate this enough a year ago, but now
You asked for it...a dump of my book collection, in rough chronological order
(1/2)
"Naive Set Theory" - Paul R Halmos
"Linear Algebra Done Right Second Edition" - Sheldon Axler
"Mixing Secrets for the Small Studio" - Mike Senior
"Introduction to Algorithms, Third Edition" -
Nontrivial ∞-width neural nets are either kernel machines or feature learners. The latter's scaling makes optimal hyperparams invariant to width.
What if depth→∞ as well?
🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth!
But GPT flawed 🧵
These are some of the UI features in Grok. First, it allows you to multi-task. You can run several concurrent conversations and switch between them as they progress.
1/ You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).
But what if I tell you…
…you *can* tune its HPs on a single GPU thanks to new theoretical advances?
paper
code
blog
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
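A quick way to see this for yourself — a minimal numpy sketch (my illustration, not from the original thread): sample a large symmetric Gaussian matrix and compare the eigenvalue histogram against the semicircle density.

```python
# Illustration (mine, not the thread's): eigenvalues of a random symmetric
# matrix vs. the semicircle density sqrt(4 - x^2) / (2*pi) on [-2, 2].
import numpy as np
import matplotlib.pyplot as plt

n = 2000
G = np.random.randn(n, n)
A = (G + G.T) / np.sqrt(2 * n)          # symmetric; off-diag entries have variance 1/n
eigs = np.linalg.eigvalsh(A)

x = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalues")
plt.plot(x, np.sqrt(4 - x**2) / (2 * np.pi), label="semicircle law")
plt.legend()
plt.show()
```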
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.
Excellent
So my trick for reading and grokking all the foundational textbooks intently is...
Anki flash cards
...ie spaced repetition. Works really well for knowledge you know you will need in the future
I took a leave from college (aside from DJing) just to crawl libgen and read textbooks cover to cover, making Anki flash cards to retain that knowledge. Absolutely one of the best periods of my life because you can feel the rapid self-improvement. Taking classes in school in
1/ μP is the optimal scaling rule for learning rate & init as network width → ∞. Been confused?
🆕 μP = holding the "natural" (I'll explain) operator norm constant for every weight W & its updates ΔW:
μP <=> ‖W‖_nat = Θ(1) = ‖ΔW‖_nat.
🆕 Frobenius norm is the wrong norm to measure!
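To make the scaling concrete, here's a toy numerical check under one assumption of mine: I take the "natural" norm of a fan_in→fan_out matrix to be its spectral norm times sqrt(fan_in/fan_out) — see the paper for the authoritative definition.

```python
# Toy check, assuming "natural norm" = spectral norm * sqrt(fan_in / fan_out)
# (the precise definition is in the paper; this is just the scaling intuition).
import numpy as np

def natural_norm(W):
    fan_out, fan_in = W.shape
    return np.linalg.norm(W, 2) * np.sqrt(fan_in / fan_out)

for width in [128, 512, 2048, 8192]:
    W = np.random.randn(width, width) / np.sqrt(width)   # muP-style hidden-layer init
    print(width, natural_norm(W))                        # stays Theta(1) as width grows
```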
Do models need to reason in words to benefit from chain-of-thought tokens?
In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper or the code. The proof has two parts…
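If you just want to see the Gaussianity numerically, here is a minimal sketch (my own, not the linked code): fix an input, resample a wide MLP's init many times, and look at the output distribution.

```python
# Minimal empirical check (mine): outputs of a wide random MLP at a fixed
# input, across random inits, are approximately Gaussian.
import numpy as np

width, trials, d = 4096, 2000, 10
x = np.random.randn(d)
outs = []
for _ in range(trials):
    W1 = np.random.randn(width, d) / np.sqrt(d)
    W2 = np.random.randn(width) / np.sqrt(width)
    outs.append(W2 @ np.tanh(W1 @ x))
outs = np.array(outs)
print(outs.mean(), outs.std())   # mean ~ 0; a histogram of `outs` looks Gaussian
```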
Training a neural network (NN) can suffer from bad local minima. But as the NN gets wider, its optimization landscape in *function space* converges & becomes convex; when width=∞, this convex landscape is described by Neural Tangent Kernel.
Looking for top engineers and designers passionate about harnessing our AI capabilities to create never-before-seen consumer products.
🛼 come roll w us!
Serious mathematics underlies the feature learning limit of wide neural networks, which made it possible to tune large models by tuning small ones. I'll be explaining this on Wednesday on . Sign up here!
(2/2)
"Additive Combinatorics (Cambridge Studies in Advanced Mathematics)" - Terence Tao
"Lie Groups: An Approach Through Invariants and Representations (Universitext)" - Claudio Procesi
"Algebraic Geometry in Coding Theory and Cryptography" - HARALD NIEDERREITER & CHAOPING XING
1/ Crazy exp: take Resnet embedding of Imagenet as dataset A. Train linear predictor on A; get accuracy p. Now make fake dataset B = a mixture of Gaussians w/ same class mean & covariance as A. Train linear predictor on B => get *SAME ACCURACY* p. WTF
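Schematically, the experiment looks like this (a sketch with stand-in data, not the actual ResNet/ImageNet embeddings; `gaussian_clone` is my hypothetical helper name):

```python
# Sketch of the experiment with stand-in data (not the real embeddings).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def gaussian_clone(X, y):
    """Dataset B: resample each class from a Gaussian with that class's mean & covariance."""
    Xb = np.empty_like(X)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu, cov = X[idx].mean(0), np.cov(X[idx].T)
        Xb[idx] = np.random.multivariate_normal(mu, cov, size=len(idx))
    return Xb

X, y = make_classification(n_samples=4000, n_features=32, n_informative=16, n_classes=4)
acc_A = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)
acc_B = LogisticRegression(max_iter=2000).fit(gaussian_clone(X, y), y).score(X, y)
print(acc_A, acc_B)   # the claim: these match on real embedding data
```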
So essentially ML researcher=rapper🤣:
Paper=single
Book=album
Arxiv=SoundCloud
Elsevier=Spotify
Blogpost=music video
universities/industrial labs=labels
Conference=music festival
Plenary speaker=headliner
ICML=coachella
Neurips=lollapalooza
ICLR=rolling loud
...
What else?🤣🤣
1/ You can't comb a ball of hair without having some hair sticking up -- this is known as the "Hairy Ball Theorem" (no joke). Since wind on earth is like (the projection of) hair on a ball, this theorem implies that there is always a place with no wind! Let me tell ya why ↓
I'm looking for a phd intern that will work with me on the theory of infinite size neural networks beyond width and applications to hyperparameter transfer and design of large scale neural networks. Email me at gregyang at Microsoft dot com with your CV and a blurb about yourself
@emollick
A single intelligence that has a real time pulse on what is happening and what could happen in the future. Many ways to sell this but it easily is a massive use case for biz. E.g. Bloomberg already resells X data at a big price tag and Grok will be this but on an entirely
Neural networks tend to Gaussian processes (GPs) as their widths tend to infinity --- now you can play with these GP kernels in @GoogleColab! Try out RNN-GP, GRU-GP, Transformer-GP, or Batchnorm-GP today!
Repo:
Colab Entry Point:
this team ships
if you love the product, understand the vision, and want the challenge of a lifetime, you should really consider working at X
some high impact positions
* client eng - ios / android / web
* infra eng - k8s / supercomputing / large distributed systems / network /
uv is very very fast. Super nice for when I ssh into a pod and just want to quickly install all dependencies and run some python script
good job @charliermarsh et al
1/ Existing theories of neural networks (NN) like NTK don't learn features, so they can't explain the success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrain such an ∞-width word2vec model: it learned semantics!
1/ I can't teach you how to dougie but I can teach you how to compute the Gaussian Process corresponding to infinite-width neural network of ANY architecture, feedforward or recurrent, eg: resnet, GRU, transformers, etc ... RT plz💪
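For a plain ReLU MLP the infinite-width GP kernel has a closed-form recursion (the classic arccosine-kernel computation; this sketch is mine and doesn't cover the general architectures the paper handles):

```python
# NNGP kernel of an infinite-width ReLU MLP at random init (my sketch of the
# standard arccosine-kernel recursion; the paper generalizes far beyond this).
import numpy as np

def nngp_relu(x1, x2, depth=3, sw2=2.0, sb2=0.0):
    k11 = sw2 * x1 @ x1 / len(x1) + sb2
    k22 = sw2 * x2 @ x2 / len(x2) + sb2
    k12 = sw2 * x1 @ x2 / len(x1) + sb2
    for _ in range(depth):
        c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(c)
        # E[relu(u) relu(v)] for (u, v) jointly Gaussian with the current covariance:
        k12 = sw2 * np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * c) / (2 * np.pi) + sb2
        k11 = sw2 * k11 / 2 + sb2          # E[relu(u)^2] = k11 / 2
        k22 = sw2 * k22 / 2 + sb2
    return k12

x1, x2 = np.random.randn(10), np.random.randn(10)
print(nngp_relu(x1, x2))
```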
Talagrand has an aptitude for distilling complex insights into clear and tasty bites in his research and exposition. I've benefited immensely from applying his most famous inequality but even more so from his books, which are written with wit and lucidity matched by none:
"Upper and Lower Bounds for Stochastic Processes"
Michel Talagrand has been awarded the Abel Prize, one of the highest honors in mathematics, for applying tools from high-dimensional geometry to complex probability problems.
@jordanacep reports:
I am delighted to announce publication of the 4th edition of Linear Algebra Done Right as an Open Access book. The electronic version is legally free to the world at .
That website also has links to pre-order the print version of the book.
#linearalgebra
1/ Does batchnorm make the optimization landscape more smooth? A prior paper says yes, but our new @iclr2019 paper shows BN causes grad explosion in randomly initialized deep BN nets. Contradiction? We clarify below
1/ How to scale hyperparams (eg learning rate) as neural network gets wider? Esp w/ adaptive optimizers like Adam?
I derived the answer (μP) in 2020 & verified it on GPT3
This required some beautiful new math that’s just been completely written down w/ @EtaiLittwin 🧵👇
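My shorthand for the resulting Adam rules, per weight tensor (a sketch from memory of the μTransfer paper's scaling table; use the official `mup` package rather than this for real work):

```python
# muP scaling for Adam (my paraphrase of the paper's table; a sketch,
# not the authoritative implementation).
def mup_adam_scaling(base_lr, fan_in, kind):
    """Return (init_std, adam_lr). kind: 'input' (fan_in fixed as width grows),
    'hidden' (fan_in = width), or 'output' (fan_out fixed as width grows)."""
    if kind == "input":
        return fan_in ** -0.5, base_lr
    if kind == "hidden":
        return fan_in ** -0.5, base_lr / fan_in
    if kind == "output":
        return 1.0 / fan_in, base_lr / fan_in
    raise ValueError(kind)

# e.g. a hidden layer at width 4096:
print(mup_adam_scaling(1e-3, 4096, "hidden"))
```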
We deepen a mysterious connection btw #topology & #learning in a new paper appearing in Advances in Applied Mathematics. Somehow, the # of samples needed to learn = the highest dimension of holes in some topological space! I wrote the paper but I'm still like WTF
1/ Gradients improve weights, so they better depend on the weights, right? Somehow, for calculating e.g. grad norm or NTK at init, grads might as well be backproped by random weights, independent from those used in forward pass. WTF? Let me explain (from )
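A one-layer caricature of the claim (my toy check, not from the paper): backprop an upstream gradient through an independent copy of the forward matrix, and the gradient norm comes out the same.

```python
# Toy check (mine): at init, backpropagating an upstream gradient g through
# W^T vs. through an independent copy V^T gives nearly identical norms.
import numpy as np

n = 8192
g = np.random.randn(n)                      # upstream gradient
W = np.random.randn(n, n) / np.sqrt(n)      # forward weights
V = np.random.randn(n, n) / np.sqrt(n)      # independent random weights
print(np.linalg.norm(W.T @ g), np.linalg.norm(V.T @ g))   # ≈ equal for large n
```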
Really cool library giving a type system to different kinds of matrices to speed up linear algebra! I really resonate with this because a key idea of Tensor Programs is that different (random) matrices have entirely different "types" (like random init vs gradients); if you track
We're ecstatic to officially announce our new library, CoLA! CoLA is a framework for large-scale linear algebra in machine learning and beyond, supporting PyTorch and JAX.
repo:
paper:
w/ amazing @m_finzi, Andres Potapczynski, Geoff Pleiss
1/8 Modern deep networks (with conv, (self-)attention, batchnorm, LSTM, etc) become Gaussian Processes when randomly initialized, as their widths grow to infinity. This and more are shown in my new paper. SOTA GPs here we come, @Jasch?
1/ The nonzero singular values histogram of a large square random matrix looks like a "quarter circle", sticking to the y-axis. However, if the sides are not equal, then the histogram "buds off" from the y-axis. In any case, we still can calculate the asymptotic shape of it!
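Numerically, the square case looks like this (my illustration, not from the thread):

```python
# Illustration (mine): singular values of a square Gaussian matrix with entry
# variance 1/n follow the quarter-circle density sqrt(4 - s^2) / pi on [0, 2].
import numpy as np
import matplotlib.pyplot as plt

n = 2000
W = np.random.randn(n, n) / np.sqrt(n)
s = np.linalg.svd(W, compute_uv=False)

grid = np.linspace(0, 2, 400)
plt.hist(s, bins=60, density=True, alpha=0.5, label="singular values")
plt.plot(grid, np.sqrt(4 - grid**2) / np.pi, label="quarter circle")
plt.legend()
plt.show()
# For a rectangular m x n matrix, the histogram instead "buds off" the y-axis,
# following the Marchenko-Pastur shape.
```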
You can now train your own feature learning infinite-width neural networks on word2vec and metalearning (w/ MAML) ! Our paper "Feature Learning in Infinite-Width Neural Networks" will also appear in ICML 2021. Cya there!
@edwardjhu
My cliff notes vers of holography <-> quantum error correction:
Holography says information in interior of universe can be recovered from info on its boundary, just like a message can be recovered from its encoding by error correcting code.
Actual error correction comes from
1/ It's exciting when an "applied" area feeds back to pure math. e.g. Witten's new proof of Positive Energy Thm by physics won him a Fields Medal. A reason I'm rly hyped about Tensor Programs: new proof of Semicircle Law by "neural network arguments"
GUYS DID YOU KNOW THE RED WEDDING OF #GoT HAPPENED IN FRANCE IN 1572?
The Catholic queen mother forced her daughter to marry a protestant prince, invited all the protestant nobility to the wedding in predominantly Catholic Paris, then bodied them all.
Aka St. Bartholomew's Day Massacre
RNNs and batchnorm will be coming soon, but you can already play with them here. The general theory for this is based on Tensor Programs
Give Neural Tangents a try and let us know what you think!
Announcing Neural Tangents, a new easy-to-use, open-source neural network library that enables researchers to build finite- and infinite-width versions of neural networks simultaneously. Grab the code and try it for yourself at
1/ In a neural network, activation vectors depend on the weight matrices in really complex, nonlinear ways. New paper: the activations are "independent" from the weights in a randomly initialized wide NN of any architecture! WTF!!
Our society is built on trust:
* human-human trust
* human-organization trust
* human-machine trust
It would run so slowly without trust.
It takes years to build but an instant to destroy.
We really need to treasure it!
What an incredible journey it's been over these 5+ yrs @MSFTResearch. I still remember the eureka moments, in the serenity of Building 99 past midnight, leading to Tensor Programs & μP. Forever grateful for MSR taking a chance on a kid straight out of undergrad.
1/ Neural network (NN) parametrization is super important folks!! The wrong param -- e.g. NTK, or, in fact, the pytorch/tensorflow defaults -- can make you diverge or prevent you from learning features in wide NNs!
1/ I reveal the evolution under gradient descent of neural network of *any architecture*, by showing how to compute its tangent kernel (NTK). This includes RNN, transformer, resnet, GANs, Faster RCNN, and more! Let's have theory catch up to practice!
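For finite networks you can compute the tangent kernel directly from autograd (a minimal PyTorch sketch of the definition, not the paper's infinite-width machinery):

```python
# Empirical NTK of a finite model (my sketch):
# K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>.
import torch

def empirical_ntk(model, x1, x2):
    def grad_vec(x):
        model.zero_grad()
        model(x).sum().backward()
        return torch.cat([p.grad.flatten().clone() for p in model.parameters()])
    return grad_vec(x1) @ grad_vec(x2)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
)
print(empirical_ntk(model, torch.randn(1, 10), torch.randn(1, 10)))
```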
1/2 How can physics and ML inform each other? We hope to find out at the Physics ∩ ML workshop @MSFTResearch commencing tomorrow!
Feat. awesome folks like Fields medalist Mike Freedman, Rumelhart prize winner Paul Smolensky, Sackler prize winner Mike Douglas
1/ Neural networks evolve like linear models just because of 1st order Taylor expansion -- the key intuition behind #NeuralTangentKernels. What's nontrivial & surprising: a *wide* NN can fit *any* data without moving params so much as to break the approximation of the Taylor expansion.
Infinitely-wide recurrent networks (i.e. the RNN Neural Tangent Kernel) are good at time series prediction with low data, who'd've thought! Such a calculation with an infinite-width RNN wouldn't have been possible without Tensor Programs!
1/ A ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so is a GP at *any time* during training. With Tensor Programs, we can calculate this time-evolving GP w/o training any NN
1/4 Batchnorm causes grad explosion in random-init MLP! Can’t fix this by changing nonlinearities! Relu+batchnorm explodes grad norm^2 by >=1.47 per layer, but linear activation minimizes the explosion rate at (B-2)/(B-3), B=batchsize. Our ICLR 2019 paper
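You can see the effect in a few lines (my toy check, not the paper's experiment): stack Linear+BN+ReLU blocks at random init and watch the input gradient grow with depth.

```python
# Toy check (mine): input gradients blow up with depth in a random-init
# batchnorm MLP; compare depth = 5 vs depth = 50.
import torch

def input_grad_norm(depth, width=256, batch=64):
    layers = []
    for _ in range(depth):
        layers += [torch.nn.Linear(width, width),
                   torch.nn.BatchNorm1d(width),
                   torch.nn.ReLU()]
    net = torch.nn.Sequential(*layers)
    x = torch.randn(batch, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

print(input_grad_norm(5), input_grad_norm(50))   # grows rapidly with depth
```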
Learnability (VC dim) is a *topological property*, as I proved in for parity, conjunctions, poly threshold fctns. Now this extends to downward-closed classes, conjunction of parities, and k-CNFs, as well! Just how far does this go?
Often, VC dimension of a concept class (“how many samples needed to learn a pattern?”) in #learning theory can be recovered from the *#algebraic #topology* of the class (“What are the holes in this topological space?”). Beautiful and mysterious phenomenon!
1/ Neural networks are Gaussian Processes --- the Poster Edition from #NeurIPS2019 last week. In case you missed it, here’s a twitter version of the poster presentation, following the format of @colinraffel; and here’s the previous tweet thread