The Simons Institute recently held a great workshop on LLMs -- lots of informative talks on a range of topics including studies of in-context learning, localization in LLMs, understanding emergent behavior, and many others.
Our ICML 2018 paper on training ultra-deep (10k+ layers) CNNs is now up, from work we've done at Google Brain: . We examine the relationship of trainability to signal propagation and Jacobian conditioning in networks with convolutional layers, ... (1/3)
In our preprint “The large learning rate phase of deep learning: the catapult mechanism”, we show that the choice of learning rate (LR) in (S)GD separates deep neural net dynamics into two sharply distinct types (or "phases", in the physics sense). (1/n)
The time is ripe for a move towards publishing notebooks as standard practice in scientific research. And indeed, both Mathematica (which still amazes me) and Jupyter are elegantly designed and satisfying tools to use.
Consider submitting to our ICML 2021 workshop:
Overparameterization: Pitfalls & Opportunities
focused specifically on the role of overparameterization in machine learning. Organized by @HanieSedghi, @QuanquanGu, @aminkarbasi, & myself.
Deadline: June 21
The schedule and papers for our ICML workshop "Theoretical Physics for Deep Learning" on Friday are now updated. We are looking forward to the discussions and hope you find the workshop fruitful!
Researchers may have caught a glimpse inside the black box of artificial neural networks by establishing their mathematical equivalence with older algorithms called kernel machines.
@Anilananth reports:
The Nobel prize for @giorgioparisi seems to be a perfect excuse to announce the summer school we organize in July 2022 in Les Houches, where statistical physics applied to understanding machine learning will be at the centre of attention.
"Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent." This seems to hold up surprisingly well for more complex, non-vanilla models studied so far. The dynamics is exact in the infinite width limit (in a certain regime) ...
Listen to a diverse line-up of speakers and panelists at the "Conceptual Understanding of Deep Learning" workshop (Google-organized) this Mon May 17: @rinapy
Will be live here:
#ConceptualDLWorkshop
Large Language Models and Transformers workshop this week!
Registration for in-person attendance is now closed, but you can still join us for the livestream:
Workshop schedule:
I highly recommend this curriculum for learning about recent work in mean-field theory / random matrix theory in DL. @vinayramasesh, @rileyfedmunds, and Piyush Patil did an exceptional job dissecting these papers and creating a guide that is both pedagogical and thorough.
We are excited to debut Resurrecting the Sigmoid! It’s our 8th curriculum and was part of the DFL Jane Street Fellowship, with fellows @vinayramasesh, @rileyfedmunds, and Piyush Patil. Check it out → .
(1/2) Looking forward to two exciting workshops tomorrow! I'll be on a panel at the Science Meets Engineering of DL workshop in the morning as well as speaking at the ML and Physical Sciences workshop in the afternoon ...
Welcome to the Deep Learning from the Perspective of Physics and Neuroscience (#deeplearning23) program at #KITP!
Nov 13, 2023 - Dec 22, 2023
Find more information at
Watch recorded talks at
Much appreciation to @zdeborova & @KrzakalaF for a brilliantly organized school! Many stimulating discussions during my time there & I enjoyed the other lecturers' stellar, insightful lectures. It was also personally rewarding & gratifying to participate as a lecturer & teach. 1/2
Our ICML workshop "Overparameterization: Pitfalls & Opportunities" will take place this Saturday. The updated schedule and accepted papers can be found at
@QuanquanGu @HanieSedghi @yasamanbb
The kind of sight that would be familiar from an experimental physics lab. IBM brought a prototype of their quantum computer to NIPS. (Most of what you see here is the cryogenics/dilution fridge.)
(1/3) Couldn't agree more. I get a lot more satisfaction from textbooks and have been on the lookout for great ones in new areas I want to learn since graduating.
A couple favorites from early coursework:
Intro Classical Mechanics -- Kleppner & Kolenkow
Discovering (paradoxically late in life) that I get more out of textbooks than books and that I don’t have to stop buying them just because I’m out of school. Good reading list pointers:
A spatially non-uniform kernel allows more modes of signal propagation in deep networks. Based on this, we suggest an initialization scheme which allows us to train plain CNNs with up to 10,000 layers. (3/3)
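Roughly, the scheme places all of each kernel's weight at its spatial center and makes the channel-mixing matrix there orthogonal. Below is a minimal numpy sketch of that kind of initializer; the function name, the square-channel simplification, and the defaults are mine for illustration, not the paper's code.

```python
import numpy as np

def delta_orthogonal_kernel(ksize, channels, gain=1.0, rng=None):
    """Conv kernel that is zero everywhere except its spatial center,
    which holds a (scaled) random orthogonal channel-mixing matrix."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.standard_normal((channels, channels))
    q, r = np.linalg.qr(a)            # QR of a Gaussian matrix gives an orthogonal Q
    q *= np.sign(np.diag(r))          # sign fix so Q is uniformly (Haar) distributed
    w = np.zeros((ksize, ksize, channels, channels))  # (height, width, in, out) layout
    w[ksize // 2, ksize // 2] = gain * q
    return w

w = delta_orthogonal_kernel(ksize=3, channels=64)
print(w.shape, np.allclose(w[1, 1].T @ w[1, 1], np.eye(64)))  # center slice is orthogonal
```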
An excellent point, and one of a number of reasons why I think it is both natural and valuable to have a part of machine learning theory be theory "in the style of theoretical physics." (Crucially, I don't mean restricted to or artificially connected to existing phenomena .....
Should we be trying to prove anything at all? While it seems much ML theory comes via physicists, *the physicist in the Deep Phenomena audience* characterizes physics as a "purely empirical" discipline and questions whether we should try to prove anything in ML.
#icml2019
Hear @yasamanbb talk next Wednesday at 12pm EDT about the effect of large learning rate on the training of wide neural networks! What is beyond the neural tangent kernel? Only at Physics ∩ ML!
Sign up for the mailing list here!
@shoyer Not an "intuitive" (first) explanation, but there's a way to cast it using Renormalization Group that I find pretty cool. (Roughly) Consider the N random variables as leaves of a tree and consider the iterative coarse-graining process of transforming pairs of leaves ....
I'm really looking forward to digging through this. I've always enjoyed the writing by these authors for the clarity of thought and focus on the important stuff.
For those who haven't seen:
The "test of time" award talk by Ali Rahimi was so refreshing. As a relative newcomer to ML, the alchemy has only added to my culture shock. Thank you to the awardees for giving that talk.
A kind thanks to the organizers for the opportunity to speak at DeepMath last week. It was a lot of fun to interact with and communicate to a new audience.
DeepMath 2019 was a great success. Many speakers told us that they enjoyed it and loved the fact that it brought together different theoretical approaches from physics, math, neuroscience, engineering, and computer science. DeepMath 2020 is already being planned. Stay tuned
@deepmath1
@shoyer (e.g. X1, X2) into random variables that are the sum (X1 + X2) and difference (X1 - X2), and marginalize out the difference. This generates a recursion relation / flow amongst distributions for the RV which is the sum, whose (~unique) fixed point is the Gaussian distribution.
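If a numerical check helps, here is a toy version of this flow (my own code; I've added the standard 1/√2 normalization of the sum, which the rough description above glosses over, so that the variance stays fixed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Start far from Gaussian: 2**k i.i.d. uniform variables per draw (excess kurtosis -1.2).
k, n_draws = 6, 100_000
x = rng.uniform(-1.0, 1.0, size=(n_draws, 2**k))

# Iteratively coarse-grain: replace each pair (X1, X2) by the (normalized) sum
# and marginalize out the difference by simply dropping it.
for step in range(k):
    x = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2.0)
    kurt = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0
    print(f"step {step + 1}: excess kurtosis = {kurt:+.4f}")
# The excess kurtosis flows toward 0, i.e. toward the Gaussian fixed point of the recursion.
```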
extending a line of earlier work using mean field theory. The spatial distribution of convolutional kernels appears to play an important role: CNNs initialized with spatially uniform conv kernels perform like fully-connected networks at large depths. ... (2/3)
Attending #NIPS2017 (we have 2 papers in Sat workshops) and would love to chat research -- ML and physics -- with other researchers; if so, please message me!
This paper looks at the early learning period in neural networks with some interesting experiments. They examine final test performance when corrupted training data is used in the early stages of learning (until epoch T) and ...
(2/3) Principles of Quantum Mechanics -- R. Shankar
Statistical Mechanics/Field Theory: books by Frederick Reif; Mehran Kardar (2x)
Principles of Math. Analysis: Rudin
Algebra: Michael Artin
Modern Condensed Matter Theory: A. Altland & B. Simons
while corruption in high-level statistics does not as much. (2) Using Gaussian noise as the input leads to a critical period but doesn't do as much damage as blurring! (3) Deeper networks experience greater damage. (See paper for comments re learning rate in experiments.)
the max LR for this.) We find that the best generalizing networks are often obtained when the LR is chosen to lie in the catapult phase. So, we think the existence & properties of the phases are consequential for generalization & LR tuning.
A historical look at the evolution of a subfield: . The naming "solid-state" brought in industrial physicists and applied questions; later "condensed matter" unified the field and emphasized the fundamentality of its scope. "Physics is what physicists decide it is..."
I think there is also a discussion to be had about the ways in which machine learning science might qualify as both a natural science and a design science, and how the latter would make such theory different from traditional theoretical physics.
An interesting thread. This was intended for another discipline (economics) but got me to reflect on how this manifests in other fields. In deep learning (specifically), it seems there is work on both types ("a model of X," by which I mean a single model which has widespread
A critical error that I see many grad students make: they try to estimate Frankenstein's model. Rather than viewing a model as answering a research question, they view a model as an arbitrary hodgepodge of models they learned about in their classes.
Among other things, the solvable models illustrate the dynamical mechanism underlying the catapult phase &, we hope, point towards what a more complete theory of NN dynamics would look like.
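If you'd like to see the mechanism without opening the paper, here's a rough numerical toy in the spirit of those solvable models: a two-layer linear net trained by GD on a single example. The width, data, and learning rates below are my own illustrative choices, not the paper's exact setup.

```python
import numpy as np

def run(lr, n=1000, steps=40, seed=0):
    """GD on a 2-layer linear net f = (v @ u) * x / sqrt(n), trained on one example (x, y)."""
    rng = np.random.default_rng(seed)
    x, y = 1.0, 1.0
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    for t in range(steps):
        f = (v @ u) * x / np.sqrt(n)
        err = f - y
        kernel = x**2 * (u @ u + v @ v) / n   # empirical tangent kernel for this example
        if t % 5 == 0:
            print(f"lr={lr:.2f} step={t:2d} loss={0.5 * err**2:12.3f} kernel={kernel:.3f}")
        # simultaneous GD update of both layers
        u, v = u - lr * err * x * v / np.sqrt(n), v - lr * err * x * u / np.sqrt(n)

run(lr=0.5)  # small LR: loss decreases monotonically, kernel barely moves ("lazy")
run(lr=1.5)  # large LR: loss grows at first, then the kernel drops and the loss comes down ("catapult")
```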
(3/3) ...and *many* others. Better to pick one subject and learn it well than dabble in too many. Great textbooks teach you the things that you remember and build on for years.
1/2 How can physics and ML inform each other? We hope to find out at the Physics ∩ ML workshop @MSFTResearch, commencing tomorrow!
Feat. awesome folks like Fields medalist Mike Freedman, Rumelhart prize winner Paul Smolensky, Sackler prize winner Mike Douglas
@michael_nielsen One of his tour de force applications, imo, is when he uses it to solve the Kondo problem. (I believe it's here, though I don't have access!)
Indeed a unique set of lectures that was very thoughtfully curated by the organizers in and between disciplines. It's the most inter- & multi-disciplinary program I've been a part of, a challenge to accomplish well. Highly recommend!
This work is with Aitor Lewkowycz, @ethansdyer, @jaschasd, and @guygr. We predict these two types of dynamics (distinguished by choosing LR below or above a specific threshold) theoretically in a class of NNs with solvable dynamics. Empirically, the two phases are ...
We consider a classic framework used by theorists to study quantum systems & quantum materials — Hartree-Fock mean field theory — and we ask whether LLMs can work through the steps of such calculations for real research problems. Crucially, we draw these calculations from actual
@BlackHC We pointed this out in , see Fig. 1 plots on the diagonal where all curves have ~the same exponent. (Apologies that this involves advertising one's own paper!) Most large-scale vision / NLP models tend to not be in this regime where you get ...
leads to flatter minima because the curvature decreases by a large amount. The distinction between the two regimes -- lazy and catapult -- becomes sharper as the networks get wider. (At yet larger LRs beyond the catapult phase, GD diverges. Our theoretical model predicts ...
observable across many settings varying architecture and training protocols -- the phenomena here are rather universal. Our notions of "small", "large" (and "divergent") LR regimes are all based on a simple measurement at initialization (!). (cont)
The small LR phase (the "lazy phase") is related to existing wide network theory (the Neural Tangent Kernel result). The large LR phase needs a different starting point for its theoretical description. We termed this phase ...
the "catapult phase" because of the nature of the dynamics that occurs. Signatures of the catapult phase include a growth in the loss early in training, before decreasing again, paired with a simultaneous decrease in local curvature. Ultimately, training in the catapult phase ...
@skornblith I agree. I think perhaps a more natural way to get distributions over representations, rather than deterministic mappings, is to instead consider ensembles of networks, where for instance the ensemble is over the random initialization.
In the latter case, looking at deep linear networks disentangles the effect of expressivity from potential acceleration. Key observation: the dynamics induced on the end-to-end map of the whole network is gradient descent with a particular preconditioner.
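For concreteness, the statement I have in mind (writing it from memory, and under the usual "balanced initialization" assumption on the factors): for a depth-N linear net with end-to-end matrix $W_e = W_N \cdots W_1$, gradient flow on the individual factors induces on $W_e$ the preconditioned flow

```latex
\dot{W}_e \;=\; -\sum_{j=1}^{N}
\left[ W_e W_e^{\top} \right]^{\frac{j-1}{N}}
\, \nabla_{W_e} L(W_e) \,
\left[ W_e^{\top} W_e \right]^{\frac{N-j}{N}} ,
```

i.e. gradient descent on the end-to-end map, preconditioned by fractional powers of $W_e W_e^{\top}$ and $W_e^{\top} W_e$ (at $N = 1$ this reduces to plain gradient flow).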
This was a great collaboration between Google DeepMind, Google Research, Harvard Physics/SEAS, & Cornell Physics (in particular noting first author Haining Pan, Michael Brenner, & Eun-Ah Kim @eunahkim).
The small scale atmosphere made it possible to probe and engage with the material and speakers more deeply than I've experienced in any pure ML program before. (Only in physics programs prior.) Thanks to Umesh Vazirani for creating & facilitating this workshop atmosphere.
I interpret "slow science" to mean science which relies on some good thoughts (can be, but not necessarily, slow). [Let's say "good" involves some hard thinking or a decent amount of sweating.] "It deserves revival and needs protection."
training on uncorrupted data is resumed after time T. Depending on the nature of data corruption and how long it lasts, this can lead to "irreversible damage" in the final performance. Some findings: (1) corruption in low-level statistics does lead to these "critical" periods ...
Among interesting findings I’d highlight two in particular. (i) We can categorize calculation steps by whether the result appears in the paper explicitly (since intermediate steps are often missing). We find the performance is fairly insensitive to this
@KordingLab I recall this paper has some nice analysis on the origins of forgetting and the relationship between task overlap and overwriting of learned parameters in linear models. Re your question though: why should there be no forgetting in the NTK limit? ...
and so it's interesting to pinpoint the extent of applicability at finite width. This paper is related to the nice work "Neural Tangent Kernel," , which solved the dynamics of gradient descent in the same limit in function space. We inquire about the
@Reza_Zadeh They’re also great for putting human behavior in perspective :) From many years of watching primate documentaries, my by-far favorite way of archetyping human behavior is to find the analogy in primates. "Oh, you're *that* chimp...!"
research papers, on materials that are currently being intensely studied. This extraction process (from research papers to executable prompts) is itself non-trivial — requiring combined human-AI expertise through the design of templates — since many papers lack key parts of
dynamics in parameter space, which can be obtained through a linearization of the function at initialization. With @hoonkp, @Locchiu, @sschoenholz, @jaschasd, and Jeffrey Pennington.
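For reference, the linearization meant here is just the first-order Taylor expansion of the network function in its parameters around initialization (my shorthand below; see the paper for the precise statements and conditions):

```latex
f_{\mathrm{lin}}(x;\theta) \;=\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0) ,
```

and under gradient descent on this linear model with squared loss, the function-space dynamics are governed by the empirical tangent kernel at initialization, $\hat{\Theta}_0(x, x') = \nabla_\theta f(x;\theta_0)^{\top} \nabla_\theta f(x';\theta_0)$; the claim is that sufficiently wide networks track these dynamics.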