Finally launched!
The mathematics of deep learning is profound, beautiful, and unreasonably effective. Developing the "theory of everything" for large neural networks will be central to taking AI to the next level. Conversely, this AI will enable everyone
Since folks are asking:
The books I mentioned on @xai spaces are "Linear Algebra Done Right" by Axler and "Naive Set Theory" by Halmos. Other math books that I really enjoyed over the years:
"Introduction to Algorithms" by Thomas H. Cormen & Charles E. Leiserson & Ronald L.
Grok LFG🚀🚀🚀
Last few weeks been some of the best time of my life, fr fr
When a small, motivated group of world class people all push in the same direction, they punch way above their weight. I really did not appreciate this enough a year ago, but now
You asked for it...a dump of my book collection, in rough chronological order
(1/2)
"Naive Set Theory" - Paul R Halmos
"Linear Algebra Done Right Second Edition" - Sheldon Axler
"Mixing Secrets for the Small Studio" - Mike Senior
"Introduction to Algorithms, Third Edition" -
Nontrivial ∞-width neural nets are either kernel machines or feature learners. The latter's scaling makes optimal hyperparams invariant to width.
What if depth→∞ as well?
🆕 Feature diversity is key; maxed out by abs (not relu); gives invariance to depth!
But GPT flawed 🧵
These are some of the UI features in Grok. First, it allows you to multi-task. You can run several concurrent conversations and switch between them as they progress.
1/ You can't train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).
But what if I tell you…
…you *can* tune its HPs on a single GPU thanks to new theoretical advances?
paper
code
blog
1/ The histogram of eigenvals in a large random symmetric matrix ≈ a semicircle!! So sick! This "Semicircle Law" is essentially "Central Limit" for rand symmetric mats (even more elegant bc u knew what a semicircle is by 1st grade, but wtf was a Gaussian?). Let me tell ya why
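A quick way to see this for yourself — a minimal numpy sketch (my illustration, not from the original thread): sample a large symmetric Gaussian matrix and compare the eigenvalue histogram against the semicircle density.

```python
# Illustration (mine, not the thread's): eigenvalues of a random symmetric
# matrix vs. the semicircle density sqrt(4 - x^2) / (2*pi) on [-2, 2].
import numpy as np
import matplotlib.pyplot as plt

n = 2000
G = np.random.randn(n, n)
A = (G + G.T) / np.sqrt(2 * n)          # symmetric; off-diag entries have variance 1/n
eigs = np.linalg.eigvalsh(A)

x = np.linspace(-2, 2, 400)
plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalues")
plt.plot(x, np.sqrt(4 - x**2) / (2 * np.pi), label="semicircle law")
plt.legend()
plt.show()
```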
This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days.
Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months.
Excellent
So my trick for reading and grokking all the foundational textbooks intently is...
Anki flash cards
...ie spaced repetition. Works really well for knowledge you know you will need in the future
I took a leave from college (aside from DJing) just to crawl libgen and read textbooks cover to cover, making Anki flash cards to retain that knowledge. Absolutely one of the best periods of my life because you can feel the rapid self-improvement. Taking classes in school in
1/ μP is the optimal scaling rule for learning rate & init as network width → ∞. Been confused?
🆕 μP = holding the "natural" (I'll explain) operator norm constant for every weight W & its updates ΔW:
μP <=> ‖W‖_nat = Θ(1) = ‖ΔW‖_nat.
🆕 Frobenius norm is the wrong norm to measure!
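To make the scaling concrete, here's a toy numerical check under one assumption of mine: I take the "natural" norm of a fan_in→fan_out matrix to be its spectral norm times sqrt(fan_in/fan_out) — see the paper for the authoritative definition.

```python
# Toy check, assuming "natural norm" = spectral norm * sqrt(fan_in / fan_out)
# (the precise definition is in the paper; this is just the scaling intuition).
import numpy as np

def natural_norm(W):
    fan_out, fan_in = W.shape
    return np.linalg.norm(W, 2) * np.sqrt(fan_in / fan_out)

for width in [128, 512, 2048, 8192]:
    W = np.random.randn(width, width) / np.sqrt(width)   # muP-style hidden-layer init
    print(width, natural_norm(W))                        # stays Theta(1) as width grows
```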
Do models need to reason in words to benefit from chain-of-thought tokens?
In our experiments, the answer is no! Models can perform on par with CoT using repeated '...' filler tokens.
This raises alignment concerns: Using filler, LMs can do hidden reasoning not visible in CoT🧵
1/ Why do wide, random neural networks form Gaussian processes, *regardless of architecture*? Let me give an overview in case you are too lazy to check out the paper or the code. The proof has two parts…
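If you just want to see the Gaussianity numerically, here is a minimal sketch (my own, not the linked code): fix an input, resample a wide MLP's init many times, and look at the output distribution.

```python
# Minimal empirical check (mine): outputs of a wide random MLP at a fixed
# input, across random inits, are approximately Gaussian.
import numpy as np

width, trials, d = 4096, 2000, 10
x = np.random.randn(d)
outs = []
for _ in range(trials):
    W1 = np.random.randn(width, d) / np.sqrt(d)
    W2 = np.random.randn(width) / np.sqrt(width)
    outs.append(W2 @ np.tanh(W1 @ x))
outs = np.array(outs)
print(outs.mean(), outs.std())   # mean ~ 0; a histogram of `outs` looks Gaussian
```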
Training a neural network (NN) can suffer from bad local minima. But as the NN gets wider, its optimization landscape in *function space* converges & becomes convex; when width=∞, this convex landscape is described by Neural Tangent Kernel.
Looking for top engineers and designers passionate about harnessing our AI capabilities to create never-before-seen consumer products.
🛼 come roll w us!
Serious mathematics underlies the feature learning limit of wide neural networks, which made it possible to tune large models by tuning small ones. I'll be explaining this on Wednesday on . Sign up here!
(2/2)
"Additive Combinatorics (Cambridge Studies in Advanced Mathematics)" - Terence Tao
"Lie Groups: An Approach Through Invariants and Representations (Universitext)" - Claudio Procesi
"Algebraic Geometry in Coding Theory and Cryptography" - HARALD NIEDERREITER & CHAOPING XING
1/ Crazy exp: take Resnet embedding of Imagenet as dataset A. Train linear predictor on A; get accuracy p. Now make fake dataset B = a mixture of Gaussians w/ same class mean & covariance as A. Train linear predictor on B => get *SAME ACCURACY* p. WTF
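Schematically, the experiment looks like this (a sketch with stand-in data, not the actual ResNet/ImageNet embeddings; `gaussian_clone` is my hypothetical helper name):

```python
# Sketch of the experiment with stand-in data (not the real embeddings).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def gaussian_clone(X, y):
    """Dataset B: resample each class from a Gaussian with that class's mean & covariance."""
    Xb = np.empty_like(X)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        mu, cov = X[idx].mean(0), np.cov(X[idx].T)
        Xb[idx] = np.random.multivariate_normal(mu, cov, size=len(idx))
    return Xb

X, y = make_classification(n_samples=4000, n_features=32, n_informative=16, n_classes=4)
acc_A = LogisticRegression(max_iter=2000).fit(X, y).score(X, y)
acc_B = LogisticRegression(max_iter=2000).fit(gaussian_clone(X, y), y).score(X, y)
print(acc_A, acc_B)   # the claim: these match on real embedding data
```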
So essentially ML researcher=rapper🤣:
Paper=single
Book=album
Arxiv=SoundCloud
Elsevier=Spotify
Blogpost=music video
universities/industrial labs=labels
Conference=music festival
Plenary speaker=headliner
ICML=coachella
Neurips=lollapalooza
ICLR=rolling loud
...
What else?🤣🤣
1/ You can't comb a ball of hair without having some hair sticking up -- this is known as the "Hairy Ball Theorem" (no joke). Since wind on earth is like (the projection of) hair on a ball, this theorem implies that there is always a place with no wind! Let me tell ya why ↓
I'm looking for a phd intern that will work with me on the theory of infinite size neural networks beyond width and applications to hyperparameter transfer and design of large scale neural networks. Email me at gregyang at Microsoft dot com with your CV and a blurb about yourself
@emollick
A single intelligence that has a real time pulse on what is happening and what could happen in the future. Many ways to sell this but it easily is a massive use case for biz. E.g. Bloomberg already resells X data at a big price tag and Grok will be this but on an entirely
Neural networks tend to Gaussian processes (GPs) as their widths tend to infinity --- now you can play with these GP kernels in @GoogleColab! Try out RNN-GP, GRU-GP, Transformer-GP, or Batchnorm-GP today!
Repo:
Colab Entry Point:
this team ships
if you love the product, understand the vision, and want the challenge of a lifetime, you should really consider working at X
some high impact positions
* client eng - ios / android / web
* infra eng - k8s / supercomputing / large distributed systems / network /
uv is very very fast. Super nice for when I ssh into a pod and just want to quickly install all dependencies and run some python script
good job @charliermarsh et al
1/ Existing theories of neural networks (NN) like NTK don't learn features, so they can't explain the success of pretraining (e.g. BERT, GPT3). We derive the *feature learning* ∞-width limit of NNs & pretrain such an ∞-width word2vec model: it learned semantics!
1/ I can't teach you how to dougie but I can teach you how to compute the Gaussian Process corresponding to infinite-width neural network of ANY architecture, feedforward or recurrent, eg: resnet, GRU, transformers, etc ... RT plz💪
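For a plain ReLU MLP the infinite-width GP kernel has a closed-form recursion (the classic arccosine-kernel computation; this sketch is mine and doesn't cover the general architectures the paper handles):

```python
# NNGP kernel of an infinite-width ReLU MLP at random init (my sketch of the
# standard arccosine-kernel recursion; the paper generalizes far beyond this).
import numpy as np

def nngp_relu(x1, x2, depth=3, sw2=2.0, sb2=0.0):
    k11 = sw2 * x1 @ x1 / len(x1) + sb2
    k22 = sw2 * x2 @ x2 / len(x2) + sb2
    k12 = sw2 * x1 @ x2 / len(x1) + sb2
    for _ in range(depth):
        c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(c)
        # E[relu(u) relu(v)] for (u, v) jointly Gaussian with the current covariance:
        k12 = sw2 * np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * c) / (2 * np.pi) + sb2
        k11 = sw2 * k11 / 2 + sb2          # E[relu(u)^2] = k11 / 2
        k22 = sw2 * k22 / 2 + sb2
    return k12

x1, x2 = np.random.randn(10), np.random.randn(10)
print(nngp_relu(x1, x2))
```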
Talagrand has an aptitude for distilling complex insights into clear and tasty bites in his research and exposition. I've benefited immensely from applying his most famous inequality but even more so from his books, which are written with wit and lucidity matched by none:
"Upper and Lower Bounds for Stochastic Processes"
Michel Talagrand has been awarded the Abel Prize, one of the highest honors in mathematics, for applying tools from high-dimensional geometry to complex probability problems.
@jordanacep reports:
I am delighted to announce publication of the 4th edition of Linear Algebra Done Right as an Open Access book. The electronic version is legally free to the world at .
That website also has links to pre-order the print version of the book.
#linearalgebra
1/ Does batchnorm make the optimization landscape more smooth? A prior paper says yes, but our new @iclr2019 paper shows BN causes grad explosion in randomly initialized deep BN nets. Contradiction? We clarify below
1/ How to scale hyperparams (eg learning rate) as neural network gets wider? Esp w/ adaptive optimizers like Adam?
I derived the answer (μP) in 2020 & verified it on GPT3
This required some beautiful new math that’s just been completely written down w/ @EtaiLittwin 🧵👇
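My shorthand for the resulting Adam rules, per weight tensor (a sketch from memory of the μTransfer paper's scaling table; use the official `mup` package rather than this for real work):

```python
# muP scaling for Adam (my paraphrase of the paper's table; a sketch,
# not the authoritative implementation).
def mup_adam_scaling(base_lr, fan_in, kind):
    """Return (init_std, adam_lr). kind: 'input' (fan_in fixed as width grows),
    'hidden' (fan_in = width), or 'output' (fan_out fixed as width grows)."""
    if kind == "input":
        return fan_in ** -0.5, base_lr
    if kind == "hidden":
        return fan_in ** -0.5, base_lr / fan_in
    if kind == "output":
        return 1.0 / fan_in, base_lr / fan_in
    raise ValueError(kind)

# e.g. a hidden layer at width 4096:
print(mup_adam_scaling(1e-3, 4096, "hidden"))
```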
We deepen a mysterious connection btw #topology & #learning in a new paper appearing in Advances in Applied Mathematics. Somehow, the # of samples needed to learn = the highest dimension of holes in some topological space! I wrote the paper but I'm still like WTF
1/ Gradients improve weights, so they better depend on the weights, right? Somehow, for calculating e.g. grad norm or NTK at init, grads might as well be backproped by random weights, independent from those used in forward pass. WTF? Let me explain (from )
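A one-layer caricature of the claim (my toy check, not from the paper): backprop an upstream gradient through an independent copy of the forward matrix, and the gradient norm comes out the same.

```python
# Toy check (mine): at init, backpropagating an upstream gradient g through
# W^T vs. through an independent copy V^T gives nearly identical norms.
import numpy as np

n = 8192
g = np.random.randn(n)                      # upstream gradient
W = np.random.randn(n, n) / np.sqrt(n)      # forward weights
V = np.random.randn(n, n) / np.sqrt(n)      # independent random weights
print(np.linalg.norm(W.T @ g), np.linalg.norm(V.T @ g))   # ≈ equal for large n
```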
Really cool library giving a type system to different kinds of matrices to speed up linear algebra! I really resonate with this because a key idea of Tensor Programs is that different (random) matrices have entirely different "types" (like random init vs gradients); if you track
We're ecstatic to officially announce our new library, CoLA! CoLA is a framework for large-scale linear algebra in machine learning and beyond, supporting PyTorch and JAX.
repo:
paper:
w/ amazing @m_finzi, Andres Potapczynski, Geoff Pleiss
1/8 Modern deep networks (with conv, (self-)attention, batchnorm, LSTM, etc) become Gaussian Processes when randomly initialized, as their widths grow to infinity. This and more are shown in my new paper. SOTA GPs here we come, @Jasch?
1/ The nonzero singular values histogram of a large square random matrix looks like a "quarter circle", sticking to the y-axis. However, if the sides are not equal, then the histogram "buds off" from the y-axis. In any case, we still can calculate the asymptotic shape of it!
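Numerically, the square case looks like this (my illustration, not from the thread):

```python
# Illustration (mine): singular values of a square Gaussian matrix with entry
# variance 1/n follow the quarter-circle density sqrt(4 - s^2) / pi on [0, 2].
import numpy as np
import matplotlib.pyplot as plt

n = 2000
W = np.random.randn(n, n) / np.sqrt(n)
s = np.linalg.svd(W, compute_uv=False)

grid = np.linspace(0, 2, 400)
plt.hist(s, bins=60, density=True, alpha=0.5, label="singular values")
plt.plot(grid, np.sqrt(4 - grid**2) / np.pi, label="quarter circle")
plt.legend()
plt.show()
# For a rectangular m x n matrix, the histogram instead "buds off" the y-axis,
# following the Marchenko-Pastur shape.
```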
You can now train your own feature learning infinite-width neural networks on word2vec and metalearning (w/ MAML) ! Our paper "Feature Learning in Infinite-Width Neural Networks" will also appear in ICML 2021. Cya there!
@edwardjhu
My cliff notes vers of holography <-> quantum error correction:
Holography says information in interior of universe can be recovered from info on its boundary, just like a message can be recovered from its encoding by error correcting code.
Actual error correction comes from
1/ It's exciting when an "applied" area feeds back to pure math. e.g. Witten's new proof of Positive Energy Thm by physics won him a Fields Medal. A reason I'm rly hyped about Tensor Programs: new proof of Semicircle Law by "neural network arguments"
GUYS DID YOU KNOW THE RED WEDDING OF #GoT HAPPENED IN FRANCE IN 1572?
The Catholic queen mother forced her daughter to marry a protestant prince, invited all the protestant nobility to the wedding in predominantly Catholic Paris, then bodied them all.
Aka St. Bartholomew's Day Massacre
RNNs and batchnorm will be coming soon, but you can already play with them here. The general theory for this is based on Tensor Programs
Give Neural Tangents a try and let us know what you think!
Announcing Neural Tangents, a new easy-to-use, open-source neural network library that enables researchers to build finite- and infinite-width versions of neural networks simultaneously. Grab the code and try it for yourself at
1/ In a neural network, activation vectors depend on the weight matrices in really complex, nonlinear ways. New paper: the activations are "independent" from the weights in a randomly initialized wide NN of any architecture! WTF!!
Our society is built on trust:
* human-human trust
* human-organization trust
* human-machine trust
It would run so slowly without trust.
It takes years to build but an instant to destroy.
We really need to treasure it!
What an incredible journey it's been over these 5+ yrs @MSFTResearch. I still remember the eureka moments, in the serenity of Building 99 past midnight, leading to Tensor Programs & μP. Forever grateful for MSR taking a chance on a kid straight out of undergrad.
1/ Neural network (NN) parametrization is super important folks!! The wrong param -- e.g. NTK, or, in fact, the pytorch/tensorflow defaults -- can make you diverge or prevent you from learning features in wide NNs!
1/ I reveal the evolution under gradient descent of neural network of *any architecture*, by showing how to compute its tangent kernel (NTK). This includes RNN, transformer, resnet, GANs, Faster RCNN, and more! Let's have theory catch up to practice!
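For finite networks you can compute the tangent kernel directly from autograd (a minimal PyTorch sketch of the definition, not the paper's infinite-width machinery):

```python
# Empirical NTK of a finite model (my sketch):
# K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>.
import torch

def empirical_ntk(model, x1, x2):
    def grad_vec(x):
        model.zero_grad()
        model(x).sum().backward()
        return torch.cat([p.grad.flatten().clone() for p in model.parameters()])
    return grad_vec(x1) @ grad_vec(x2)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
)
print(empirical_ntk(model, torch.randn(1, 10), torch.randn(1, 10)))
```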
1/2 How can physics and ML inform each other? We hope to find out at the Physics ∩ ML workshop @MSFTResearch commencing tomorrow!
Feat. awesome folks like Fields medalist Mike Freedman, Rumelhart prize winner Paul Smolensky, Sackler prize winner Mike Douglas
1/ Neural networks evolve like linear models just because of 1st order Taylor expansion -- the key intuition behind #NeuralTangentKernels. What's nontrivial & surprising: a *wide* NN can fit *any* data without moving params so much as to break the approximation of the Taylor expansion.
Infinitely-wide recurrent networks (i.e. the RNN Neural Tangent Kernel) are good at time series prediction with low data, who'd've thought! Such a calculation with an infinite-width RNN wouldn't have been possible without Tensor Programs!
1/ A ∞-wide NN of *any architecture* is a Gaussian process (GP) at init. The NN in fact evolves linearly in function space under SGD, so is a GP at *any time* during training. With Tensor Programs, we can calculate this time-evolving GP w/o training any NN
1/4 Batchnorm causes grad explosion in random-init MLP! Can’t fix this by changing nonlinearities! Relu+batchnorm explodes grad norm^2 by >=1.47 per layer, but linear activation minimizes the explosion rate at (B-2)/(B-3), B=batchsize. Our ICLR 2019 paper
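You can see the effect in a few lines (my toy check, not the paper's experiment): stack Linear+BN+ReLU blocks at random init and watch the input gradient grow with depth.

```python
# Toy check (mine): input gradients blow up with depth in a random-init
# batchnorm MLP; compare depth = 5 vs depth = 50.
import torch

def input_grad_norm(depth, width=256, batch=64):
    layers = []
    for _ in range(depth):
        layers += [torch.nn.Linear(width, width),
                   torch.nn.BatchNorm1d(width),
                   torch.nn.ReLU()]
    net = torch.nn.Sequential(*layers)
    x = torch.randn(batch, width, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

print(input_grad_norm(5), input_grad_norm(50))   # grows rapidly with depth
```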
Learnability (VC dim) is a *topological property*, as I proved in for parity, conjunctions, poly threshold fctns. Now this extends to downward-closed classes, conjunction of parities, and k-CNFs, as well! Just how far does this go?
Often, VC dimension of a concept class (“how many samples needed to learn a pattern?”) in #learning theory can be recovered from the *#algebraic #topology* of the class (“What are the holes in this topological space?”). Beautiful and mysterious phenomenon!
1/ Neural networks are Gaussian Processes --- the Poster Edition from #NeurIPS2019 last week. In case you missed it, here’s a twitter version of the poster presentation, following the format of @colinraffel; and here’s the previous tweet thread