I've been laid off from Meta. Our entire research/infra org "Probability" was cut.
I deeply appreciate the people who helped me get there, and the great people I worked with for a year and a half.
I hope to stay in the Bay Area a while longer, if anyone needs some algorithms.
KANs (NNs with learned functions on the edges) have a quite elegant representation using Tensor Diagrams.
This chart of MLP layers also shows some neat relationships between things like ReGLUs and MoEs.
Doing Matrix Calculus can be messy, especially when we need higher order derivatives.
Writing them out using Tensor Diagrams makes even the Hessian Chain Rule relatively simple:
It is a common misconception that LLMs are just trained to "predict the next token".
No. They are trained to predict an entire context window's worth of tokens, like 4k+. The gradients go end to end and the model is allowed to plan what it will say next.
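To make that concrete, here's a minimal numpy sketch (my own toy shapes, not any real model) of the training objective: cross-entropy is averaged over every position in the window, so every token contributes a gradient.

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean cross-entropy over *all* positions in the window.

    logits: (T, V) next-token scores at each of T positions.
    targets: (T,) the actual next token at each position.
    The loss (and hence the gradient) touches every position,
    not just the final one.
    """
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    T = len(targets)
    return -log_probs[np.arange(T), targets].mean()

rng = np.random.default_rng(0)
T, V = 8, 50                      # tiny "context window" and vocabulary
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
loss = causal_lm_loss(logits, targets)
```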
I always found the tensor notation in Fast Matrix Multiplication algorithms confusing. But using tensor diagrams it's pretty easy to see what's going on:
Retrieval Augmented Generation is such a hack. 🔨
Why would an embedding of your prompt coincide in embedding space with the documents needed to answer it?
Meanwhile, Transformers already have a key/query mechanism built in! 🔋 Can we just put all the key/value pairs into
Extended Mind Transformers (EMTs) are a new approach to working with very large contexts and external data sources, developed by @KlettPhoebe and @thomasahle on Normal's AI team. Inspired by the Extended Mind Thesis, we modify Multihead Attention to directly query a vector database.
The KAN hype has shown that many people think transformers still use plain MLPs‼️
However, all the big models we know of have switched to GLUs, such as Gemma (GeGLU), LLaMA (SwiGLU), and PaLM (SwiGLU).
These "activation functions" actually take two linear projections and multiply them elementwise.
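A minimal sketch of what such a layer computes (my own toy shapes; real models add normalization, biases vary by model):

```python
import numpy as np

def swiglu(x, W, V, W2):
    """A GLU-style MLP block (SwiGLU, no biases): two linear projections
    are multiplied elementwise, then projected back down."""
    swish = lambda t: t / (1.0 + np.exp(-t))   # SiLU / Swish
    return (swish(x @ W) * (x @ V)) @ W2       # elementwise gate * value

rng = np.random.default_rng(0)
d, h = 4, 8
x = rng.normal(size=(2, d))
W, V, W2 = (rng.normal(size=s) for s in [(d, h), (d, h), (h, d)])
out = swiglu(x, W, V, W2)
```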
My understanding of AlphaGeometry:
1) Translate the problem statement to symbolic form.
2) Try to solve the problem with a symbolic solver.
3) If it didn't work, use a language model to suggest an "auxiliary point" somewhere, such as a midpoint, then go to (2).
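The loop above, sketched as Python (the `symbolic_solver` and `suggest_auxiliary_point` stand-ins here are hypothetical, not DeepMind's actual components):

```python
def solve_geometry(problem, symbolic_solver, suggest_auxiliary_point, max_rounds=10):
    """Alternate between a symbolic engine and an LM proposing constructions.
    A sketch of the loop described above, not DeepMind's code."""
    for _ in range(max_rounds):
        proof = symbolic_solver(problem)          # step 2: try to solve
        if proof is not None:
            return proof
        # step 3: the LM suggests an auxiliary point, then we retry
        problem = problem + [suggest_auxiliary_point(problem)]
    return None

# Toy stand-ins: the "solver" succeeds once two facts are present.
solver = lambda p: "proof" if len(p) >= 2 else None
suggest = lambda p: f"aux_point_{len(p)}"
result = solve_geometry(["given"], solver, suggest)
```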
To teach the
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐
It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵
I was asked today about great 📚 books about 🎲 probability for students and practitioners interested in Algorithms and ML. These are some of my favorites that I keep coming back to 👇
Needle In A Haystack tests are flawed.
Did you know that the long-context ability of Gemini and GPT-4 is evaluated by inserting the sentence “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day” at a random location in a long text?
We’ve seen
Our recent experiments at @NormalComputing demonstrate that not all attention is equal!
XL-attention (large context windows) struggles with retrieval tasks well within context. Top-k attention (used in our Extended Mind approach) is cheap and effective.
Kronecker products allow you to visualize the entirety of many recursive algorithms in simple tensor diagrams.
Let's try to do this for the famous Fast Fourier Transform.
But first, consider its easier cousin, the Hadamard:
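As a warm-up in code: the fast Hadamard transform is the textbook divide-and-conquer recursion H₂ₙ = H₂ ⊗ Hₙ. A minimal numpy sketch:

```python
import numpy as np

def fht(x):
    """Fast (Walsh-)Hadamard transform in O(n log n).
    Uses H_{2n} = [[H_n, H_n], [H_n, -H_n]], i.e. H_2 ⊗ H_n as a
    Kronecker product, applied recursively."""
    n = len(x)
    if n == 1:
        return x.copy()
    a, b = fht(x[: n // 2]), fht(x[n // 2 :])
    return np.concatenate([a + b, a - b])

x = np.arange(8.0)
y = fht(x)
```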
This is analyzed using the "master theorem" of
Anthropic's Mathematical Framework for Transformer Circuits is all about Kronecker products, because they give you a way to model parallel computation (transformer heads).
The Kronecker Product in Linear Algebra is just a tensor product "flattened" on both sides.
We can illustrate this with tensor diagrams, by defining the "flattening tensor", ▷ᵢⱼₖ = [i + j·n = k].
Here is the Matrix Cookbook section translated into diagram form:
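A quick numpy sanity check of that statement (arbitrary shapes of my choosing): flattening both sides of the outer product recovers `np.kron`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(4, 5))

# Tensor (outer) product, shape (2, 4, 3, 5)
T = np.einsum('ij,kl->ikjl', A, B)

# "Flattening" both sides, (i,k) -> i*4+k and (j,l) -> j*5+l,
# gives exactly the Kronecker product.
K = T.reshape(2 * 4, 3 * 5)
```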
Complexity of Matrix Multiplication ≈ Complexity of Matrix Inversion. Why?
Turns out there is a really neat trick to compute matrix multiplications using inverses:
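The trick I have in mind (a standard one; the diagram may show a variant): embed A and B in a block upper-triangular matrix whose inverse contains the product AB.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
I, Z = np.eye(n), np.zeros((n, n))

# The block matrix [[I, A, 0], [0, I, B], [0, 0, I]] has inverse
# [[I, -A, AB], [0, I, -B], [0, 0, I]], so the product AB appears
# in the top-right block of the inverse.
M = np.block([[I, A, Z], [Z, I, B], [Z, Z, I]])
AB = np.linalg.inv(M)[:n, 2 * n :]
```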
This can also be used with our "Fast Thermodynamic Matrix Inversion" to get Fast
The prospect of using thermodynamic computers for AI applications and probabilistic reasoning is enticing. With linear algebra as a key first step, our theory paper showed asymptotic speedup (relative to digital methods) that scales linearly in dimension
Exciting new paper: TextGrad: Automatic “Differentiation” via Text!
This is a DSPy-like framework for optimizing prompts in a composite LLM system. However, there is one major difference!
In DSPy the idea is (basically):
- Forward pass: Each Module (LLM call) picks a
In our recent NeurIPS paper we had to show the following cute inequality:
For a real-world application, ask all your friends to think of a number. Divide each number by the sum, and in expectation you'll get at least 1/n. This holds even if you give your friends weights.
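A quick Monte Carlo check of the unweighted case (my own toy setup; by symmetry the expectation is exactly 1/n for IID numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
X = rng.exponential(size=(trials, n))   # everyone picks a positive number
shares = X[:, 0] / X.sum(axis=1)        # first friend's share of the sum
est = shares.mean()                     # ≈ 1/n by symmetry
```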
Seems
Clustering the Sketch: Dynamic Compression for Embedding Tables
Paper Website:
ArXiv:
Embedding tables are used by all machine learning systems that work with categorical data, like user IDs or word tokens. At Meta we had
A frequently overlooked fact is that KMeans is simply a matrix factorization algorithm X ≈ HM, where each row of H is limited to a single 1.
What if we allowed H to have 2 ones per row instead?👇
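The overlooked fact, as a runnable sketch (plain Lloyd iterations in numpy; the empty-cluster handling is my own choice):

```python
import numpy as np

def kmeans_factorization(X, k, iters=50, seed=0):
    """KMeans viewed as X ≈ H @ M:
    H is (n, k) with a single 1 per row (the cluster assignment),
    M is (k, d) holding the centroids."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = X[rng.choice(n, k, replace=False)]                   # init centroids
    for _ in range(iters):
        # Assignment step: each row of H picks its nearest centroid.
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        H = np.eye(k)[dists.argmin(axis=1)]
        # Update step: M is the least-squares solution given H.
        counts = H.sum(axis=0, keepdims=True).T
        M = np.where(counts > 0, H.T @ X / np.maximum(counts, 1), M)
    return H, M
```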
@karpathy
Of all the algorithms you learn in CS, who would have thought Topological Sort would be the one to continuously spawn billion-dollar companies?
Saying "Synthetic data generation is just GPT-4 distillation" is like saying "Writing textbooks is just Human distillation." A lot goes into creating good synthetic datasets!
The Self-Prompt technique by Li et al. is a cool example of generating a hard Q&A dataset.
Let's walk
Activation functions differ drastically in how an MLP maps its inputs at initialization.
Inspired by @kellerjordan0 (), I initialized some "deep and wide" MLPs and observed how the angles and norms of points develop through the network.
Surprisingly, GeLU
I'm just starting to learn about this neural tangent kernel / NNGP / tensor programs stuff--
It was pretty amusing to find out that deep MLPs have nearly-constant output at initialization
The "Trace" section of The Matrix Cookbook translated into tensor diagrams.
The nice thing about the diagrams is that they show why the formulas are true.
Enter thermodynamic computing. In this preprint, Thermodynamic Linear Algebra (), we show that a system of coupled oscillators in contact with a heat reservoir can be used to solve linear systems in an amount of time proportional to the number of variables.
When I started in NLP, word vectors were state of the art thinking. 🧠
In Oxford we were very much still doing grammar based computational linguistics, but I remember one talk that suggested a middle path.💡
The idea was that if a noun phrase is a vector, then an adjective must
I just learned about @jcvbcn and @HongxunWu's beautiful algorithm for Subset Sum in Õ(n+t) time, where t is the target sum.
Recall the classical dynamic programming solution (by Bellman 1957) takes O(n⋅t) time.
How is this improvement possible?
1/n
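For contrast, the classical Bellman DP is only a few lines:

```python
def subset_sum(nums, t):
    """Bellman's O(n·t) dynamic program: reachable[s] is True iff some
    subset of the numbers seen so far sums to exactly s."""
    reachable = [True] + [False] * t
    for x in nums:
        for s in range(t, x - 1, -1):   # go backwards so each item is used once
            reachable[s] = reachable[s] or reachable[s - x]
    return reachable[t]
```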
We took a large file, made a copy with some edits, and asked Gemini Pro 1.5 to find the differences...
It completely failed. Not only did it not find any of the differences, it returned a long list of completely hallucinated changes!
This shows that long contexts, while good for
Per Thomas's suggestion, I ran this experiment with @google Gemini Pro 1.5, the LLM with the longest context as of today.
I grabbed a lengthy gov document, made a copy with 10 random text edits.
The result is... 100% hallucination, unfortunately.
@johann_josefy
@Carnage4Life
Have they ever closed a service and not given you a chance to download and migrate?
If that ever happens, just buy an SSD and download to that 🤷♂️
SELUs were a nice activation function because they preserved the mean and variance of inputs, E[x]=0, E[x²]=1.
However, they looked weird and didn't go to 0 as x → −∞. Just look at it:
Can we fix this?
Most people today use Swish and/or Gelu. (At least in major transformer
Can we use LLMs to discover better PyTorch programs?
This weekend, @yaroslavvb challenged me to use an LLM to find a model that trains CIFAR-10 to 95% accuracy in 5s on an A100. (Like @kellerjordan0 did.) I didn't exactly succeed, but I did find some pretty fun programs.
Let's
@sharongoldman
19 people doing Bayesian Modeling, 9 people doing Ranking and Recommendations, 5 people doing ML Efficiency, 17 people doing AI for Chip Design and Compilers. Plus managers and such.
VLMs aren't blind:
@danielcorin1
Changing the prompt from "How many times do the blue and red lines intersect?" to "How many times do the blue and red line plots cross each other?" increases the accuracy of Claude 3.5 Sonnet from 73% to 95%.
Benchmarking
@thomasahle
I disagree. “+” makes sense linguistically and mathematically in almost all uses. Same with “-”. “@” does not linguistically read right; we read the common @ far more than the weird Python “@”.
If it actually made sense, people would use it more.
Andrew's list for working with large context models:
(1) Write quick, simple prompts
(2) Iteratively, flesh out a mega-prompt
(3) Few-shot or many-shot examples
(4) Break into subtasks / agentic workflow
I want to suggest an alternative "Eval Driven" workflow:
(1) Write quick,
This week, Google announced a doubling of Gemini Pro 1.5's input context window from 1 million to 2 million tokens, and OpenAI released GPT-4o, which generates tokens 2x faster and 50% cheaper than GPT-4 Turbo and natively accepts and generates multimodal tokens. I view these
When I first drew Strassen's algorithm as a Tensor Diagram, I chickened out at the last step.
But here I give you the complete diagram: "Strassen's Kringle".
I did a new analysis of our Thermodynamic Linear Algebra algorithm, based on continuously integrating a simple stochastic differential equation.
It's interesting that if you want to find x such that ‖Ax−b‖₂ < ε, you can do it in time O(d ε⁻²) on a machine like
This is why stuff like Medusa works: You can add extra decoder heads and predict multiple future tokens at once.
Really it's crazy to think you could produce meaningful text if you only think about "the next token".
@Thom_Wolf
I'm surprised the MuMath-Code paper wasn't more discussed on Twitter. It is awesome! Super dense in information.
The only reference I found was: , who incidentally tweets about a lot of really interesting papers that don't get much attention.
New algorithm by Andoni and Beaglehole uses multiplicative weight updates to optimize Locality Sensitive Hashing for a given dataset.
This gives a practical, yet robust solution to high-dimensional nearest neighbour search.
1/2
[cs.LG] Learning to Hash Robustly, with Guarantees. (arXiv:2108.05433v1 [cs.DS])
The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Local…
@LongFormMath
Roses are red, violets are blue,
Calculus is the poetry that nature imbues,
Integration by parts is a technique divine,
It brings us closer to the truth, like two souls intertwined. 🤖
Here is a simple 3-vector function that should be linear memory, but is quadratic memory in torch or numpy:
𝚛𝚎𝚕𝚞(𝚡.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟶) + 𝚢.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟷)) @ 𝚣
Is there a trick to make this linear memory without using python loops, or writing a new cuda kernel?
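For reference, one partial workaround I know of (not a full answer to the question): chunking over one axis keeps the peak at O(chunk·n) memory instead of O(n²). A numpy sketch:

```python
import numpy as np

def relu_outer_matvec(x, y, z, chunk=64):
    """Computes relu(y[:, None] + x[None, :]) @ z without materializing
    the full (m, n) matrix: only a (chunk, n) slab is live at a time."""
    out = np.empty_like(y)
    for i in range(0, len(y), chunk):
        block = np.maximum(y[i : i + chunk, None] + x[None, :], 0.0)
        out[i : i + chunk] = block @ z
    return out
```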
What is the hackiest MNIST classifier that gets > 10% accuracy?
For example, taking the mean value of each image, and using a nearest centroid classifier, gives 22% accuracy.
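The centroid hack, sketched generically (my own code; feed it the mean-brightness scalars, or raw pixels, as features):

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Represent each class by the mean of its training examples and
    classify each test point by its nearest class centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]
```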
If you like the graphical Hessian Chain Rule, you may be interested in @yaroslavvb's question (here ) on how to actually compute it most efficiently.
This relates to the Matrix Chain Problem, which has a nice dynamic programming solution, but for sums of
This week I trained an 800K-parameter transformer to learn 5-digit multiplication.
Then I replaced "xxx*yyy=?" with "xxx*yyy_____=?", giving the model 5 extra tokens for "computation".
Now the model quickly learned 7-digit multiplication.
It's a nice trick.
"Chain of thought" allows the model to "think" more, or expend more FLOPs, thereby improving performance.
Does this imply that giving LLMs large amounts of padding tokens will improve performance as well? 🤔
Also forces increased FLOPs in computing the answer.
The fourth moment bound generalizes Cantelli's inequality to give lower bounds on "tail" probabilities even when the mean is zero.
But what if you want P[X ≥ λ] ≥ ...?
I keep seeing code like torch.matmul(A, B) or matmul(A, B) in Hugging Face and other open-source libraries.
Why no love for the operator form, A @ B ?
@ccanonne_
Represent the sets as bit-strings. For each position you allow three cases: (0,0), (0,1), (1,1), but not (1,0). Each bit is independent so you get (3/4)^n.
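A quick Monte Carlo check of that (3/4)^n count, with the event being A ⊆ B for uniformly random sets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 8, 200_000
A = rng.integers(0, 2, size=(trials, n)).astype(bool)
B = rng.integers(0, 2, size=(trials, n)).astype(bool)
# A ⊆ B iff no position shows the forbidden pattern (1, 0)
subset = ~(A & ~B).any(axis=1)
est = subset.mean()              # should be ≈ (3/4)**n
```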
Definitely check out "The Probabilistic Method" by Joel Spencer and Noga Alon. The first four chapters really give a nice idea of the algorithmic stuff you can do with probability. The remaining chapters show you how wide the field is. Every proof is "from the book".
We know A⁻¹x can be computed in n²√κ time using the conjugate gradient method.
But what about other powers, like A⁰ᐧ⁵x?
Turns out they can be too! But it requires some pretty weird contour integrals and elliptic functions:
Being an Open Source maintainer in 2024:
User: My code doesn't work
Me: Where did you get this code from? That doesn't look like our API at all
User: GPT
Me: Do you mind reading our docs instead?
User: ...
Do you prefer your math formulas look like this, rather than long pages of incomprehensible matrix multiplications?
Then you might want to try tensorgrad:
It's a library for symbolic tensor manipulation and differentiation I just made! I'll post more
Alternatively, we can write out all the terms in the cubic formula (without biases) like this.
I wonder if there's some kind of series expansion rule in play here?
Did anybody try using genetic programming to improve LLM agents' prompts?
You let a bunch of them run with somewhat different prompts/rules/guidelines. Then combine the best pairs to form the next generation.
You could also just make mutations (asexual reproduction), that gives
Linear systems, Ax=b, can be solved in O(n²√κ) time (using Conjugate Gradient), so systems with a low condition number (κ) can be solved much faster than n^ω (matrix inversion or decomposition).
Can other linear algebra problems also be solved faster?
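For reference, a minimal textbook Conjugate Gradient in numpy (assumes A is symmetric positive-definite):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iters=None):
    """Textbook CG for symmetric positive-definite A: converges in
    O(√κ · log(1/ε)) iterations of O(n²) work each."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iters or 10 * n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```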
I will be speaking today at the Sydney Algorithms Seminar on Tensor Sketching with applications to kernel tricks and neural network compression.
Join in at 11am Sydney time :-) it's cool stuff. Thanks to @ccanonne_ for organizing.
Yesterday I left @Meta (after being laid off in November and rehired in January) to join @NormalComputing.
Normal is kinda in stealth right now, but we'll share more soon! I'm excited to be working with @FarisSbahi, @ColesThermoAI, @remilouf, and the rest of the amazing team!
Has anyone seen this distribution?
It is the (numerically) closest distribution (in l1 norm) to uniform that can be written as the sum of two identically distributed random variables.
But is it known? Does it have a name? A family?
Making a synthetic dataset of mathematical proofs is hard! It's easy to make a whole lot of "1+1+1+...=491" style theorems.
I'm surprised this method of random construction and transformation finds so many classical geometric theorems.
Maybe because the domain is somewhat
@ellewoodsgolfs
@ProfRobAnderson
Of course it's a joke, but it makes a good point about a lot of faculty being severely underpaid.
If universities don't want to pay lecturers, maybe students need to start tipping.
@roydanroy
Even in 2020 we already had: Sparse Transformers (Child et al., 2019), Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), Longformer (Beltagy et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al.,
The Fast Hadamard transform is one of those algorithms that are "so simple it must be optimal". A perfect example of a recursive algorithm.
However, @firebat03 just showed that it can be improved using entirely non-simple techniques:
Some more notes, the model is quite small, 12 layers & 1,024 dim. This should be encouraging for researchers.
On the other hand, they don't just greedily decode the model, but use a beam search over 512 generations. Makes sense to extract the highest quality auxiliaries before
I'm fascinated by the idea of "bidirectional parsing":
It's something like a combination of a templating language and a parser. If we could use this for #dspy, the LLM could optimize its own prompting templates, and parsing outputs would come for free.
There are so many great "Named Tensors" libraries.
Why did none of them take off?
Named Tensor:
Tensor Shape Annotations:
Axis Arrays:
PyTorch's named tensors:
Claude 2.1’s 200K token context window is powerful, but requires careful prompting to use effectively.
Learn how to get Claude to recall an individual sentence across long documents with high fidelity:
Is it inconsistent how 𝚗𝚞𝚖𝚙𝚢.𝚍𝚒𝚊𝚐 behaves on vectors vs matrices?
Using tensor-diagrams, it actually makes a lot of sense!
And it's easy to generalize the behavior to arbitrary tensor sizes.
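The two behaviors side by side:

```python
import numpy as np

v = np.array([1, 2, 3])
M = np.arange(9).reshape(3, 3)

D = np.diag(v)   # vector -> matrix: v on the diagonal, zeros elsewhere
d = np.diag(M)   # matrix -> vector: extracts the diagonal
```

So `diag` is (close to) its own inverse: `np.diag(np.diag(v))` recovers `v`.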
"In math, when an author starts a sentence with 𝘤𝘭𝘦𝘢𝘳𝘭𝘺, what they are really saying is this seems clear to 𝘮𝘦, and I probably should have checked it, but I got a little confused, so I settle for just asserting that it was clear" -
Clearly, @JSEllenberg is on to me...
If you sample a random nxn matrix with IID entries from a normal distribution, you get that the condition number, κ, is roughly n:
lim_{n→∞} Pr[κ/n < x] = exp(−2/x − 2/x²)
For many other sub-gaussian distributions you seem to get similar CDFs, as shown by @AlanEdelmanMIT in
Llama 2 uses Grouped Query Attention (Ainslie et al.), which has the benefit of allowing multi-GPU parallelism (one key/value per GPU) while still allowing more queries than key/values, which increases throughput.
Better than my idea of just having all queries attend to all keys.
I played around with some Sparse Recovery algorithms to use for matrix recovery (this project: )
This is of course the famous idea by Terence Tao, Tropp and many others.
Let me try to explain the 4 most common algorithms, and code them in Python. 👇
New tool we made for visualizing thinking in LLMs, including Tree of Thoughts and Reflexion. Together these methods give state-of-the-art code generation and general problem solving.