I've been laid off from Meta. Our entire research/infra org "Probability" was cut.
I deeply appreciate the people who helped me get there, and the great people I worked with for a year and a half.
I hope to stay in the Bay Area a while longer, if anyone needs some algorithms.
KANs (NNs with learned functions on the edges) have a quite elegant representation using Tensor Diagrams.
This chart of MLP layers also shows some neat relationships between things like ReGLUs and MoEs.
Doing Matrix Calculus can be messy, especially when we need higher order derivatives.
Writing them out using Tensor Diagrams makes even the Hessian Chain Rule relatively simple:
It is a common misconception that LLMs are just trained to "predict the next token".
No. They are trained to predict an entire context window's worth of tokens, like 4k+. The gradients go end to end and the model is allowed to plan what it will say next.
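To make that concrete, here's a minimal numpy sketch (my own toy shapes, not any real model) of the training objective: cross-entropy is averaged over every position in the window, so every token contributes a gradient.

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean cross-entropy over *all* positions in the window.

    logits: (T, V) next-token scores at each of T positions.
    targets: (T,) the actual next token at each position.
    The loss (and hence the gradient) touches every position,
    not just the final one.
    """
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    T = len(targets)
    return -log_probs[np.arange(T), targets].mean()

rng = np.random.default_rng(0)
T, V = 8, 50                      # tiny "context window" and vocabulary
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
loss = causal_lm_loss(logits, targets)
```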
I always found the tensor notation in Fast Matrix Multiplication algorithms confusing. But using tensor diagrams it's pretty easy to see what's going on:
Retrieval Augmented Generation is such a hack. 🔨
Why would an embedding of your prompt coincide in embedding space with the documents needed to answer it?
Meanwhile, Transformers already have a key/query mechanism built in! 🔋 Can we just put all the key/value pairs into
Extended Mind Transformers (EMTs) are a new approach to working with very large contexts and external data sources, developed by @KlettPhoebe and @thomasahle on Normal's AI team. Inspired by the Extended Mind Thesis, we modify Multihead Attention to directly query a vector database.
The KAN hype has shown that many people think transformers still use plain MLPs‼️
However, all the big models we know of have switched to GLUs, such as Gemma (GeGLU), LLaMA (SwiGLU), and PaLM (SwiGLU).
These "activation functions" actually take two linear projections and multiply them elementwise.
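A minimal sketch of what such a layer computes (my own toy shapes; real models add normalization, biases vary by model):

```python
import numpy as np

def swiglu(x, W, V, W2):
    """A GLU-style MLP block (SwiGLU, no biases): two linear projections
    are multiplied elementwise, then projected back down."""
    swish = lambda t: t / (1.0 + np.exp(-t))   # SiLU / Swish
    return (swish(x @ W) * (x @ V)) @ W2       # elementwise gate * value

rng = np.random.default_rng(0)
d, h = 4, 8
x = rng.normal(size=(2, d))
W, V, W2 = (rng.normal(size=s) for s in [(d, h), (d, h), (h, d)])
out = swiglu(x, W, V, W2)
```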
My understanding of AlphaGeometry:
1) Translate the problem statement to symbolic form.
2) Try to solve the problem with a symbolic solver.
3) If it didn't work, use a language model to suggest an "auxiliary point" somewhere, such as a midpoint, then go to (2).
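The loop above, sketched as Python (the `symbolic_solver` and `suggest_auxiliary_point` stand-ins here are hypothetical, not DeepMind's actual components):

```python
def solve_geometry(problem, symbolic_solver, suggest_auxiliary_point, max_rounds=10):
    """Alternate between a symbolic engine and an LM proposing constructions.
    A sketch of the loop described above, not DeepMind's code."""
    for _ in range(max_rounds):
        proof = symbolic_solver(problem)          # step 2: try to solve
        if proof is not None:
            return proof
        # step 3: the LM suggests an auxiliary point, then we retry
        problem = problem + [suggest_auxiliary_point(problem)]
    return None

# Toy stand-ins: the "solver" succeeds once two facts are present.
solver = lambda p: "proof" if len(p) >= 2 else None
suggest = lambda p: f"aux_point_{len(p)}"
result = solve_geometry(["given"], solver, suggest)
```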
To teach the
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐
It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵
I was asked today about great 📚 books about 🎲 probability for students and practitioners interested in Algorithms and ML. These are some of my favorites that I keep coming back to 👇
Needle In A Haystack tests are flawed.
Did you know that the long-context ability of Gemini and GPT-4 is evaluated by inserting the sentence “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day” at a random location in a long text?
We’ve seen
Our recent experiments at @NormalComputing demonstrate that not all attention is equal!
XL-attention (large context windows) struggles with retrieval tasks well within context. Top-k attention (used in our Extended Mind approach) is cheap and effective.
Kronecker products allow you to visualize the entirety of many recursive algorithms in simple tensor diagrams.
Let's try to do this for the famous Fast Fourier Transform.
But first, consider its easier cousin, the Hadamard:
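As a warm-up in code: the fast Hadamard transform is the textbook divide-and-conquer recursion H₂ₙ = H₂ ⊗ Hₙ. A minimal numpy sketch:

```python
import numpy as np

def fht(x):
    """Fast (Walsh-)Hadamard transform in O(n log n).
    Uses H_{2n} = [[H_n, H_n], [H_n, -H_n]], i.e. H_2 ⊗ H_n as a
    Kronecker product, applied recursively."""
    n = len(x)
    if n == 1:
        return x.copy()
    a, b = fht(x[: n // 2]), fht(x[n // 2 :])
    return np.concatenate([a + b, a - b])

x = np.arange(8.0)
y = fht(x)
```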
This is analyzed using the "master theorem" of
Anthropic's Mathematical Framework for Transformer Circuits is all about Kronecker products, because they give you a way to model parallel computation (transformer heads).
The Kronecker Product in Linear Algebra is just a tensor product "flattened" on both sides.
We can illustrate this with tensor diagrams, by defining the "flattening tensor", ▷ᵢⱼₖ = [i + j·n = k].
Here is the Matrix Cookbook section translated into diagram form:
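A quick numpy sanity check of that statement (arbitrary shapes of my choosing): flattening both sides of the outer product recovers `np.kron`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(4, 5))

# Tensor (outer) product, shape (2, 4, 3, 5)
T = np.einsum('ij,kl->ikjl', A, B)

# "Flattening" both sides, (i,k) -> i*4+k and (j,l) -> j*5+l,
# gives exactly the Kronecker product.
K = T.reshape(2 * 4, 3 * 5)
```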
Complexity of Matrix Multiplication ≈ Complexity of Matrix Inversion. Why?
Turns out there is a really neat trick to compute matrix multiplications using inverses:
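The trick I have in mind (a standard one; the diagram may show a variant): embed A and B in a block upper-triangular matrix whose inverse contains the product AB.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
I, Z = np.eye(n), np.zeros((n, n))

# The block matrix [[I, A, 0], [0, I, B], [0, 0, I]] has inverse
# [[I, -A, AB], [0, I, -B], [0, 0, I]], so the product AB appears
# in the top-right block of the inverse.
M = np.block([[I, A, Z], [Z, I, B], [Z, Z, I]])
AB = np.linalg.inv(M)[:n, 2 * n :]
```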
This can also be used with our "Fast Thermodynamic Matrix Inversion" to get Fast
The prospect of using thermodynamic computers for AI applications and probabilistic reasoning is enticing. With linear algebra as a key first step, our theory paper showed asymptotic speedup (relative to digital methods) that scales linearly in dimension
Exciting new paper: TextGrad: Automatic “Differentiation” via Text!
This is a DSPy-like framework for optimizing prompts in a composite LLM system. However, there is one major difference!
In DSPy the idea is (basically):
- Forward pass: Each Module (LLM call) picks a
In our recent NeurIPS paper we had to show the following cute inequality:
For a real-world application, ask all your friends to think of a number. Divide each number by the sum, and in expectation you'll get at least 1/n. This holds even if you give your friends weights.
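A quick Monte Carlo check of the unweighted case (my own toy setup; by symmetry the expectation is exactly 1/n for IID numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
X = rng.exponential(size=(trials, n))   # everyone picks a positive number
shares = X[:, 0] / X.sum(axis=1)        # first friend's share of the sum
est = shares.mean()                     # ≈ 1/n by symmetry
```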
Seems
Clustering the Sketch: Dynamic Compression for Embedding Tables
Paper Website:
ArXiv:
Embedding tables are used by all machine learning systems that work with categorical data, like user IDs or word tokens. At Meta we had
A frequently overlooked fact is that KMeans is simply a matrix factorization algorithm X ≈ HM, where each row of H is limited to a single 1.
What if we allowed H to have 2 ones per row instead?👇
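The overlooked fact, as a runnable sketch (plain Lloyd iterations in numpy; the empty-cluster handling is my own choice):

```python
import numpy as np

def kmeans_factorization(X, k, iters=50, seed=0):
    """KMeans viewed as X ≈ H @ M:
    H is (n, k) with a single 1 per row (the cluster assignment),
    M is (k, d) holding the centroids."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    M = X[rng.choice(n, k, replace=False)]                   # init centroids
    for _ in range(iters):
        # Assignment step: each row of H picks its nearest centroid.
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        H = np.eye(k)[dists.argmin(axis=1)]
        # Update step: M is the least-squares solution given H.
        counts = H.sum(axis=0, keepdims=True).T
        M = np.where(counts > 0, H.T @ X / np.maximum(counts, 1), M)
    return H, M
```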
@karpathy
Of all the algorithms you learn in CS, who would have thought Topological Sort would be the one to continuously spawn billion-dollar companies?
Saying "Synthetic data generation is just GPT-4 distillation" is like saying "Writing textbooks is just Human distillation." A lot goes into creating good synthetic datasets!
The Self-Prompt technique by Li et al. is a cool example of generating a hard Q&A dataset.
Let's walk
Activation functions differ drastically in how an MLP maps its inputs at initialization.
Inspired by @kellerjordan0 (), I initialized some "deep and wide" MLPs and observed how the angles and norms of points develop through the network.
Surprisingly, GeLU
I'm just starting to learn about this neural tangent kernel / NNGP / tensor programs stuff--
It was pretty amusing to find out that deep MLPs have nearly-constant output at initialization
The "Trace" section of The Matrix Cookbook translated into tensor diagrams.
The nice thing about the diagrams is that they show why the formulas are true.
Enter thermodynamic computing. In this preprint, Thermodynamic Linear Algebra (), we show that a system of coupled oscillators in contact with a heat reservoir can be used to solve linear systems in an amount of time proportional to the number of variables.
When I started in NLP, word vectors were state of the art thinking. 🧠
In Oxford we were very much still doing grammar based computational linguistics, but I remember one talk that suggested a middle path.💡
The idea was that if a noun phrase is a vector, then an adjective must
I just learned about @jcvbcn and @HongxunWu's beautiful algorithm for Subset Sum in Õ(n+t) time, where t is the target sum.
Recall the classical dynamic programming solution (by Bellman 1957) takes O(n⋅t) time.
How is this improvement possible?
1/n
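For contrast, the classical Bellman DP is only a few lines:

```python
def subset_sum(nums, t):
    """Bellman's O(n·t) dynamic program: reachable[s] is True iff some
    subset of the numbers seen so far sums to exactly s."""
    reachable = [True] + [False] * t
    for x in nums:
        for s in range(t, x - 1, -1):   # go backwards so each item is used once
            reachable[s] = reachable[s] or reachable[s - x]
    return reachable[t]
```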
We took a large file, made a copy with some edits, and asked Gemini Pro 1.5 to find the differences...
It completely failed. Not only did it not find any of the differences, it returned a long list of completely hallucinated changes!
This shows that long contexts, while good for
Per Thomas's suggestion, I ran this experiment with @google Gemini Pro 1.5, the LLM with the longest context as of today.
I grabbed a lengthy gov document, made a copy with 10 random text edits.
The result is... 100% hallucination, unfortunately.
@johann_josefy
@Carnage4Life
Have they ever closed a service and not given you a chance to download and migrate?
If that ever happens, just buy an SSD and download to that 🤷♂️
SELUs were a nice activation function because they preserved the mean and variance of inputs, E[x]=0, E[x²]=1.
However, they looked weird and didn't go to 0 as x → −∞. Just look at it:
Can we fix this?
Most people today use Swish and/or Gelu. (At least in major transformer
Can we use LLMs to discover better PyTorch programs?
This weekend, @yaroslavvb challenged me to use an LLM to find a model that trains CIFAR-10 to 95% accuracy in 5s on an A100. (Like @kellerjordan0 did.) I didn't exactly succeed, but I did find some pretty fun programs.
Let's
@sharongoldman
19 people doing Bayesian Modeling, 9 people doing Ranking and Recommendations, 5 people doing ML Efficiency, 17 people doing AI for Chip Design and Compilers. Plus managers and such.
VLMs aren't blind:
@danielcorin1
Changing the prompt from "How many times do the blue and red lines intersect?" to "How many times do the blue and red line plots cross each other?" increases the accuracy of Claude 3.5 Sonnet from 73% to 95%.
Benchmarking
@thomasahle
I disagree. “+” makes sense linguistically and mathematically in almost all uses. Same with “-”. “@” does not linguistically read right; we read the common @ far more than the weird Python “@”.
If it actually made sense, people would use it more.
Andrew's list for working with large context models:
(1) Write quick, simple prompts
(2) Iteratively, flesh out a mega-prompt
(3) Few-shot or many-shot examples
(4) Break into subtasks / agentic workflow
I want to suggest an alternative "Eval Driven" workflow:
(1) Write quick,
This week, Google announced a doubling of Gemini Pro 1.5's input context window from 1 million to 2 million tokens, and OpenAI released GPT-4o, which generates tokens 2x faster and 50% cheaper than GPT-4 Turbo and natively accepts and generates multimodal tokens. I view these
When I first drew Strassen's algorithm as a Tensor Diagram, I chickened out at the last step.
But here I give you the complete diagram: "Strassen's Kringle".
I did a new analysis of our Thermodynamic Linear Algebra algorithm, based on continuously integrating a simple stochastic differential equation.
It's interesting that if you want to find x such that ‖Ax−b‖₂ < ε, you can do it in time O(d ε⁻²) on a machine like
This is why stuff like Medusa works: You can add extra decoder heads and predict multiple future tokens at once.
Really it's crazy to think you could produce meaningful text if you only think about "the next token".
@Thom_Wolf
I'm surprised the MuMath-Code paper wasn't more discussed on Twitter. It is awesome! Super dense in information.
The only reference I found was: , who incidentally tweets about a lot of really interesting papers that don't get much attention.
New algorithm by Andoni and Beaglehole uses multiplicative weight updates to optimize Locality Sensitive Hashing for a given dataset.
This gives a practical, yet robust solution to high-dimensional nearest neighbour search.
1/2
[cs.LG] Learning to Hash Robustly, with Guarantees. (arXiv:2108.05433v1 [cs.DS])
The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Local…
@LongFormMath
Roses are red, violets are blue,
Calculus is the poetry that nature imbues,
Integration by parts is a technique divine,
It brings us closer to the truth, like two souls intertwined. 🤖
Here is a simple 3-vector function that should be linear memory, but is quadratic memory in torch or numpy:
𝚛𝚎𝚕𝚞(𝚡.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟶) + 𝚢.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟷)) @ 𝚣
Is there a trick to make this linear memory without using python loops, or writing a new cuda kernel?
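For reference, one partial workaround I know of (not a full answer to the question): chunking over one axis keeps the peak at O(chunk·n) memory instead of O(n²). A numpy sketch:

```python
import numpy as np

def relu_outer_matvec(x, y, z, chunk=64):
    """Computes relu(y[:, None] + x[None, :]) @ z without materializing
    the full (m, n) matrix: only a (chunk, n) slab is live at a time."""
    out = np.empty_like(y)
    for i in range(0, len(y), chunk):
        block = np.maximum(y[i : i + chunk, None] + x[None, :], 0.0)
        out[i : i + chunk] = block @ z
    return out
```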
What is the hackiest MNIST classifier that gets > 10% accuracy?
For example, taking the mean value of each image, and using a nearest centroid classifier, gives 22% accuracy.
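The centroid hack, sketched generically (my own code; feed it the mean-brightness scalars, or raw pixels, as features):

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Represent each class by the mean of its training examples and
    classify each test point by its nearest class centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]
```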
If you like the graphical Hessian Chain Rule, you may be interested in @yaroslavvb's question (here ) on how to actually compute it most efficiently.
This relates to the Matrix Chain Problem, which has a nice dynamic programming solution, but for sums of
This week I trained an 800K-parameter transformer to learn 5-digit multiplication.
Then I replaced "xxx*yyy=?" with "xxx*yyy_____=?", giving the model 5 extra tokens for "computation".
Now the model quickly learned 7-digit multiplication.
It's a nice trick.
"Chain of thought" allows the model to "think" more, or expend more FLOPs, thereby improving performance.
Does this imply that giving LLMs large amounts of padding tokens will improve performance as well? 🤔
Also forces increased FLOPs in computing the answer.
The fourth moment bound generalizes Cantelli's inequality to give lower bounds on "tail" probabilities even when the mean is zero.
But what if you want P[X ≥ λ] ≥ ...?
I keep seeing code like torch.matmul(A, B) or matmul(A, B) in Hugging Face and other open-source libraries.
Why no love for the operator form, A @ B ?
@ccanonne_
Represent the sets as bit-strings. For each position you allow three cases: (0,0), (0,1), (1,1), but not (1,0). Each bit is independent so you get (3/4)^n.
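A quick Monte Carlo check of that (3/4)^n count, with the event being A ⊆ B for uniformly random sets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 8, 200_000
A = rng.integers(0, 2, size=(trials, n)).astype(bool)
B = rng.integers(0, 2, size=(trials, n)).astype(bool)
# A ⊆ B iff no position shows the forbidden pattern (1, 0)
subset = ~(A & ~B).any(axis=1)
est = subset.mean()              # should be ≈ (3/4)**n
```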
Definitely check out "The Probabilistic Method" by Joel Spencer and Noga Alon. The first four chapters really give a nice idea of the algorithmic stuff you can do with probability. The remaining chapters show you how wide the field is. Every proof is "from the book".
We know A⁻¹x can be computed in n²√κ time using the conjugate gradient method.
But what about other powers, like A⁰ᐧ⁵x?
Turns out they can be too! But it requires some pretty weird contour integrals and elliptic functions:
Being an Open Source maintainer in 2024:
User: My code doesn't work
Me: Where did you get this code from? That doesn't look like our API at all
User: GPT
Me: Do you mind reading our docs instead?
User: ...
Do you prefer your math formulas look like this, rather than long pages of incomprehensible matrix multiplications?
Then you might want to try tensorgrad:
It's a library for symbolic tensor manipulation and differentiation I just made! I'll post more
Alternatively, we can write out all the terms in the cubic formula (without biases) like this.
I wonder if there's some kind of series expansion rule in play here?
Did anybody try using genetic programming to improve LLM agents' prompts?
You let a bunch of them run with somewhat different prompts/rules/guidelines. Then combine the best pairs to form the next generation.
You could also just make mutations (asexual reproduction), that gives
Linear systems, Ax=b, can be solved in O(n²√κ) time (using Conjugate Gradient), so systems with a low condition number (κ) can be solved much faster than n^ω (matrix inversion or decomposition).
Can other linear algebra problems also be solved faster?
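For reference, a minimal textbook Conjugate Gradient in numpy (assumes A is symmetric positive-definite):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iters=None):
    """Textbook CG for symmetric positive-definite A: converges in
    O(√κ · log(1/ε)) iterations of O(n²) work each."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iters or 10 * n):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```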
I will be speaking today at the Sydney Algorithms Seminar on Tensor Sketching with applications to kernel tricks and neural network compression.
Join in at 11am Sydney time :-) it's cool stuff. Thanks to @ccanonne_ for organizing.
Yesterday I left @Meta (after being laid off in November and rehired in January) to join @NormalComputing.
Normal is kinda in stealth right now, but we'll share more soon! I'm excited to be working with @FarisSbahi, @ColesThermoAI, @remilouf, and the rest of the amazing team!
Has anyone seen this distribution?
It is the (numerically) closest distribution (in l1 norm) to uniform that can be written as the sum of two identically distributed random variables.
But is it known? Does it have a name? A family?
Making a synthetic dataset of mathematical proofs is hard! It's easy to make a whole lot of "1+1+1+...=491" style theorems.
I'm surprised this method of random construction and transformation finds so many classical geometric theorems.
Maybe because the domain is somewhat
@ellewoodsgolfs
@ProfRobAnderson
Of course it's a joke, but it makes a good point about a lot of faculty being severely underpaid.
If universities don't want to pay lecturers, maybe students need to start tipping.
@roydanroy
Even in 2020 we already had: Sparse Transformers (Child et al., 2019), Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), Longformer (Beltagy et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al.,
The Fast Hadamard transform is one of those algorithms that are "so simple it must be optimal". A perfect example of a recursive algorithm.
However, @firebat03 just showed that it can be improved using entirely non-simple techniques:
Some more notes, the model is quite small, 12 layers & 1,024 dim. This should be encouraging for researchers.
On the other hand, they don't just greedily decode the model, but use a beam search over 512 generations. Makes sense to extract the highest quality auxiliaries before
I'm fascinated by the idea of "bidirectional parsing":
It's something like a combination of a templating language and a parser. If we could use this for #dspy, the LLM could optimize its own prompting templates, and parsing outputs would come for free.
There are so many great "Named Tensors" libraries.
Why did none of them take off?
Named Tensor:
Tensor Shape Annotations:
Axis Arrays:
PyTorch's named tensors:
Claude 2.1’s 200K token context window is powerful, but requires careful prompting to use effectively.
Learn how to get Claude to recall an individual sentence across long documents with high fidelity:
Is it inconsistent how 𝚗𝚞𝚖𝚙𝚢.𝚍𝚒𝚊𝚐 behaves on vectors vs matrices?
Using tensor-diagrams, it actually makes a lot of sense!
And it's easy to generalize the behavior to arbitrary tensor sizes.
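The two behaviors side by side:

```python
import numpy as np

v = np.array([1, 2, 3])
M = np.arange(9).reshape(3, 3)

D = np.diag(v)   # vector -> matrix: v on the diagonal, zeros elsewhere
d = np.diag(M)   # matrix -> vector: extracts the diagonal
```

So `diag` is (close to) its own inverse: `np.diag(np.diag(v))` recovers `v`.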
"In math, when an author starts a sentence with 𝘤𝘭𝘦𝘢𝘳𝘭𝘺, what they are really saying is this seems clear to 𝘮𝘦, and I probably should have checked it, but I got a little confused, so I settle for just asserting that it was clear" -
Clearly, @JSEllenberg is on to me...
If you sample a random nxn matrix with IID entries from a normal distribution, you get that the condition number, κ, is roughly n:
lim_{n→∞} Pr[κ/n < x] = exp(−2/x − 2/x²)
For many other sub-gaussian distributions you seem to get similar CDFs, as shown by @AlanEdelmanMIT in
Llama 2 uses Grouped Query Attention (Ainslie et al.), which has the benefit of allowing multi-GPU parallelism (one key/value per GPU) while still allowing more queries than key/values, which increases throughput.
Better than my idea of just having all queries attend to all keys.
I played around with some Sparse Recovery algorithms to use for matrix recovery (this project: )
This is of course the famous idea by Terence Tao, Tropp and many others.
Let me try to explain the 4 most common algorithms, and code them in Python. 👇
New tool we made for visualizing thinking in LLMs, including Tree of Thoughts and Reflexion. Together these methods give state-of-the-art code generation and general problem solving.