Thomas Ahle

@thomasahle

6,127
Followers
563
Following
612
Media
3,683
Statuses

Head of ML @NormalComputing . Ex @Meta , @BARCdk , SupWiz, @OxfordQuantum . Tweets on Math, AI, #dspy , Probability, ML, Algorithms and Randomness. Recently tensors.

San Francisco
Joined September 2010
@thomasahle
Thomas Ahle
1 month
Sam Altman: Ilya is leaving to work on a small project that's personally meaningful to him. Ilya: Single Shot Superintelligence
@ilyasut
Ilya Sutskever
1 month
I am starting a new company:
2K
3K
31K
13
84
2K
@thomasahle
Thomas Ahle
2 years
I've been laid off from Meta. Our entire research/infra org "Probability" was cut. I deeply appreciate the people who helped me get there, and the great people I worked with for a year and a half. I hope to stay in the Bay Area a while longer, if anyone needs some algorithms.
40
64
1K
@thomasahle
Thomas Ahle
2 months
KANs (NNs with learned functions on the edges) have quite an elegant representation using Tensor Diagrams. This chart of MLP layers also shows some neat relationships between things like ReGLUs and MoEs.
Tweet media one
8
184
1K
@thomasahle
Thomas Ahle
1 year
Jax has a different definition of the identity matrix than I'm used to...
Tweet media one
27
52
899
@thomasahle
Thomas Ahle
3 months
Doing Matrix Calculus can be messy, especially when we need higher order derivatives. Writing them out using Tensor Diagrams makes even the Hessian Chain Rule relatively simple:
Tweet media one
15
123
838
@thomasahle
Thomas Ahle
10 months
It is a common misconception that LLMs are just trained to "predict the next token". No. They are trained to predict an entire context window's worth of tokens, like 4k+. The gradients go end to end and the model is allowed to plan what it will say next.
44
57
797
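A minimal sketch (not any particular model's training code) of why this is the case: the standard causal LM loss is computed at every position in the window, so a single backward pass supervises thousands of next-token predictions at once.

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 128, 1000      # real models use 4k+ positions
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)  # model output
tokens = torch.randint(vocab, (batch, seq_len))                  # input ids

# Shift by one so position t predicts token t+1; every position contributes.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),   # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),           # targets are the following tokens
)
loss.backward()   # gradients flow end to end through all positions at once
```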
@thomasahle
Thomas Ahle
2 months
I always found the tensor notation in Fast Matrix Multiplication algorithms confusing. But using tensor diagrams it's pretty easy to see what's going on:
Tweet media one
7
101
787
@thomasahle
Thomas Ahle
9 months
Lovely table of ways to compute e^A, the exponential of a matrix, by @nhigham
Tweet media one
12
91
690
@thomasahle
Thomas Ahle
9 months
Retrieval Augmented Generation is such a hack. 🔨 Why would an embedding of your prompt coincide in the embedding space with the documents needed to answer it? Meanwhile Transformers already have a key/query mechanism built in! 🔋 Can we just put all the key/value pairs into
@NormalComputing
Normal Computing 🧠🌡️
9 months
Extended Mind Transformers (EMTs) are a new approach to working with very large contexts and external data sources developed by @KlettPhoebe , @thomasahle , and Normal's AI team. Inspired by the Extended Mind Thesis, we modify Multihead Attention to directly query a vector database.
Tweet media one
5
43
220
20
73
604
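A rough sketch of the idea, not Normal Computing's actual Extended Mind implementation: let each query also attend to its top-k nearest key/value pairs retrieved from an external memory bank. Names, shapes, and the lack of causal masking below are illustrative only.

```python
import torch
import torch.nn.functional as F

def attention_with_external_memory(q, k, v, mem_k, mem_v, top_k=8):
    """q, k, v: (seq, dim) local projections; mem_k, mem_v: (n_mem, dim) memory bank."""
    # Retrieve the top_k external key/value pairs for each query token.
    scores = q @ mem_k.T                              # (seq, n_mem)
    idx = scores.topk(top_k, dim=-1).indices          # (seq, top_k)
    sel_k, sel_v = mem_k[idx], mem_v[idx]             # (seq, top_k, dim)

    # Each query attends over its local keys plus its retrieved memories.
    seq, dim = q.shape
    k_all = torch.cat([k.unsqueeze(0).expand(seq, -1, -1), sel_k], dim=1)
    v_all = torch.cat([v.unsqueeze(0).expand(seq, -1, -1), sel_v], dim=1)

    attn = F.softmax(
        (q.unsqueeze(1) @ k_all.transpose(1, 2)).squeeze(1) / dim ** 0.5, dim=-1
    )                                                  # (seq, seq + top_k)
    return (attn.unsqueeze(1) @ v_all).squeeze(1)      # (seq, dim)

q, k, v = (torch.randn(16, 32) for _ in range(3))
mem_k, mem_v = torch.randn(1000, 32), torch.randn(1000, 32)
out = attention_with_external_memory(q, k, v, mem_k, mem_v)
```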
@thomasahle
Thomas Ahle
2 months
Using 𝚝𝚘𝚛𝚌𝚑.𝚌𝚘𝚖𝚙𝚒𝚕𝚎 makes KANs as fast as MLPs! I never thought I would be a fan, but they are starting to look pretty appetizing.
Tweet media one
Tweet media two
10
76
509
@thomasahle
Thomas Ahle
3 months
The KAN hype has shown many people are thinking transformers still use MLPs‼️ However all the big models we know have switched to GLUs, such as Gemma (GeGLU), Llama (SwiGLU) and PaLM (SwiGLU). These "activation functions" actually take two linear projections and multiply them
Tweet media one
13
78
486
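A small sketch of the GLU-style feed-forward block these models use: two linear projections of the same input, one passed through a nonlinearity, multiplied elementwise. SwiGLU is shown; swapping silu for gelu gives GeGLU. The hidden size and bias-free linears are a common Llama-style convention, not any specific checkpoint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """GLU-style feed-forward: two projections of the input, multiplied."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # projection 1 (gated)
        self.w_up = nn.Linear(dim, hidden, bias=False)    # projection 2
        self.w_down = nn.Linear(hidden, dim, bias=False)  # back to model dim

    def forward(self, x):
        # swap F.silu for F.gelu to get GeGLU, or sigmoid for the original GLU
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(4, 512)
print(SwiGLU(512, 1376)(x).shape)   # torch.Size([4, 512])
```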
@thomasahle
Thomas Ahle
7 months
My understanding of AlphaGeometry: 1) Translate the problem statement to symbolic form. 2) Try to solve the problem with a symbolic solver. 3) If it didn't work, use a language model to suggest an "auxiliary point" somewhere, such as a midpoint, then go to (2). To teach the
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@GoogleDeepMind
Google DeepMind
7 months
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐 It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵
126
1K
4K
8
82
474
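A control-flow sketch of the loop described above; `translate`, `symbolic_solve`, and `suggest_auxiliary` are hypothetical stand-ins for DeepMind's components (parser, deduction engine, language model), passed in as callables rather than implemented here.

```python
def alpha_geometry(problem, translate, symbolic_solve, suggest_auxiliary,
                   max_rounds=10):
    """translate, symbolic_solve, suggest_auxiliary are hypothetical callables
    standing in for the real components described in the tweet above."""
    premises = translate(problem)                 # (1) problem -> symbolic form
    for _ in range(max_rounds):
        proof = symbolic_solve(premises)          # (2) try the symbolic solver
        if proof is not None:
            return proof
        # (3) the language model proposes an auxiliary construction
        # (e.g. "let M be the midpoint of AB"), then we go back to (2).
        premises = premises + [suggest_auxiliary(premises)]
    return None
```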
@thomasahle
Thomas Ahle
2 years
I was asked today about great 📚 books about 🎲 probability for students and practitioners interested in Algorithms and ML. These are some of my favorites that I keep coming back to 👇
5
76
418
@thomasahle
Thomas Ahle
13 days
@deedydas Well done US team!
Tweet media one
9
26
421
@thomasahle
Thomas Ahle
5 months
Needle In A Haystack tests are flawed. Did you know that the long-context evaluation of Gemini and GPT-4 is based on inserting the sentence “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day” at a random location in a text? We’ve seen
@KlettPhoebe
Phoebe Klett
5 months
Our recent experiments @NormalComputing demonstrate that all attention is not equal! XL-attention (large context windows) struggles with retrieval tasks well within context. Top-k attention (used in our Extended Mind approach) is cheap and effective.
Tweet media one
6
29
164
17
61
392
@thomasahle
Thomas Ahle
2 months
Kronecker products allow you to visualize the entirety of many recursive algorithms in simple tensor diagrams. Let's try to do this for the famous Fast Fourier Transform. But first, consider its easier cousin, the Hadamard: This is analyzed using the "master theorem" of
Tweet media one
Tweet media two
Tweet media three
Tweet media four
5
54
359
@thomasahle
Thomas Ahle
3 months
Anthropic's Mathematical Framework for Transformer Circuits is all about Kronecker Products; Because they give you a way to model parallel computation (transformer heads).
Tweet media one
@thomasahle
Thomas Ahle
3 months
The Kronecker Product in Linear Algebra is just a tensor product "flattened" on both sides. We can illustrate this with tensor diagrams, by defining the "flattening tensor", ▷ᵢⱼₖ=[i + j n = k]. Here the Matrix Cookbook section translated into diagram form:
Tweet media one
3
25
165
2
64
349
@thomasahle
Thomas Ahle
9 months
Complexity of Matrix Multiplication ≈ Complexity of Matrix Inversion. Why? Turns out there is a really neat trick to compute matrix multiplications using inverses: This can also be used with our "Fast Thermodynamic Matrix inversion": to get Fast
Tweet media one
Tweet media two
@ColesThermoAI
Patrick Coles
9 months
The prospect of using thermodynamic computers for AI applications and probabilistic reasoning is enticing. With linear algebra as a key first step, our theory paper showed asymptotic speedup (relative to digital methods) that scales linearly in dimension
Tweet media one
2
9
54
3
53
336
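One standard form of the reduction (the figure may present it differently): invert a block upper-triangular matrix and the product A·B appears in its top-right block, so fast inversion gives fast multiplication.

```python
import numpy as np

n = 4
A, B = np.random.randn(n, n), np.random.randn(n, n)
I, Z = np.eye(n), np.zeros((n, n))

M = np.block([[I, A, Z],
              [Z, I, B],
              [Z, Z, I]])

Minv = np.linalg.inv(M)                        # any fast inversion routine would do
assert np.allclose(Minv[:n, 2 * n:], A @ B)    # top-right block equals A @ B
```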
@thomasahle
Thomas Ahle
2 months
Exciting new paper: TextGrad: Automatic “Differentiation” via Text! This is a DSPy-like framework for optimizing prompts in a composite LLM system. However, there is one major difference! In DSPy the idea is (basically): - Forward pass: Each Module (LLM call) picks a
Tweet media one
Tweet media two
Tweet media three
16
63
308
@thomasahle
Thomas Ahle
9 months
In our recent NeurIPS paper we had to show the following cute inequality: For a real world application, ask all your friends to think of a number. Divide each number by the sum, and you'll get in expectation at least 1/n. This holds even if you give your friends weights. Seems
Tweet media one
Tweet media two
Tweet media three
@thomasahle
Thomas Ahle
9 months
Clustering the Sketch: Dynamic Compression for Embedding Tables Paper Website: ArXiv: Embedding tables are used by all machine learning systems that work with categorical data, like user IDs or word tokens. At Meta we had
Tweet media one
1
6
44
3
38
298
@thomasahle
Thomas Ahle
2 years
A frequently overlooked fact is that KMeans is simply a matrix factorization algorithm X ≈ HM, where H is limited to a single 1 per row. What if we allowed H to have 2 ones per row instead?👇
Tweet media one
12
40
302
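A quick numpy/scikit-learn sketch of this view: recover H as a one-hot assignment matrix and M as the centroid matrix, so H @ M replaces every point by its centroid, which is exactly the quantity k-means minimizes the distance to.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(1000, 16)
km = KMeans(n_clusters=8, n_init=10).fit(X)

M = km.cluster_centers_            # (8, 16): one centroid per row
H = np.eye(8)[km.labels_]          # (1000, 8): exactly one 1 per row

approx = H @ M                     # every point replaced by its centroid
print("reconstruction error:", np.linalg.norm(X - approx))
```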
@thomasahle
Thomas Ahle
5 months
@kaseyklimes We're trying
Tweet media one
Tweet media two
Tweet media three
Tweet media four
5
7
245
@thomasahle
Thomas Ahle
1 month
@karpathy Of all the algorithms you learn in CS, who would have thought Topological Sort would be the one to continuously spawn billion dollar companies?
1
13
218
@thomasahle
Thomas Ahle
16 days
Saying "Synthetic data generation is just GPT-4 distillation" is like saying "Writing textbooks is just Human distillation." A lot goes in to creating good synthetic datasets! The Self-Prompt technique by Li et al. is a cool example of generating a hard Q&A dataset. Let's walk
Tweet media one
5
30
193
@thomasahle
Thomas Ahle
26 days
Activation Functions differ drastically in how an MLP maps its inputs at initialization. Inspired by @kellerjordan0 () I initialized some "deep and wide" MLPs, and observed how the angles and norms of points develop through the network. Surprisingly, GeLU
Tweet media one
@kellerjordan0
Keller Jordan
30 days
I'm just starting to learn about this neural tangent kernel / NNGP / tensor programs stuff-- It was pretty amusing to find out that deep MLPs have nearly-constant output at initialization
Tweet media one
3
8
172
7
33
183
@thomasahle
Thomas Ahle
4 months
Memorizing a few linear stochastic matrix expressions is pretty useful. But what about higher order formulas? Things quickly get weird... 🧵1/4
Tweet media one
1
24
181
@thomasahle
Thomas Ahle
3 months
The "Trace" section of the The Matrix Cookbook translated into tensor diagrams. The nice thing about the diagrams is that they show why the formulas are true.
Tweet media one
8
11
175
@thomasahle
Thomas Ahle
11 months
Just got my first actual real samples back from our thermodynamic chip 🤯
Tweet media one
@MaxAifer
Max Aifer
1 year
Enter thermodynamic computing. In this preprint, Thermodynamic Linear Algebra (), we show that a system of coupled oscillators in contact with a heat reservoir can be used to solve linear systems in an amount of time proportional to the number of variables.
Tweet media one
11
79
401
3
26
169
@thomasahle
Thomas Ahle
3 years
@blekhman It's called the Prof. Dr. Style, and designers study it as one of the most authentic parts of the web:
4
18
169
@thomasahle
Thomas Ahle
3 months
The Kronecker Product in Linear Algebra is just a tensor product "flattened" on both sides. We can illustrate this with tensor diagrams, by defining the "flattening tensor", ▷ᵢⱼₖ=[i + j n = k]. Here the Matrix Cookbook section translated into diagram form:
Tweet media one
3
25
165
@thomasahle
Thomas Ahle
6 months
Trying to combine DSPy, Pydantic types and JSON Schemas
Tweet media one
8
12
155
@thomasahle
Thomas Ahle
8 months
When I started in NLP, word vectors were state of the art thinking. 🧠 In Oxford we were very much still doing grammar based computational linguistics, but I remember one talk that suggested a middle path.💡 The idea was that if a noun phrase is a vector, then an adjective must
@AravSrinivas
Aravind Srinivas
8 months
Tomas Mikolov, the OG and inventor of word2vec, gives his thoughts on the test of time award, the current state of NLP, and chatGPT. 🍿
Tweet media one
27
176
1K
2
17
152
@thomasahle
Thomas Ahle
3 years
I just learned about @jcvbcn and @HongxunWu 's beautiful algorithm for Subset Sum in Õ(n+t) time, where t is the target sum. Recall the classical dynamic programming solution (by Bellman 1957) takes O(n⋅t) time. How is this improvement possible? 1/n
Tweet media one
3
40
148
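For reference, Bellman's classical O(n·t) dynamic program looks like this; the Jin-Wu result gets to Õ(n+t) with very different machinery (sparse convolutions / FFT over sumsets).

```python
def subset_sum(nums, t):
    """Bellman's DP: O(n * t) time, O(t) space."""
    reachable = [True] + [False] * t      # reachable[s]: can some subset sum to s?
    for x in nums:
        for s in range(t, x - 1, -1):     # go downward so each x is used at most once
            if reachable[s - x]:
                reachable[s] = True
    return reachable[t]

assert subset_sum([3, 34, 4, 12, 5, 2], 9)       # 4 + 5
assert not subset_sum([3, 34, 4, 12, 5, 2], 30)
```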
@thomasahle
Thomas Ahle
5 months
We took a large file, made a copy with some edits, and asked Gemini Pro 1.5 to find the differences... It completely failed. Not only did it not find any of the differences, it returned a long list of completely hallucinated changes! This shows that long contexts, while good for
@mengdi_en
Mengdi Chen
5 months
Per Thomas's suggestion, I ran this experiment with @google Gemini Pro 1.5, the LLM with longest context as of today. I grabbed a lengthy gov document, made a copy with 10 random text edits. The result is... 100% hallucination, unfortunately.
Tweet media one
3
4
43
12
22
134
@thomasahle
Thomas Ahle
1 year
Relationship ended with Reviewer #2 . Now Reviewer #4 is my worst enemy. #NeurIPS2023
Tweet media one
6
1
133
@thomasahle
Thomas Ahle
15 days
@johann_josefy @Carnage4Life Have they ever closed a service and not given you a chance to download and migrate? If that ever happens, just buy an SSD and download to that 🤷‍♂️
8
0
134
@thomasahle
Thomas Ahle
2 months
SELU was a nice activation function because it preserved the mean and variance of its inputs, E[x]=0, E[x²]=1. However, it looked weird and didn't go to 0 as x → −∞. Just look at it: Can we fix this? Most people today use Swish and/or Gelu. (At least in major transformer
Tweet media one
Tweet media two
Tweet media three
Tweet media four
7
16
127
@thomasahle
Thomas Ahle
5 months
Can we use LLMs to discover better PyTorch programs? This weekend, @yaroslavvb challenged me to use an LLM to find a model that trains CIFAR-10 to 95% accuracy in 5s on an A100. (Like @kellerjordan0 did.) I didn't exactly succeed, but I did find some pretty fun programs. Let's
Tweet media one
1
14
124
@thomasahle
Thomas Ahle
2 years
@sharongoldman 19 people doing Bayesian Modeling, 9 people doing Ranking and Recommendations, 5 people doing ML Efficiency, 17 people doing AI for Chip Design and Compilers. Plus managers and such.
12
7
119
@thomasahle
Thomas Ahle
15 days
VLMs aren't blind: @danielcorin1 Changing the prompt from "How many times do the blue and red lines intersect?" to "How many times do the blue and red line plots cross each other?" increases the accuracy of Claude 3.5 Sonnet from 73% to 95%. Benchmarking
Tweet media one
Tweet media two
8
10
120
@thomasahle
Thomas Ahle
1 year
For those saying the "@" operator is hard to read, the PEP () has some pretty good examples to the contrary.
Tweet media one
@EdwardRaffML
Edward Raff
1 year
@thomasahle I disagree. “+” makes sense linguistically and mathematically in almost all uses. Same with “-“. “@“ does not linguistically read right, we read the common @ far more than weird python “@“. If it actually made sense, people would use it more.
2
0
7
14
7
115
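An example in the spirit of the PEP's (PEP 465) argument, paraphrased rather than quoted: the operator form reads like the underlying math, while the nested-call form does not.

```python
import numpy as np
from numpy.linalg import inv

H, V = np.random.randn(3, 5), np.eye(5)
beta, r = np.random.randn(5), np.zeros(3)

err = H @ beta - r
chi2_op = err @ inv(H @ V @ H.T) @ err                               # operator form

chi2_fn = np.dot(err, np.dot(inv(np.dot(np.dot(H, V), H.T)), err))   # call form
assert np.isclose(chi2_op, chi2_fn)
```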
@thomasahle
Thomas Ahle
2 months
Karger's Algorithm for Min Cut. Me trying to learn Manim.
6
9
117
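For readers who haven't seen it, a minimal sketch of Karger's contraction algorithm itself (independent of the Manim animation): repeatedly contract a uniformly random edge until two super-nodes remain, then repeat many trials to boost the probability of hitting the true min cut.

```python
import random

def karger_one_trial(edges, n_vertices):
    parent = list(range(n_vertices))          # union-find over super-nodes

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]     # path halving
            u = parent[u]
        return u

    remaining = n_vertices
    while remaining > 2:
        u, v = random.choice(edges)           # pick a uniformly random edge
        ru, rv = find(u), find(v)
        if ru != rv:                          # contract it (merge its endpoints)
            parent[ru] = rv
            remaining -= 1
    return sum(1 for u, v in edges if find(u) != find(v))   # crossing edges

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
print(min(karger_one_trial(edges, 5) for _ in range(100)))  # true min cut is 2
```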
@thomasahle
Thomas Ahle
1 year
Have you ever derived and coded backprop from scratch?
Yes
1858
No
1053
Show
555
73
6
111
@thomasahle
Thomas Ahle
3 months
Andrew's list for working with large context models: (1) Write quick, simple prompts (2) Iteratively, flesh out a mega-prompt (3) Few-shot or many-shot examples (4) Break into subtasks / agentic workflow I want to suggest an alternative "Eval Driven" workflow: (1) Write quick,
@AndrewYNg
Andrew Ng
3 months
This week, Google announced a doubling of Gemini Pro 1.5's input context window from 1 million to 2 million tokens, and OpenAI released GPT-4o, which generates tokens 2x faster and 50% cheaper than GPT-4 Turbo and natively accepts and generates multimodal tokens. I view these
77
587
3K
5
29
109
@thomasahle
Thomas Ahle
17 days
When I first drew Strassen's algorithm as a Tensor Diagram, I chickened out at the last step. But here I give you the complete diagram: "Strassen's Kringle".
Tweet media one
@thomasahle
Thomas Ahle
2 months
I always found the tensor notation in Fast Matrix Multiplication algorithms confusing. But using tensor diagrams it's pretty easy to see what's going on:
Tweet media one
7
101
787
3
21
111
@thomasahle
Thomas Ahle
1 year
Okay, maybe just don't use jax-metal quite yet.
@thomasahle
Thomas Ahle
1 year
@dan_p_simpson Ok, maybe this is just a bug in the new Metal backend.
Tweet media one
1
1
29
3
6
99
@thomasahle
Thomas Ahle
6 months
I did a new analysis of our Thermodynamic Linear Algebra algorithm, based on continuously integrating a simple stochastic differential equation. It's interesting that if you want to find x such that ‖Ax−b‖₂ < ε, you can do it in time O(d ε⁻²) on a machine like
Tweet media one
Tweet media two
4
19
98
@thomasahle
Thomas Ahle
10 months
This is why stuff like Medusa works: You can add extra decoder heads and predict multiple future tokens at once. Really it's crazy to think you could produce meaningful text if you only think about "the next token".
4
1
96
@thomasahle
Thomas Ahle
26 days
@Thom_Wolf I'm surprised the MuMath-Code paper wasn't more discussed on Twitter. It is awesome! Super dense in information. Only reference I found was: who incidentally tweets about a lot of really interesting papers that don't get much attention.
@gm8xx8
𝚐𝔪𝟾𝚡𝚡𝟾
3 months
MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning ↓
0
10
58
1
7
89
@thomasahle
Thomas Ahle
3 months
Basic Derivatives from the Matrix Cookbook in Tensor Diagram form.
Tweet media one
2
8
84
@thomasahle
Thomas Ahle
8 months
Just adding some data from our internal "reasoning under uncertainty" benchmarks to the debate of "Is ChatGPT Getting Worse?"
Tweet media one
6
11
83
@thomasahle
Thomas Ahle
3 years
New algorithm by Andoni and Beaglehole uses Multiplicative weight update to optimize Locality Sensitive Hashing for a given dataset. This gives a practical, yet robust solution to high-dimensional nearest neighbour search. 1/2
@arXiv__ml
Machine Learning | arXiv
3 years
#arXiv #machinelearning [cs.LG] Learning to Hash Robustly, with Guarantees. (arXiv:2108.05433v1 [cs.DS]) The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Local…
0
6
18
2
19
83
@thomasahle
Thomas Ahle
4 months
Fully Homomorphic Encryption with GPT 🥳
Tweet media one
Tweet media two
Tweet media three
3
7
79
@thomasahle
Thomas Ahle
1 year
@LongFormMath Roses are red, violets are blue, Calculus is the poetry that nature imbues, Integration by parts is a technique divine, It brings us closer to the truth, like two souls intertwined. 🤖
0
6
80
@thomasahle
Thomas Ahle
2 months
Here is a simple 3-vector function that should be linear memory, but is quadratic memory in torch or numpy: 𝚛𝚎𝚕𝚞(𝚡.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟶) + 𝚢.𝚞𝚗𝚜𝚚𝚞𝚎𝚎𝚣𝚎(𝟷)) @ 𝚣 Is there a trick to make this linear memory without using python loops, or writing a new cuda kernel?
8
3
79
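A small demonstration of the problem, plus a chunked workaround. Note the chunked version still uses a Python loop over blocks, so it only partially answers the question; it just caps the intermediate at O(chunk·n) memory.

```python
import torch

n = 4096
x, y, z = torch.randn(n), torch.randn(n), torch.randn(n)

# The offending expression: the intermediate is an n x n matrix.
out = torch.relu(x.unsqueeze(0) + y.unsqueeze(1)) @ z        # O(n^2) memory

def chunked(x, y, z, chunk=256):
    # Still a Python loop over blocks, so not a full answer to the question,
    # but peak memory drops to O(chunk * n).
    return torch.cat([
        torch.relu(x.unsqueeze(0) + ys.unsqueeze(1)) @ z
        for ys in torch.split(y, chunk)
    ])

assert torch.allclose(out, chunked(x, y, z), atol=1e-4)
```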
@thomasahle
Thomas Ahle
12 days
Taking the hessian of aᵀX²b: (1) using the perturbation method; (2) using index notation; and (3) using tensor diagrams:
Tweet media one
Tweet media two
Tweet media three
3
19
108
@thomasahle
Thomas Ahle
1 year
What is the hackiest MNIST classifier that gets > 10% accuracy? For example, taking the mean value of each image, and using a nearest centroid classifier, gives 22% accuracy.
20
5
75
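A hedged sketch of the baseline described above; the exact 22% figure will depend on the split and preprocessing.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One feature per image: its mean pixel intensity.
f_tr, f_te = X_tr.mean(axis=1), X_te.mean(axis=1)

# Nearest centroid in this 1-D feature space.
centroids = np.array([f_tr[y_tr == d].mean() for d in range(10)])
pred = np.abs(f_te[:, None] - centroids[None, :]).argmin(axis=1)
print("accuracy:", (pred == y_te).mean())      # well above the 10% chance level
```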
@thomasahle
Thomas Ahle
2 months
If you like the graphical Hessian Chain Rule, you may be interested in @yaroslavvb 's question (here ) on how to actually compute it most efficiently. This relates to the Matrix Chain Problem, which has a nice dynamic programming solution, but for sums of
Tweet media one
@thomasahle
Thomas Ahle
3 months
Doing Matrix Calculus can be messy, especially when we need higher order derivatives. Writing them out using Tensor Diagrams makes even the Hessian Chain Rule relatively simple:
Tweet media one
15
123
838
0
10
75
@thomasahle
Thomas Ahle
11 months
This week I trained an 800K transformer to learn 5 digit multiplication. Then I replaced "xxx*yyy=?" with "xxx*yyy_____=?", giving the model 5 extra tokens for "computation". Now the model quickly learned 7 digit multiplication. It's a nice trick.
Tweet media one
@mathemagic1an
Jay Hack
1 year
"Chain of thought" allows model to "think" more, or expend more FLOPS, thereby improving performance Does this imply that giving LLMs large amounts of padding tokens will improve performance as well? 🤔 Also forces increased FLOPs in computing the answer.
Tweet media one
16
12
90
4
5
74
@thomasahle
Thomas Ahle
1 year
The fourth moment bound generalizes Cantelli's inequality to give lower bounds on "tail" probabilities even when the mean is zero. But what if you want P[X ≥ λ] ≥ ...?
Tweet media one
5
4
70
@thomasahle
Thomas Ahle
1 year
And using .eye doesn't help.
Tweet media one
2
1
71
@thomasahle
Thomas Ahle
1 year
@BlackHC @kgourg Yeah, didn't find the review particularly meaningful... (Reposted to hide submission id)
Tweet media one
9
4
72
@thomasahle
Thomas Ahle
1 year
Fine, I'll just make a zero array and assign 1 to the diagonal, right? Nope.
Tweet media one
3
0
67
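For context, on the standard JAX backends the idiomatic way to do this is the functional .at[...].set(...) update, since arrays are immutable; this is orthogonal to the jax-metal bug the thread is about.

```python
import jax.numpy as jnp

n = 3
idx = jnp.arange(n)
eye = jnp.zeros((n, n)).at[idx, idx].set(1.0)   # functional "in-place" update
assert jnp.allclose(eye, jnp.eye(n))
```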
@thomasahle
Thomas Ahle
1 year
I keep seeing code like torch.matmul(A, B) or matmul(A, B). In Hugging Face and other open source libraries. Why no love for the operator form, A @ B ?
12
3
65
@thomasahle
Thomas Ahle
2 years
@ccanonne_ Represent the sets as bit-strings. For each position you allow three cases: (0,0), (0,1), (1,1), but not (1,0). Each bit is independent so you get (3/4)^n.
4
0
66
@thomasahle
Thomas Ahle
2 years
Definitely check out "The Probabilistic Method" by Joel Spencer and Noga Alon. The first four chapters really give a nice idea of the algorithmic stuff you can do with probability. The remaining chapters show you how wide the field is. Every proof is "from the book".
Tweet media one
2
3
65
@thomasahle
Thomas Ahle
1 year
We know A⁻¹x can be computed in n²√κ time using the conjugate gradient method. But what about other powers, like A⁰ᐧ⁵x? Turns out that yes! But it requires some pretty weird contour integrals and elliptic functions:
4
5
63
@thomasahle
Thomas Ahle
5 months
Being an Open Source maintainer in 2024: User: My code doesn't work Me: Where did you get this code from? That doesn't look like our API at all User: GPT Me: Do you mind reading our docs instead? User: ...
4
3
62
@thomasahle
Thomas Ahle
4 months
Do you prefer your math formulas look like this, rather than long pages of incomprehensible matrix multiplications? Then you might want to try tensorgrad: It's a library for symbolic tensor manipulation and differentiation I just made! I'll post more
@thomasahle
Thomas Ahle
4 months
Alternatively, we can write out all the terms in the cubic formula (without biases) like this. I wonder if there's some kind of series expansion rule in play here?
Tweet media one
2
1
9
3
10
63
@thomasahle
Thomas Ahle
7 months
Did anybody try using genetic programming to improve LLM Agent's prompts? You let a bunch of them run with somewhat different prompts/rules/guidelines. Then combine the best pairs to form the next generation. You could also just make mutations (asexual reproduction), that gives
18
8
59
@thomasahle
Thomas Ahle
3 months
Many-shot in-context learning is the new fine-tuning. And #DSPy is the framework to make it fun.
Tweet media one
3
6
58
@thomasahle
Thomas Ahle
1 year
Great note by @yaroslavvb on Tensor Networks and Autodiff: It includes the only understandable "Chain Rule for Hessians" I've ever seen.
Tweet media one
1
11
57
@thomasahle
Thomas Ahle
3 months
Some basic rules for Tensor Diagram simplification, illustrated as graphs
Tweet media one
Tweet media two
Tweet media three
5
8
56
@thomasahle
Thomas Ahle
1 year
Linear systems, Ax=b, can be solved in O(n²√κ) time (using Conjugate gradient), so matrices with low condition number (κ) can be solved much faster than n^ω (matrix inversion or decomposition). Can other linear algebra problems also be solved faster?
3
5
56
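A small SciPy sketch of the point: for a well-conditioned SPD system, conjugate gradients converges in a number of iterations that scales with √κ, each iteration costing one matrix-vector product (so roughly n²√κ work for a dense matrix).

```python
import numpy as np
from scipy.sparse.linalg import cg

n = 500
Q, _ = np.linalg.qr(np.random.randn(n, n))
A = Q @ np.diag(np.linspace(1.0, 10.0, n)) @ Q.T   # SPD with condition number 10
b = np.random.randn(n)

x, info = cg(A, b)                 # converges in ~O(sqrt(kappa)) iterations
print(info, np.linalg.norm(A @ x - b))
```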
@thomasahle
Thomas Ahle
3 years
I will be speaking today at the Sydney Algorithms Seminar on Tensor Sketching with applications to kernel tricks and neural network compression. Join in at 11am Sydney time :-) it's cool stuff. Thanks to @ccanonne_ for organizing.
Tweet media one
3
7
52
@thomasahle
Thomas Ahle
9 months
2-Dimensional random walks eventually always return... But it takes a long time.
Tweet media one
@fermatslibrary
Fermat's Library
9 months
Probability of returning to the origin in a random walk:
1D → P = 1
2D → P = 1
3D → P ≈ 0.34
Large D → P ≈ 1/(2D)
Tweet media one
76
229
3K
2
1
54
@thomasahle
Thomas Ahle
1 year
Yesterday I left @Meta (after being laid off in November and rehired in January) to join @NormalComputing . Normal is kinda in stealth right now, but we'll share more soon! I'm excited to be working with @FarisSbahi , @ColesThermoAI , @remilouf and the rest of the amazing team!
4
1
53
@thomasahle
Thomas Ahle
5 months
DSPy supports arbitrary functional validators.
Tweet media one
3
4
55
@thomasahle
Thomas Ahle
2 years
Has anyone seen this distribution? It is the (numerically) closest distribution (in l1 norm) to uniform that can be written as the sum of two identically distributed random variables. But is it known? Does it have a name? A family?
Tweet media one
7
1
51
@thomasahle
Thomas Ahle
7 months
Making a synthetic dataset of mathematical proofs is hard! It's easy to make a whole lot of "1+1+1+...=491" style theorems. I'm surprised this method of random construction and transformation finds so many classical geometric theorems. Maybe because the domain is somewhat
@thtrieu_
trieu
7 months
Proud of this work. Here's my 22min video explanation of the paper:
39
156
775
4
5
50
@thomasahle
Thomas Ahle
2 years
@ellewoodsgolfs @ProfRobAnderson Of course it's a joke, but it makes a good point about a lot of faculty being severely underpaid. If universities don't want to pay lecturers, maybe students need to start tipping.
1
0
49
@thomasahle
Thomas Ahle
5 months
Everyone's so excited about RingAttention, but what happened to the other 100 or so "linear attention" papers that have come out since 2019?
@thomasahle
Thomas Ahle
5 months
@roydanroy Even in 2020 we already had: Sparse Transformers (Child et al., 2019), Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020), Longformer (Beltagy et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al.,
0
0
10
6
1
49
@thomasahle
Thomas Ahle
18 days
All of the AIMO top 4 solutions now have write-ups on Kaggle:
Tweet media one
2
5
48
@thomasahle
Thomas Ahle
3 years
The Fast Hadamard transform is one of those algorithms that are "so simple it must be optimal". A perfect example of a recursive algorithm. However, @firebat03 just showed that it can be improved using entirely non-simple techniques:
Tweet media one
4
10
47
@thomasahle
Thomas Ahle
7 months
Some more notes, the model is quite small, 12 layers & 1,024 dim. This should be encouraging for researchers. On the other hand, they don't just greedily decode the model, but use a beam search over 512 generations. Makes sense to extract the highest quality auxiliaries before
1
0
47
@thomasahle
Thomas Ahle
4 months
I'm fascinated by the idea of "bidirectional parsing": It's something like a combination of a templating language and a parser. If we could use this for #dspy , the LLM could optimize its own prompting templates, and parsing outputs would come for free.
Tweet media one
0
3
48
@thomasahle
Thomas Ahle
3 years
Here is another fun moment inequality to prove: It should hold for integer random variables X and n ≥ 1.
Tweet media one
7
4
48
@thomasahle
Thomas Ahle
1 month
There are so many great "Named Tensors" libraries. Why did none of them take off? Named Tensor: Tensor Shape Annotations: Axis Arrays: Pytorch's named tensors:
10
7
47
@thomasahle
Thomas Ahle
2 months
Fun meeting some fellow #DSPy heads, @CShorten30 and @thomastjoshi , at the Compound AI Systems Workshop!
Tweet media one
2
8
47
@thomasahle
Thomas Ahle
8 months
This is a wild difference.
Tweet media one
@AnthropicAI
Anthropic
8 months
Claude 2.1’s 200K token context window is powerful, but requires careful prompting to use effectively. Learn how to get Claude to recall an individual sentence across long documents with high fidelity:
Tweet media one
40
202
1K
0
1
47
@thomasahle
Thomas Ahle
9 months
Clustering the Sketch: Dynamic Compression for Embedding Tables Paper Website: ArXiv: Embedding tables are used by all machine learning systems that work with categorical data, like user IDs or word tokens. At Meta we had
Tweet media one
1
6
44
@thomasahle
Thomas Ahle
5 months
My real gripe with Needle In A Haystack is that there are much better things to do in San Francisco. 🌉🌄
0
2
43
@thomasahle
Thomas Ahle
1 month
Is it inconsistent how 𝚗𝚞𝚖𝚙𝚢.𝚍𝚒𝚊𝚐 behaves on vectors vs matrices? Using tensor-diagrams, it actually makes a lot of sense! And it's easy to generalize the behavior to arbitrary tensor sizes.
Tweet media one
1
3
43
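The behavior in question, for reference: numpy.diag is two different maps depending on input rank.

```python
import numpy as np

v = np.array([1, 2, 3])
print(np.diag(v))            # vector in  -> 3x3 diagonal matrix out
M = np.arange(9).reshape(3, 3)
print(np.diag(M))            # matrix in  -> vector of diagonal entries [0 4 8]
print(np.diag(np.diag(M)))   # round trip keeps only the diagonal part of M
```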
@thomasahle
Thomas Ahle
3 years
"In math, when an author starts a sentence with 𝘤𝘭𝘦𝘢𝘳𝘭𝘺, what they are really saying is this seems clear to 𝘮𝘦, and I probably should have checked it, but I got a little confused, so I settle for just asserting that it was clear" - Clearly, @JSEllenberg is on to me...
Tweet media one
5
3
43
@thomasahle
Thomas Ahle
9 months
If you sample a random nxn matrix with IID entries from a normal distribution, you get that the condition number, κ, is roughly n: lim_{n→∞} Pr[κ/n < x] = exp(−2/x − 2/x²) For many other sub-gaussian distributions you seem to get similar CDFs, as shown by @AlanEdelmanMIT in
Tweet media one
2
5
41
@thomasahle
Thomas Ahle
1 year
Llama 2 uses Grouped Query Attention (Ainslie et al.), which has the benefit of allowing multi-GPU parallelism (one key/value per GPU) while still allowing more queries than key/values, which increases throughput. Better than my idea of just having all queries attend to all keys.
Tweet media one
3
9
41
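A shape-level sketch of grouped-query attention (illustrative sizes, not Llama 2's actual configuration): a few key/value heads are shared across groups of query heads by broadcasting them before the attention product.

```python
import torch

batch, seq, d_head = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2                  # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)
v = torch.randn(batch, n_kv_heads, seq, d_head)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)         # broadcast KV heads to each group
v = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
out = attn @ v                                # (batch, n_q_heads, seq, d_head)
```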
@thomasahle
Thomas Ahle
1 year
I played around with some Sparse Recovery algorithms to use for matrix recovery (this project: ) This is of course the famous idea by Terence Tao, Tropp and many others. Let me try to explain the 4 most common algorithms, and code them in Python. 👇
Tweet media one
@thomasahle
Thomas Ahle
2 years
A frequently overlooked fact is that KMeans is simply a matrix factorization algorithm X ≈ HM, where H is limited to a single 1 per row. What if we allowed H to have 2 ones per row instead?👇
Tweet media one
12
40
302
1
7
41
@thomasahle
Thomas Ahle
9 months
New tool we made for visualizing thinking in LLMs, including Tree of Thoughts and Reflexion. Together these methods give state of the art code generation and general problem solving.
@ArunPatro
arun @ nyu
9 months
So, today, we introduce Branches, a tool for prototyping and visualizing such advanced LLM reasoning and planning algorithms.
2
24
75
1
6
41
@thomasahle
Thomas Ahle
5 months
Testing Ollama with the new DSPy typed API. Both classical and functional/tanuki style.
Tweet media one
Tweet media two
5
5
41