Gavin Uberti Profile
Gavin Uberti

@UbertiGavin

3,108
Followers
190
Following
7
Media
105
Statuses

Building model-specific AI chips @ Etched

Menlo Park, CA
Joined March 2022
@UbertiGavin
Gavin Uberti
4 months
We’re taking the biggest bet in AI - a chip that can only run transformers, but does so orders of magnitude faster than GPUs. Maybe attention *is* all you need…
@Etched
Etched
4 months
Meet Sohu, the fastest AI chip of all time. With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s. Sohu is the first specialized chip (ASIC) for transformer models. By specializing,
[image attached]
342
1K
6K
42
38
639
@UbertiGavin
Gavin Uberti
20 days
It’s only reasoning if it’s from the Reasonique region of the human brain. Otherwise it’s just a sparkling stochastic parrot.
@aidan_mclau
Aidan McLau
20 days
an efficient market would see apple's stock down 10% after publishing this
[image attached]
62
19
681
8
3
80
@UbertiGavin
Gavin Uberti
7 months
The 21st century will be the most important century ever for humanity, thanks to the rapid advances in artificial intelligence. It makes no sense to sit on the sidelines in university. I'm excited to join the latest class of Thiel Fellows
@thielfellowship
Thiel Fellowship
7 months
Welcome to the Fellowship!
35
41
664
8
1
68
@UbertiGavin
Gavin Uberti
2 months
Great thread walking through some principles of running LLMs on AI chips, written by an expert in the space
@dzhulgakov
Dmytro Dzhulgakov
2 months
Congrats to Cerebras on the impressive results! How do SRAM-only ASICs like it stack up against GPUs? Spoiler: GPUs still rock for throughput, custom models, large models and prompts (common "prod" things). SRAM ASICs shine for pure generation. Long 🧵
3
14
125
3
3
29
@UbertiGavin
Gavin Uberti
4 months
@cHHillee Hey Horace! Big fan of your work on PyTorch. In the blog post on our website (), we outline exactly how our benchmark is calculated:
> Benchmarks are for Llama-3 70B in FP8 precision: no sparsity, 8x model parallel, 2048 input/128 output lengths
3
0
27
@UbertiGavin
Gavin Uberti
7 months
Llama-3 400B is a dense model. From the Dwarkesh podcast with Zuckerberg:
[image attached]
1
5
19
@UbertiGavin
Gavin Uberti
4 months
@cis_female You could in theory stuff *way more* than 5x the FLOPS on the chip. It only takes around 10k transistors to build an FP16 fused-multiply-accumulate circuit that can run at 2 GHz (i.e. 4 GFLOPS)
2
0
15
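A back-of-the-envelope sketch of the arithmetic above, in Python. The ~10k transistors per FP16 FMA and the 2 GHz clock come from this tweet; the 80-billion-transistor H100 figure is cited in a later reply in the same thread. The result is a theoretical ceiling for intuition only, not a buildable design.

# Rough ceiling: how much FP16 compute could 80B transistors hold if they were all FMAs?
transistors_per_fma = 10_000      # figure from the tweet
clock_hz = 2e9                    # 2 GHz
flops_per_cycle = 2               # one fused multiply-add counts as 2 FLOPs

h100_transistors = 80e9           # H100 transistor count cited in the thread
fma_units = h100_transistors / transistors_per_fma      # ~8 million units
peak_flops = fma_units * clock_hz * flops_per_cycle     # ~32 PFLOPS

print(f"{fma_units:.0f} FMA units, {peak_flops / 1e15:.0f} PFLOPS ceiling")
# Far above what any real chip delivers, which is the tweet's point:
# only a small fraction of a GPU's transistors are arithmetic.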
@UbertiGavin
Gavin Uberti
7 months
Really glad to see folks finally changing the RoPE hyperparams
@marvinvonhagen
Marvin von Hagen
7 months
Mistral just announced at @SHACK15sf that they will release a new model today: Mistral 7B v0.2 Base Model
- 32k instead of 8k context window
- Rope Theta = 1e6
- No sliding window
[image attached]
26
126
785
0
0
12
@UbertiGavin
Gavin Uberti
9 months
@swyx @felix_red_panda @GroqInc They split each layer across 8 chips, and Llama-2-70B has 64 layers, which gets you to 512 chips. The remaining rack of 64 is for other overhead (e.g. de-embedding).
1
0
13
@UbertiGavin
Gavin Uberti
22 days
Shoutout to @VictorStinner and Mark Dickinson for adding the fma function into Python 3.13, which is really helpful for my use cases. Thank you!
1
0
10
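A minimal usage sketch of the function being thanked here, assuming Python 3.13+ where math.fma is available:

import math

# math.fma(x, y, z) evaluates x*y + z with a single rounding,
# so the intermediate product is never rounded on its own.
x = 1.0 + 2.0 ** -52                     # slightly above 1.0
residual = math.fma(x, x, -(x * x))      # exact rounding error of x*x
print(residual)                          # small nonzero value (2**-104)
print(x * x - x * x)                     # 0.0 -- naive arithmetic loses it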
@UbertiGavin
Gavin Uberti
4 months
@cHHillee We use this benchmark (2048/128 sequences) because it's what NVIDIA uses in their comparisons to AMD: . Each of our chips is a reticle-sized 4nm die, and we put eight in each server. Would love to chat more in person!
3
0
8
@UbertiGavin
Gavin Uberti
4 months
@cis_female you're not wrong - it is really hard (power consumption is also a huge issue). We're only able to make it work by being so specifically focused on transformers
0
1
8
@UbertiGavin
Gavin Uberti
7 months
#LLaMA3 looks awesome. Whether the 400B model will be better than GPT-4 and Opus isn't clear, but either way these are very impressive results.
[image attached]
1
1
8
@UbertiGavin
Gavin Uberti
8 months
Google open-sourced their Gemma models today, hyperparameters below. Both models have massive feed-forward hidden dimensions - almost every other model uses 3.5-4x the d_model (which would be 8192 and 12288). Not sure why the change.
[image attached]
0
0
6
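A small sketch of the comparison the tweet is making. The 3.5-4x convention and the 8192/12288 figures come from the tweet itself; the implied d_model values (2048 and 3072) are back-solved from those numbers, and the 8x "widening factor" used for contrast is borrowed from the later Grok reply in this timeline, not from the screenshot.

# Conventional vs. wide feed-forward hidden sizes for the two Gemma model widths
d_models = [2048, 3072]               # implied by the tweet's 8192 / 12288 at 4x

for d_model in d_models:
    conventional = [int(r * d_model) for r in (3.5, 4.0)]   # what most models use
    wide = 8 * d_model                                      # Grok-style widening factor
    print(f"d_model={d_model}: conventional {conventional}, 8x widening {wide}")
# d_model=2048: conventional [7168, 8192], 8x widening 16384
# d_model=3072: conventional [10752, 12288], 8x widening 24576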
@UbertiGavin
Gavin Uberti
9 months
Google claims that Gemini 1.5 can remember *10 million* tokens at once with near perfect accuracy. How is that even possible? If you read the paper closely, two figures give it away. A 🧵:
[image attached]
1
0
6
@UbertiGavin
Gavin Uberti
8 months
@_akhaliq Interesting how Grok chose a widening factor of 8, instead of the more conventional 4. Between that and the large vocab, it’s a lot like Gemma.
0
0
5
@UbertiGavin
Gavin Uberti
4 months
@Tim_Dettmers Don't take our word for it - here are NVIDIA's official benchmarks for the 8xH100 and 8xH200 running sequences of 2048 input / 128 output tokens:
8xH100: 22,735 input + output tokens/sec
8xH200: 24,393 input + output tokens/sec
4
0
6
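For context, a quick consistency check between these NVIDIA figures and the "one 8xSohu server replaces 160 H100s" claim in the Sohu announcement quoted at the top of this page. The 20-server conversion is my arithmetic, not a figure from either tweet.

# Does the 160x H100 comparison line up with NVIDIA's own 8xH100 number?
h100_8x_tok_per_sec = 22_735          # NVIDIA figure quoted above (2048 in / 128 out)
sohu_8x_tok_per_sec = 500_000         # claim from the Sohu announcement

servers_of_8 = 160 / 8                                     # 160 H100s = 20 8xH100 servers
h100_160_tok_per_sec = servers_of_8 * h100_8x_tok_per_sec  # ~454,700 tokens/sec

print(f"160 H100s ~ {h100_160_tok_per_sec:,.0f} tok/s vs Sohu claim {sohu_8x_tok_per_sec:,}")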
@UbertiGavin
Gavin Uberti
8 months
@AravSrinivas tanh(x) is a classic sigmoid activation function. This is computing 30 tanh(x/30), a scaled version of it (that’s also a sigmoid). In theory, it makes softmax “weaker” and lets it choose more values if many keys match the query well.
1
0
6
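A tiny numerical illustration of the scaled tanh described above, using NumPy; the example logits are made up.

import numpy as np

def softcap(logits, cap=30.0):
    # 30 * tanh(x / 30): roughly identity for |x| << 30, saturates near +/-30
    return cap * np.tanh(logits / cap)

def softmax(x):
    z = np.exp(x - x.max())          # standard max-subtraction for stability
    return z / z.sum()

logits = np.array([50.0, 45.0, 1.0])     # two keys match the query very well
print(softmax(logits))                   # first key dominates
print(softmax(softcap(logits)))          # capping narrows the gap between the top two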
@UbertiGavin
Gavin Uberti
8 months
Stable Diffusion 3 is a transformer 👀
@EMostaque
Emad
8 months
@StabilityAI Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- This takes advantage of transformer improvements & can not only scale further but accept multimodal inputs…
- More technical details soon
30
45
667
0
0
5
@UbertiGavin
Gavin Uberti
4 months
@Tim_Dettmers From NVIDIA’s TRT-LLM performance page: > The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages), and shows the throughput client-server scenario under maximum load.
1
0
5
@UbertiGavin
Gavin Uberti
28 days
@sytelus Transformers seem to be the first capitalist model architecture - throwing more money/compute at them keeps making them better
1
0
5
@UbertiGavin
Gavin Uberti
4 months
@cHHillee NVIDIA gives the same benchmark, apples-to-apples, on their TRT-LLM performance page ()
1
0
5
@UbertiGavin
Gavin Uberti
4 months
@imbue_ai This is a great write-up. So much of this knowledge is private - connecting 500 servers together is a real feat of engineering!
0
0
5
@UbertiGavin
Gavin Uberti
5 months
@fchollet Strong disagree - I think LLM generation + symbolic checking will be enough to win the ARC-AGI challenge. A few folks have already offered you some action, but I’d happily bet money on this
0
0
4
@UbertiGavin
Gavin Uberti
4 months
@Tim_Dettmers Source: We chose 2048 input / 128 output because it's the metric NVIDIA uses when comparing to AMD.
2
0
4
@UbertiGavin
Gavin Uberti
7 months
Particularly interesting is the fact they kept the same architecture as Llama-2, with the exception of the vocabulary. They looked into using an MoE model, but dense stuff just keeps winning.
0
2
4
@UbertiGavin
Gavin Uberti
9 months
@darrenangle The biggest clue is the fact that the context size (limited by available memory) is different for audio, video, and images. This means they’re storing the inputs in memory and rerunning them.
0
0
2
@UbertiGavin
Gavin Uberti
4 months
@cis_female (this is simplified a bit. Counting transistors isn't a great proxy since certain parts of an H100 like the SRAM memories are denser, so area numbers should be thought of in NAND gate equivalents. The 10k transistors per FMA is amortized since multi-input FMA blocks are denser)
1
1
3
@UbertiGavin
Gavin Uberti
5 months
@Suhail @finbarrtimbers @Etched @MatXComputing 50% cheaper is probably enough to grind down NVIDIA's margins. See AMD's MI300X - because of competition, NVIDIA barely raised the price of the B100/B200 even though the BOM cost is much higher.
1
0
3
@UbertiGavin
Gavin Uberti
4 months
@Tim_Dettmers Like the “offline” category in MLPerf. Our figures are under the exact same conditions.
1
0
3
@UbertiGavin
Gavin Uberti
7 months
@yacineMTB dense is all you need
0
0
2
@UbertiGavin
Gavin Uberti
4 months
@teortaxesTex This bet lets you remove a ton of control logic, but it also lets you remove a *ton* of SRAMs from the chip. A transformer has no temporal or spatial locality for the weights (each is used exactly once per token), so there is no point in having bulky L0 or L1 caches, for example.
0
0
3
@UbertiGavin
Gavin Uberti
7 months
@bonniesjli @czhu1729 Triple Fellows! Congrats to Rob as well
1
0
3
@UbertiGavin
Gavin Uberti
2 months
@deepfates The recipe for AGI is simple
0
0
3
@UbertiGavin
Gavin Uberti
5 months
@juberti Congratulations!
0
0
3
@UbertiGavin
Gavin Uberti
4 months
@cis_female However, any flexible/reprogrammable AI accelerator isn't able to "just add more FLOPS" since you need a *lot* of circuitry inside the tensor cores next to the FMA blocks to keep them fed.
1
0
2
@UbertiGavin
Gavin Uberti
4 months
@ComputingByArts Just for transformers, and just for inference. Sohu is almost as specialized as it gets.
1
0
2
@UbertiGavin
Gavin Uberti
7 months
@ai_for_success The blog post doesn’t take into account the way the size of the video frame affects the amount of compute needed. See the paper by Dehghani et al. Even if the assumptions in the blog post were right, the amount of compute needed is 2-10x more.
1
0
2
@UbertiGavin
Gavin Uberti
4 months
@davisblalock @Tim_Dettmers Yep, from that page. The 1441 tokens/sec for FP8 Llama is the number of output tokens. We follow the convention of reporting input + output tokens. To convert, you multiply by (2048 + 128)/128 = 17
0
0
2
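A one-liner version of the conversion described above; the 1441 output-tokens/sec figure is the one quoted in the tweet.

# Convert NVIDIA's output-only tokens/sec into input + output tokens/sec
input_len, output_len = 2048, 128
output_tok_per_sec = 1441                       # FP8 Llama figure quoted in the tweet

factor = (input_len + output_len) / output_len  # = 17
total_tok_per_sec = output_tok_per_sec * factor
print(factor, total_tok_per_sec)                # 17.0, ~24,497 input + output tokens/sec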
@UbertiGavin
Gavin Uberti
5 months
@Suhail @finbarrtimbers @Etched @MatXComputing But to actually win the market, any company/startup must be 10x faster imo. There aren't many ways to achieve that
1
0
2
@UbertiGavin
Gavin Uberti
4 months
@cis_female (this is simplified. Counting transistors isn't really a good proxy since certain parts of an H100 like the SRAM memories are denser, so area numbers should be thought of in NAND gate equivalents. The 10k transistors per FMA is amortized since multi-input FMA blocks are denser)
0
0
2
@UbertiGavin
Gavin Uberti
4 months
@appenz @Tim_Dettmers The average OpenOrca prompt (system_prompt + question) is under 300 tokens, which means attention for this benchmark requires much less memory bandwidth than for 2048 input/128 output.
0
0
2
@UbertiGavin
Gavin Uberti
4 months
@Tim_Dettmers Big fan of your work by the way! Would love to chat sometime
0
0
2
@UbertiGavin
Gavin Uberti
5 months
@jtvhk @Suhail @finbarrtimbers @Etched Being only for transformers gets you a lot more than that! It makes the market narrower, but you have to make some trade-off to get >10x.
0
0
2
@UbertiGavin
Gavin Uberti
1 year
@WelcomeAI_ Exciting times ahead!
0
1
2
@UbertiGavin
Gavin Uberti
8 months
@danielhanchen @grok This is a different kind of trick than exp(x - max(x)) as the latter is mathematically equivalent, but the tanh trick isn’t. It seems more intended to “soften” softmax even further than to increase numerical stability.
1
0
1
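A quick NumPy check of the distinction drawn above: subtracting the max leaves softmax unchanged, while the tanh cap does not.

import numpy as np

def softmax(x):
    z = np.exp(x)
    return z / z.sum()

x = np.array([50.0, 45.0, 1.0])
print(np.allclose(softmax(x), softmax(x - x.max())))          # True: same distribution
print(np.allclose(softmax(x), softmax(30 * np.tanh(x / 30)))) # False: the cap changes it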
@UbertiGavin
Gavin Uberti
7 months
@mercor_ai Great article!
0
0
2
@UbertiGavin
Gavin Uberti
7 months
@ai_for_success This is completely fake. The 5 minutes of video per hour per H100 is a total guess, made by arbitrarily assuming Sora is 30x larger than DiT. It also ignores the fact that the videos generated by Sora aren’t 512px by 512px.
1
0
2
@UbertiGavin
Gavin Uberti
4 months
@cis_female Check the math yourself - since there are 80 billion transistors on an H100, only a tiny fraction are for compute.
1
0
2
@UbertiGavin
Gavin Uberti
4 months
@appenz @Tim_Dettmers Good question! That figure is a different benchmark (OpenOrca responses, which have ~300 token prompts), and is being run on a different system (their custom thermal solution (CTS), where they replace the heatsinks to enable the GPU to run a little bit faster).
1
0
2
@UbertiGavin
Gavin Uberti
9 months
(We know Gemini 1.5 runs on TPU V4s, which have 128x128 BF16 matrix multipliers. This lets us reasonably assume they’re running MQA with d_head = 128 in BF16, just like they explicitly state in their PaLM 540B paper)
1
0
2
@UbertiGavin
Gavin Uberti
9 months
We have to guess the number of layers in Gemini 1.5, but we don’t have to guess anything else. Let’s assume it’s 80, because the previous public DeepMind models (Gopher and Chinchilla) used 80 layers.
1
0
1
@UbertiGavin
Gavin Uberti
4 months
@cis_female But this still gives some intuition for why this is possible
0
0
1
@UbertiGavin
Gavin Uberti
9 months
Since KV vectors are re-used across queries, the amount of memory taken up by 10M tokens of KV cache is:
10 million * 80 layers * 128 values per head * 2 bytes per value = 204 GB of KV cache
Since we need to leave room for part of the weights too, this is about what we'd expect
1
0
1
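The same arithmetic as in the tweet above, as a short Python check; the layer count, head dimension, and BF16 width are the assumptions stated earlier in this thread.

# KV-cache footprint under the thread's assumptions (MQA, d_head = 128, BF16, 80 layers)
tokens = 10_000_000        # 10M-token context
layers = 80                # guessed from Gopher / Chinchilla
values_per_head = 128      # d_head for the single KV head under MQA
bytes_per_value = 2        # BF16

# counting one 128-wide vector per layer per token, as in the tweet
kv_cache_bytes = tokens * layers * values_per_head * bytes_per_value
print(f"{kv_cache_bytes / 1e9:.0f} GB")   # ~205 GB; the tweet rounds this to 204 GB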
@UbertiGavin
Gavin Uberti
9 months
While these three facts don’t prove anything, they do provide good evidence that the input data and the KV cache share the same memory, which suggests Gemini 1.5 achieves its impressive recall by being able to “look again” at relevant parts of the input.
1
0
1
@UbertiGavin
Gavin Uberti
22 days
@dhuang26 @VictorStinner It’s great! If only @numpy_team added the same thing to their stuff
0
0
1
@UbertiGavin
Gavin Uberti
9 months
Their paper claims 2.8M tokens for images/video, 2M tokens for audio, and 10M tokens for text. Given this, I bet Gemini 1.5 has standard multi-query attention, coupled with the ability to “look again” at tokens in its context window.
[image attached]
1
0
1
@UbertiGavin
Gavin Uberti
7 months
@danielhanchen @databricks @OpenAI What do you mean by mean removal?
1
0
1
@UbertiGavin
Gavin Uberti
9 months
Gemini 1.5’s paper says it is a “mixture-of-expert (MoE) Transformer-based model”. To generate a token with a Transformer, you don’t need the previous input values or the model’s internal state while running them. But you do need to keep the key and value vectors (the KV cache).
1
0
1
@UbertiGavin
Gavin Uberti
9 months
What about the audio? The paper says that “Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM)”, and USM’s architecture is public (TL;DR it has 128 Mel frequency bins).
1
0
1
@UbertiGavin
Gavin Uberti
4 months
@teortaxesTex We’re focusing exclusively on transformers (though I think multimodal and video models especially are great use cases). This is scary! Transformers could go away, and we are betting they will not.
1
0
1