We’re taking the biggest bet in AI - a chip that can only run transformers, but does so orders of magnitude faster than GPUs. Maybe attention *is* all you need…
Meet Sohu, the fastest AI chip of all time.
With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s.
Sohu is the first specialized chip (ASIC) for transformer models. By specializing, Sohu can run transformers orders of magnitude faster than GPUs.
The 21st century will be the most important century ever for humanity, thanks to the rapid advances in artificial intelligence. It makes no sense to sit on the sidelines in university.
I'm excited to join the latest class of Thiel Fellows
Congrats to Cerebras on the impressive results!
How do SRAM-only ASICs like it stack up against GPUs?
Spoiler: GPUs still rock for throughput, custom models, large models, and long prompts (common "prod" things). SRAM ASICs shine for pure generation.
Long 🧵
@cHHillee
Hey Horace! Big fan of your work on PyTorch. In the blog post on our website, we outline exactly how our benchmark is calculated:
> Benchmarks are for Llama-3 70B in FP8 precision: no sparsity, 8x model parallel, 2048 input/128 output lengths
@cis_female
You could in theory stuff *way more* than 5x the FLOPS on the chip. It only takes around 10k transistors to build an FP16 fused-multiply-accumulate circuit that can run at 2 GHz (i.e. 4 GFLOPS)
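To make that concrete, here's a hedged back-of-envelope sketch (the ~80B transistor budget is roughly an H100-class die; everything else is the approximation above, and it deliberately ignores the control logic and SRAM that a real chip needs, which is why no shipping chip gets anywhere close):

```python
# Rough upper bound on FP16 FLOPS if a die were nothing but FMA circuits.
# All numbers are approximations, not a real design.
transistors_per_fma = 10_000          # ~10k transistors per FP16 FMA
flops_per_fma = 2 * 2e9               # multiply + add per cycle at 2 GHz = 4 GFLOPS
die_transistors = 80e9                # roughly an H100-class transistor budget

fma_units = die_transistors / transistors_per_fma
peak_flops = fma_units * flops_per_fma
print(f"{fma_units:.0f} FMA units -> {peak_flops / 1e15:.0f} PFLOPS (theoretical)")
```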
Mistral just announced at
@SHACK15sf
that they will release a new model today:
Mistral 7B v0.2 Base Model
- 32k instead of 8k context window
- Rope Theta = 1e6
- No sliding window
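For reference, a minimal sketch of what those settings look like in a Hugging Face-style MistralConfig (field names are HF's usual ones; the values come from the bullets above, everything else is left at defaults):

```python
from transformers import MistralConfig

mistral_7b_v02_base = MistralConfig(
    max_position_embeddings=32 * 1024,  # 32k context window
    rope_theta=1e6,                     # RoPE theta = 1e6
    sliding_window=None,                # no sliding-window attention
)
```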
@swyx
@felix_red_panda
@GroqInc
They split each layer on 8 chips, and Llama-2-70B has 64 layers, which gets you to 512 chips. The remaining rack of 64 is for other overhead (e.g. de-embedding).
@cHHillee
We use this benchmark (2048/128 sequences) because it's what NVIDIA uses in their comparisons to AMD. Each of our chips is a reticle-sized 4nm die, and we put eight in each server. Would love to chat more in person!
@cis_female
you're not wrong - it is really hard (power consumption is also a huge issue). We're only able to make it work by being so specifically focused on transformers
Google open-sourced their Gemma models today, hyperparameters below. Both models have massive feed-forward hidden dimensions - almost every other model uses 3.5-4x the d_model (which would be 8192 and 12288). Not sure why the change.
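For scale, a quick sketch comparing the conventional 3.5-4x widening with a larger factor like the 8x Grok uses (see the Grok note further down); the gated, three-matrix (GeGLU-style) layout is an assumption for the parameter count:

```python
def ffn_hidden_and_params(d_model, widen):
    hidden = int(widen * d_model)
    params = 3 * d_model * hidden   # gate, up, and down projection matrices
    return hidden, params

for d_model in (2048, 3072):
    for widen in (3.5, 4.0, 8.0):
        hidden, params = ffn_hidden_and_params(d_model, widen)
        print(f"d_model={d_model} widen={widen}: hidden={hidden}, ~{params / 1e6:.0f}M params/layer")
```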
Google claims that Gemini 1.5 can remember *10 million* tokens at once with near-perfect accuracy. How is that even possible?
If you read the paper closely, two figures give it away. A 🧵:
@_akhaliq
Interesting how Grok chose a widening factor of 8, instead of the more conventional 4. Between that and the large vocab, it’s a lot like Gemma.
@Tim_Dettmers
Don't take our word for it - here are NVIDIA's official benchmarks for the 8xH100 and 8xH200 running sequences of 2048 input / 128 output tokens
8xH100: 22,735 input + output tokens/sec
8xH200: 24,393 input + output tokens/sec
@AravSrinivas
tanh(x) is a classic sigmoid activation function. This is computing 30 tanh(x/30), a scaled version of it (that’s also a sigmoid). In theory, it makes softmax “weaker” and lets it choose more values if many keys match the query well.
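For concreteness, a minimal PyTorch sketch of that soft-capping applied to attention logits (generic attention math with the cap of 30 discussed above; not Grok's actual code):

```python
import torch

def soft_capped_scores(q, k, cap=30.0):
    """Attention logits passed through cap * tanh(x / cap).

    Small logits pass through almost unchanged; large ones saturate near
    +/- cap, so softmax spreads weight over more keys instead of collapsing
    onto the single best match.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return cap * torch.tanh(scores / cap)
```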
@StabilityAI
Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- This takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- More technical details soon
@Tim_Dettmers
From NVIDIA’s TRT-LLM performance page:
> The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages), and shows the throughput client-server scenario under maximum load.
@fchollet
Strong disagree - I think LLM generation + symbolic checking will be enough to win the ARC-AGI challenge.
A few folks have already offered you some action, but I’d happily bet money on this.
Particularly interesting is the fact that they kept the same architecture as Llama-2, with the exception of the vocabulary. They looked into using an MoE model, but dense stuff just keeps winning.
@darrenangle
The biggest clue is the fact that the context size (limited by available memory) is different for audio, video, and images. This means they’re storing the inputs in memory and rerunning them.
@cis_female
(this is simplified a bit. Counting transistors isn't a great proxy since certain parts of an H100 like the SRAM memories are denser, so area numbers should be thought of in NAND gate equivalents. The 10k transistors per FMA is amortized since multi-input FMA blocks are denser)
@Suhail
@finbarrtimbers
@Etched
@MatXComputing
50% cheaper is probably enough to grind down NVIDIA's margins. See AMD's MI300X - because of competition, NVIDIA barely raised the price of the B100/B200 even though the BOM cost is much higher.
@teortaxesTex
This bet lets you remove a ton of control logic, but it also lets you remove a *ton* of SRAMs from the chip. A transformer has no temporal or spatial locality for the weights (each is used exactly once per token), so there is no point in having bulky L0 or L1 caches, for example.
@cis_female
However, any flexible/reprogrammable AI accelerator isn't able to "just add more FLOPS" since you need a *lot* of circuitry inside the tensor cores next to the FMA blocks to keep them fed.
@ai_for_success
The blog post doesn’t account for how the size of each video frame affects the compute needed. See the paper by Dehghani et al. Even if the blog post’s assumptions were right, the compute needed would be 2-10x higher.
@davisblalock
@Tim_Dettmers
Yep, from that page. The 1441 tokens/sec for FP8 Llama is the number of output tokens. We follow the convention of reporting input + output tokens. To convert, you multiply by (2048 + 128)/128 = 17
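In code form, the conversion is just (1441 output tokens/sec is the figure quoted above; the 2048/128 shape is the benchmark's):

```python
input_len, output_len = 2048, 128
output_tokens_per_sec = 1441                       # output-only rate from NVIDIA's page

total_tokens_per_sec = output_tokens_per_sec * (input_len + output_len) / output_len
print(total_tokens_per_sec)                        # 1441 * 17 = 24,497 input + output tokens/sec
```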
@appenz
@Tim_Dettmers
The average OpenOrca prompt (system_prompt + question) is under 300 tokens, which means attention for this benchmark requires much less memory bandwidth than for 2048 input/128 output.
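A rough sketch of why prompt length matters so much here, assuming Llama-3-70B-like attention shapes (80 layers, 8 KV heads, head_dim 128) and an FP8 KV cache; these are illustrative assumptions, not measured numbers:

```python
def kv_bytes_per_generated_token(context_len, layers=80, kv_heads=8,
                                 head_dim=128, bytes_per_val=1):
    # K and V for every cached position must be streamed in to produce one token.
    return context_len * layers * kv_heads * head_dim * 2 * bytes_per_val

print(kv_bytes_per_generated_token(300 + 128) / 1e6)    # ~70 MB  (~300-token prompts)
print(kv_bytes_per_generated_token(2048 + 128) / 1e6)   # ~357 MB (2048/128 benchmark)
```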
@jtvhk
@Suhail
@finbarrtimbers
@Etched
Being only for transformers gets you a lot more than that! It makes the market narrower, but you have to make some trade-off to get >10x.
@danielhanchen
@grok
This is a different kind of trick than exp(x - max(x)): the latter is mathematically equivalent to plain softmax, but the tanh trick isn’t. It seems intended more to “soften” softmax even further than to increase numerical stability.
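A quick numerical check of that claim (generic PyTorch, just to illustrate the point):

```python
import torch

x = torch.tensor([-40.0, -3.0, 0.0, 25.0, 60.0])          # widely spread logits

p_plain = torch.softmax(x, dim=-1)
p_max   = torch.softmax(x - x.max(), dim=-1)               # classic stability trick
p_tanh  = torch.softmax(30 * torch.tanh(x / 30), dim=-1)   # tanh capping

print(torch.allclose(p_plain, p_max))   # True: shifting by the max is exact
print(torch.allclose(p_plain, p_tanh))  # False: capping changes the distribution
```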
@ai_for_success
This is completely fake. The 5 minutes of video per hour per H100 is a total guess, made by arbitrarily assuming Sora is 30x larger than DiT. It also ignores the fact that the videos generated by Sora aren’t 512px by 512px.
@appenz
@Tim_Dettmers
Good question! That figure is a different benchmark (OpenOrca responses, which have ~300 token prompts), and is being run on a different system (their custom thermal solution (CTS), where they replace the heatsinks to enable the GPU to run a little bit faster).
(We know Gemini 1.5 runs on TPU V4s, which have 128x128 BF16 matrix multipliers. This lets us reasonably assume they’re running MQA with d_head = 128 in BF16, just like they explicitly state in their PaLM 540B paper)
We have to guess the number of layers in Gemini 1.5, but we don’t have to guess anything else. Let’s assume it’s 80, because the previous public DeepMind models (Gopher and Chinchilla) used 80 layers.
Since KV vectors are re-used across queries, the amount of memory taken up by 10M tokens of KV cache is:
10 million * 80 layers * 128 values per head * 2 bytes per value = 204 GB of KV cache
Since we need to leave room for part of the weights too, this is about what we'd expect
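The same arithmetic as a tiny sanity-check script (using the thread's assumptions: 80 layers, d_head = 128, 2-byte BF16 values):

```python
def kv_cache_gb(tokens, layers=80, d_head=128, bytes_per_value=2):
    # One 128-dim cached vector per layer per token, 2 bytes each (BF16).
    return tokens * layers * d_head * bytes_per_value / 1e9

print(kv_cache_gb(10_000_000))   # ~204.8 GB for a 10M-token context
```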
While these three facts don’t prove anything, they do provide good evidence that the input data and the KV cache share the same memory, which is good evidence that Gemini 1.5 achieves its impressive recall by being able to “look again” at relevant parts of the input.
Their paper claims 2.8M tokens for images/video, 2M tokens for audio, and 10M tokens for text.
Given this, I bet Gemini 1.5 has standard multi-query attention, coupled with the ability to “look again” at tokens in its context window.
Gemini 1.5’s paper says it is a “mixture-of-expert (MoE) Transformer-based model”. To generate a token with a Transformer, you don’t need the previous input values or the model’s internal state while running them. But you do need to keep the key and value vectors (the KV cache).
What about the audio? The paper says that “Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM)”, and USM’s architecture is public (TL;DR it has 128 Mel frequency bins).
@teortaxesTex
We’re focusing exclusively on transformers (though I think multimodal and video models especially are great use cases). This is scary! Transformers could go away, and we are betting they will not.