We’re taking the biggest bet in AI - a chip that can only run transformers, but does so orders of magnitude faster than GPUs. Maybe attention *is* all you need…
Meet Sohu, the fastest AI chip of all time.
With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s.
Sohu is the first specialized chip (ASIC) for transformer models. By specializing, Sohu can run transformers orders of magnitude faster than GPUs.
The 21st century will be the most important century ever for humanity, thanks to the rapid advances in artificial intelligence. It makes no sense to sit on the sidelines in university.
I'm excited to join the latest class of Thiel Fellows
Congrats to Cerebras on the impressive results!
How do SRAM-only ASICs like it stack up against GPUs?
Spoiler: GPUs still rock for throughput, custom models, large models, and long prompts (common "prod" things). SRAM ASICs shine for pure generation.
Long 🧵
@cHHillee
Hey Horace! Big fan of your work on PyTorch. In the blog post on our website, we outline exactly how our benchmark is calculated:
> Benchmarks are for Llama-3 70B in FP8 precision: no sparsity, 8x model parallel, 2048 input/128 output lengths
@cis_female
You could in theory stuff *way more* than 5x the FLOPS on the chip. It only takes around 10k transistors to build an FP16 fused-multiply-accumulate circuit that can run at 2 GHz (i.e. 4 GFLOPS)
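To make that concrete, here's a hedged back-of-envelope sketch (the ~80B transistor budget is roughly an H100-class die; everything else is the approximation above, and it deliberately ignores the control logic and SRAM that a real chip needs, which is why no shipping chip gets anywhere close):

```python
# Rough upper bound on FP16 FLOPS if a die were nothing but FMA circuits.
# All numbers are approximations, not a real design.
transistors_per_fma = 10_000          # ~10k transistors per FP16 FMA
flops_per_fma = 2 * 2e9               # multiply + add per cycle at 2 GHz = 4 GFLOPS
die_transistors = 80e9                # roughly an H100-class transistor budget

fma_units = die_transistors / transistors_per_fma
peak_flops = fma_units * flops_per_fma
print(f"{fma_units:.0f} FMA units -> {peak_flops / 1e15:.0f} PFLOPS (theoretical)")
```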
Mistral just announced at
@SHACK15sf
that they will release a new model today:
Mistral 7B v0.2 Base Model
- 32k instead of 8k context window
- Rope Theta = 1e6
- No sliding window
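For reference, a minimal sketch of what those settings look like in a Hugging Face-style MistralConfig (field names are HF's usual ones; the values come from the bullets above, everything else is left at defaults):

```python
from transformers import MistralConfig

mistral_7b_v02_base = MistralConfig(
    max_position_embeddings=32 * 1024,  # 32k context window
    rope_theta=1e6,                     # RoPE theta = 1e6
    sliding_window=None,                # no sliding-window attention
)
```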
@swyx
@felix_red_panda
@GroqInc
They split each layer on 8 chips, and Llama-2-70B has 64 layers, which gets you to 512 chips. The remaining rack of 64 is for other overhead (e.g. de-embedding).
@cHHillee
We use this benchmark (2048/128 sequences) because it's what NVIDIA uses in their comparisons to AMD. Each of our chips is a reticle-sized 4nm die, and we put eight in each server. Would love to chat more in person!
@cis_female
you're not wrong - it is really hard (power consumption is also a huge issue). We're only able to make it work by being so specifically focused on transformers
Google open-sourced their Gemma models today, hyperparameters below. Both models have massive feed-forward hidden dimensions - almost every other model uses 3.5-4x the d_model (which would be 8192 and 12288). Not sure why the change.
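For scale, a quick sketch comparing the conventional 3.5-4x widening with a larger factor like the 8x Grok uses (see the Grok note further down); the gated, three-matrix (GeGLU-style) layout is an assumption for the parameter count:

```python
def ffn_hidden_and_params(d_model, widen):
    hidden = int(widen * d_model)
    params = 3 * d_model * hidden   # gate, up, and down projection matrices
    return hidden, params

for d_model in (2048, 3072):
    for widen in (3.5, 4.0, 8.0):
        hidden, params = ffn_hidden_and_params(d_model, widen)
        print(f"d_model={d_model} widen={widen}: hidden={hidden}, ~{params / 1e6:.0f}M params/layer")
```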
Google claims that Gemini 1.5 can remember *10 million* tokens at once with near-perfect accuracy. How is that even possible?
If you read the paper closely, two figures give it away. A 🧵:
@_akhaliq
Interesting how Grok chose a widening factor of 8, instead of the more conventional 4. Between that and the large vocab, it’s a lot like Gemma.
@Tim_Dettmers
Don't take our word for it - here are NVIDIA's official benchmarks for the 8xH100 and 8xH200 running sequences of 2048 input / 128 output tokens
8xH100: 22,735 input + output tokens/sec
8xH200: 24,393 input + output tokens/sec
@AravSrinivas
tanh(x) is a classic sigmoid activation function. This is computing 30 tanh(x/30), a scaled version of it (that’s also a sigmoid). In theory, it makes softmax “weaker” and lets it choose more values if many keys match the query well.
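For concreteness, a minimal PyTorch sketch of that soft-capping applied to attention logits (generic attention math with the cap of 30 discussed above; not Grok's actual code):

```python
import torch

def soft_capped_scores(q, k, cap=30.0):
    """Attention logits passed through cap * tanh(x / cap).

    Small logits pass through almost unchanged; large ones saturate near
    +/- cap, so softmax spreads weight over more keys instead of collapsing
    onto the single best match.
    """
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return cap * torch.tanh(scores / cap)
```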
@StabilityAI
Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- This takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- More technical details soon
@Tim_Dettmers
From NVIDIA’s TRT-LLM performance page:
> The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages), and shows the throughput client-server scenario under maximum load.
@fchollet
Strong disagree - I think LLM generation + symbolic checking will be enough to win the ARC-AGI challenge.
A few folks have already offered you some action, but I’d happily bet money on this.
Particularly interesting is the fact that they kept the same architecture as Llama-2, with the exception of the vocabulary. They looked into using an MoE model, but dense stuff just keeps winning.
@darrenangle
The biggest clue is the fact that the context size (limited by available memory) is different for audio, video, and images. This means they’re storing the inputs in memory and rerunning them.
@cis_female
(this is simplified a bit. Counting transistors isn't a great proxy since certain parts of an H100 like the SRAM memories are denser, so area numbers should be thought of in NAND gate equivalents. The 10k transistors per FMA is amortized since multi-input FMA blocks are denser)
@Suhail
@finbarrtimbers
@Etched
@MatXComputing
50% cheaper is probably enough to grind down NVIDIA's margins. See AMD's MI300X - because of competition, NVIDIA barely raised the price of the B100/B200 even though the BOM cost is much higher.
@teortaxesTex
This bet lets you remove a ton of control logic, but it also lets you remove a *ton* of SRAMs from the chip. A transformer has no temporal or spatial locality for the weights (each is used exactly once per token), so there is no point in having bulky L0 or L1 caches, for example.
@cis_female
However, any flexible/reprogrammable AI accelerator isn't able to "just add more FLOPS" since you need a *lot* of circuitry inside the tensor cores next to the FMA blocks to keep them fed.
@ai_for_success
The blog post doesn’t account for how the size of each video frame affects the compute needed. See the paper by Dehghani et al. Even if the blog post’s assumptions were right, the compute needed would be 2-10x higher.
@davisblalock
@Tim_Dettmers
Yep, from that page. The 1441 tokens/sec for FP8 Llama is the number of output tokens. We follow the convention of reporting input + output tokens. To convert, you multiply by (2048 + 128)/128 = 17
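In code form, the conversion is just (1441 output tokens/sec is the figure quoted above; the 2048/128 shape is the benchmark's):

```python
input_len, output_len = 2048, 128
output_tokens_per_sec = 1441                       # output-only rate from NVIDIA's page

total_tokens_per_sec = output_tokens_per_sec * (input_len + output_len) / output_len
print(total_tokens_per_sec)                        # 1441 * 17 = 24,497 input + output tokens/sec
```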
@appenz
@Tim_Dettmers
The average OpenOrca prompt (system_prompt + question) is under 300 tokens, which means attention for this benchmark requires much less memory bandwidth than for 2048 input/128 output.
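A rough sketch of why prompt length matters so much here, assuming Llama-3-70B-like attention shapes (80 layers, 8 KV heads, head_dim 128) and an FP8 KV cache; these are illustrative assumptions, not measured numbers:

```python
def kv_bytes_per_generated_token(context_len, layers=80, kv_heads=8,
                                 head_dim=128, bytes_per_val=1):
    # K and V for every cached position must be streamed in to produce one token.
    return context_len * layers * kv_heads * head_dim * 2 * bytes_per_val

print(kv_bytes_per_generated_token(300 + 128) / 1e6)    # ~70 MB  (~300-token prompts)
print(kv_bytes_per_generated_token(2048 + 128) / 1e6)   # ~357 MB (2048/128 benchmark)
```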
@jtvhk
@Suhail
@finbarrtimbers
@Etched
Being only for transformers gets you a lot more than that! It makes the market narrower, but you have to make some trade-off to get >10x.
@danielhanchen
@grok
This is a different kind of trick than exp(x - max(x)): the latter is mathematically equivalent to plain softmax, but the tanh trick isn’t. It seems intended more to “soften” softmax even further than to increase numerical stability.
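A quick numerical check of that claim (generic PyTorch, just to illustrate the point):

```python
import torch

x = torch.tensor([-40.0, -3.0, 0.0, 25.0, 60.0])          # widely spread logits

p_plain = torch.softmax(x, dim=-1)
p_max   = torch.softmax(x - x.max(), dim=-1)               # classic stability trick
p_tanh  = torch.softmax(30 * torch.tanh(x / 30), dim=-1)   # tanh capping

print(torch.allclose(p_plain, p_max))   # True: shifting by the max is exact
print(torch.allclose(p_plain, p_tanh))  # False: capping changes the distribution
```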
@ai_for_success
This is completely fake. The 5 minutes of video per hour per H100 is a total guess, made by arbitrarily assuming Sora is 30x larger than DiT. It also ignores the fact that the videos generated by Sora aren’t 512px by 512px.
@appenz
@Tim_Dettmers
Good question! That figure is a different benchmark (OpenOrca responses, which have ~300 token prompts), and is being run on a different system (their custom thermal solution (CTS), where they replace the heatsinks to enable the GPU to run a little bit faster).
(We know Gemini 1.5 runs on TPU V4s, which have 128x128 BF16 matrix multipliers. This lets us reasonably assume they’re running MQA with d_head = 128 in BF16, just like they explicitly state in their PaLM 540B paper)
We have to guess the number of layers in Gemini 1.5, but we don’t have to guess anything else. Let’s assume it’s 80, because the previous public DeepMind models (Gopher and Chinchilla) used 80 layers.
Since KV vectors are re-used across queries, the amount of memory taken up by 10M tokens of KV cache is:
10 million * 80 layers * 128 values per head * 2 bytes per value = 204 GB of KV cache
Since we need to leave room for part of the weights too, this is about what we'd expect
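The same arithmetic as a tiny sanity-check script (using the thread's assumptions: 80 layers, d_head = 128, 2-byte BF16 values):

```python
def kv_cache_gb(tokens, layers=80, d_head=128, bytes_per_value=2):
    # One 128-dim cached vector per layer per token, 2 bytes each (BF16).
    return tokens * layers * d_head * bytes_per_value / 1e9

print(kv_cache_gb(10_000_000))   # ~204.8 GB for a 10M-token context
```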
While these three facts don’t prove anything, they do provide good evidence that the input data and the KV cache share the same memory, which is good evidence that Gemini 1.5 achieves its impressive recall by being able to “look again” at relevant parts of the input.
Their paper claims 2.8M tokens for images/video, 2M tokens for audio, and 10M tokens for text.
Given this, I bet Gemini 1.5 has standard multi-query attention, coupled with the ability to “look again” at tokens in its context window.
Gemini 1.5’s paper says it is a “mixture-of-expert (MoE) Transformer-based model”. To generate a token with a Transformer, you don’t need the previous input values or the model’s internal state while running them. But you do need to keep the key and value vectors (the KV cache).
What about the audio? The paper says that “Gemini can directly ingest audio signals at 16kHz from Universal Speech Model (USM)”, and USM’s architecture is public (TL;DR it has 128 Mel frequency bins).
@teortaxesTex
We’re focusing exclusively on transformers (though I think multimodal and video models especially are great use cases). This is scary! Transformers could go away, and we are betting they will not.