I'm taking a sabbatical through the end of the year. First time to relax since co-founding OpenAI 9 years ago. The mission is far from complete; we still have a safe AGI to build.
A 7B parameter LLM consumes ~0.7 J/token. A fully charged iPhone, with ~50 kJ of energy, can sustain this LLM in conversation for <2 hours at a rate of 10 tok/s, with every 64 tokens draining ~0.1% of the battery.
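The arithmetic above is easy to sanity-check. A quick sketch, using only the figures from the text (0.7 J/token, 50 kJ battery, 10 tok/s):

```rust
// Back-of-envelope check of the iPhone LLM energy figures.
fn power_watts(joules_per_token: f64, tokens_per_second: f64) -> f64 {
    // 0.7 J/token * 10 tok/s = 7 W of draw.
    joules_per_token * tokens_per_second
}

fn runtime_hours(battery_joules: f64, watts: f64) -> f64 {
    // 50 kJ / 7 W ≈ 7143 s ≈ 1.98 h, i.e. just under 2 hours.
    battery_joules / watts / 3600.0
}

fn battery_pct_per_n_tokens(n: f64, joules_per_token: f64, battery_joules: f64) -> f64 {
    // 64 tokens * 0.7 J = 44.8 J ≈ 0.09% of a 50 kJ battery.
    n * joules_per_token / battery_joules * 100.0
}

fn main() {
    let watts = power_watts(0.7, 10.0);
    println!("draw: {watts} W");
    println!("runtime: {:.2} h", runtime_hours(50_000.0, watts));
    println!(
        "64 tokens: {:.3}% of battery",
        battery_pct_per_n_tokens(64.0, 0.7, 50_000.0)
    );
}
```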
🏎️ Introducing Whisper Turbo 🏎️
Engineered from scratch in Rust + WebGPU. Transcribe 20x faster than realtime - all in the browser!
Here's why it's a game-changer:
The web is the best platform for building fun tools.
- FFMPEG compiled to WASM can convert ~any input format to WAV for Whisper.
- Ratchet runs super fast Whisper entirely on the GPU to generate the transcript.
- Runs on any hardware, offline, no install.
🚨 whisper-turbo v0.8 release 🚨
What's new?
All models have slimmed down 🏋️
Tiny: 150M -> 51M
Base: 280M -> 97M
Small: 650M -> 310M
Medium: 1.75G -> 970M
Memory leaks == history 🪣
You can now transcribe *very* long samples.
Check it out on GH:
Excited to announce that I have joined Hugging Face 🤗
Looking forward to bringing some of the work I've done over the past 6 months to many more devices!
🚨 Phi-3 running in the browser 🚨
Hits about 20 tok/s 🏎️ Literally 3 lines of JS.
Still some kinks to iron out, coming to Ratchet 0.4.0 soon.
EMBD is the fastest embedding library available.
EMBD: 328.44 ms 🥇
ONNX: 553.31 ms 🥈
GGML+CPU: 812.03 ms 🥉
GGML+GPU: 985.82 ms
Candle: 1044.90 ms
One guy vs OSS community vs MSFT
Come and work with me!
We are hiring for an Apple-focused on-device ML engineer.
If you're a strong Swift developer & have worked with CoreML, MLX or Metal we want to hear from you!
Chrome 113 is here - finally ML models on WebGPU 🎉
To celebrate we've shipped:
- FLAN-Alpaca by @soujanyaporia
- Generation modifiers
- A brand new look
Try it now:
Or build it into your web app instantly:
#WebGPU
#LLM
#Rust
🚨 Ratchet reaches alpha! 🚨
With today's release of Distil Whisper Large V3 by @sanchitgandhi99, Ratchet officially enters alpha.
Check out this demo running large-v3 (!!) in the browser
My Rust + WebGPU ML framework is now fully OSS and capable of running Whisper (very slowly for now 🐌)
Tons of low hanging fruit, and more models to be implemented!
Start contributing:
Say goodbye to the GPT-4 loading wheel.
I've built a lightning-fast document editor with a local LLM in the browser using a Rust + WebGPU ML runtime.
Check out my upcoming blog post for all the details.
#LLM
#rustlang
#WebGPU
Just published the very alpha version of EMBD
Supports huge batches of text, pretty much instant on the GPU.
Perfect for RAG or search.
Benchmarks + polished beta coming ~next week.
Very excited for whisper-turbo 0.9.0.
- Finally match OAI logit for logit. Essential for speculative decoding.
- A single function to call now - `transcribe()`
- Massive CI overhaul + end-to-end tests on Windows, macOS & Linux.
Real time is on the hit list next 🎯
Introducing Laserbeak: an NPM library to integrate LLMs into your Web/Electron apps instantly!
Powered by my Rust + WebGPU ML runtime, you can now run 250M+ param models in the browser!
Demo:
Repo:
#Rust
#WebGPU
#ML
🚨 Whisper-Turbo 0.10.0 is out 🚨
Huge release with tons of features:
- Full multilingual support - check out the translation of Bong Joon-ho discussing directing Parasite 🇰🇷
- 100% token-for-token match with OAI
- Brand new docs site
The CoreML release for iOS 18 & macOS Sequoia was all about LLMs.
Stateful Models + Multifunction are pretty exciting.
Check out Mistral 7B running in 4-bit at 35 tok/s on my machine!
Ratchet just works in the latest version of Safari Technology Preview - runs distil-whisper-large with ease 🏎️
By the end of this year, all major browsers will have support.
Shipped 0.7.0 of Whisper Turbo.
Ridiculously fast, cross-platform - and no bugs!
Once we have real-time streaming from the 🎙️, people will start building amazing stuff on top of it.
Forward pass 84ms -> 58ms ⏰
Serious performance boost from my custom GEMV kernel.
Looking forward to packaging this up nicely and getting it out 🥳
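For context, GEMV (matrix-vector multiply) dominates single-token decoding, which is why a custom kernel pays off here. A scalar Rust sketch of the operation such a kernel parallelizes - illustrative only, not the actual WebGPU implementation:

```rust
/// y = A * x for a row-major (m x n) matrix A. On the GPU this is the op a
/// GEMV kernel computes, typically with one workgroup per output row.
fn gemv(a: &[f32], x: &[f32], m: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * n);
    assert_eq!(x.len(), n);
    (0..m)
        .map(|row| {
            // Dot product of one matrix row with the input vector.
            a[row * n..(row + 1) * n]
                .iter()
                .zip(x)
                .map(|(w, v)| w * v)
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    // 2x3 matrix times a length-3 vector.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0, 0.0, -1.0];
    println!("{:?}", gemv(&a, &x, 2, 3)); // [-2.0, -2.0]
}
```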
Exciting news! The new LaMini 🦙 models by @AlhamFikri & co are seriously impressive - outperforming LLaMA 7B.
I've added support so they run entirely in-browser!
Test them out on the playground now and share your thoughts:
#LLM
#Rust
#WebGPU
#AI
Spent the past 2 days hacking away at F16 support - the test shader now runs!
This means f16 will land in 🔥🦀 & will allow us to accelerate Ratchet.
Good for the Rust 🤝 ML ecosystem!
200ms to go to match the GGML decoder! 🔥
Encoder is now ~2.7x faster 🏎️
Memory management is mostly solved, just KV caching the self-attention and some shader optimizations left!
SAM 2 🤝 CoreML
Check out Segment Anything 2 from @AIatMeta running on the Neural Engine via SAM 2 Studio - our native macOS app to make segmenting images easy!
Really enjoyed building this with @pcuenq & @cyrilzakka ✌️
Very grateful to a few folks:
- @xenovacom for working with me throughout the process.
- Mathieu Poumeyrol who built Tract which set me on this road.
- Connor Fitzgerald for his patience schooling me on GPUs & wgpu.
- EleutherAI off-topic, you guys are awesome!
Given Whisper's small size, even large-v2 can be run on consumer hardware faster than real time ⚡️
With a bunch of added benefits like:
1. Real-time streaming
2. ~Zero latency
3. 100% private
Took a few days, but I've now implemented quantised models in the browser!
~3x memory reduction and near-instant loading times ⏱️
Thanks to @CarsonPoole for the pointers!
Try out the models:
Decoder is 2x faster than before, but still 20x slower than GGML.
Encoder is 2x faster than GGML.
Decoder is slow for 2 reasons:
1. No KV caching.
2. Memory management.
Once buffer reuse ships I expect a 10x speedup 🏎️
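The KV-caching fix in reason 1 boils down to this: instead of recomputing every position's keys and values on each decode step, append one step's worth and reuse the rest. A minimal sketch of the idea (hypothetical shapes, not Ratchet's actual API):

```rust
/// Minimal per-layer KV cache: each decode step appends one key vector and
/// one value vector rather than recomputing attention inputs for the whole
/// prefix - turning O(seq_len) work per step into O(1) appends.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per generated position
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Called once per decode step with the current token's K/V projections.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn seq_len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        // In a real decoder these come from the model's K/V projections.
        cache.append(vec![step as f32; 4], vec![step as f32; 4]);
    }
    println!("cached positions: {}", cache.seq_len()); // 3
}
```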
@Vjeux Whisper Turbo is the fastest Whisper implementation in the browser:
Real time is still on the roadmap; it's quite a lot of development work to do it right.
Next release of whisper-turbo takes it from a buggy toy to something pretty useful.
- No more memory leaks.
- 2x+ file size reduction.
- And as always, speed improvements.
Super excited for people to start building on top of it!
This comment on HN stood out to me.
NNs seem to be trending towards dynamism:
Static Shapes (ResNet etc) -> Dynamic Shapes (Transformers seq_len) -> Mixture of Experts -> Mixture of Depths -> ???
This is pretty interesting, because even the iPhone 6S shipped with a 703 Wh/L battery:
That's still pretty much SOTA for lithium-ion (~800 Wh/L), despite the phone being released 8 years ago.
Pretty cool how easy it is to export a Next site -> @huggingface space.
Check out whisper-turbo on HF - now with mic support!
I'll be shipping Distil-Whisper by @sanchitgandhi99 to the space next week!
Google is pushing for DP4A (INT8 dot product of 4 elements & accumulate) to land soon in WebGPU.
This has the sole purpose of seriously speeding up quantized neural nets.
Wonder what they're working on?
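The instruction's semantics are simple to state: unpack four signed 8-bit lanes from each 32-bit word, dot them, and add to a 32-bit accumulator. A scalar Rust model of what the hardware does in a single instruction:

```rust
/// Scalar model of DP4A: four i8 lanes per 32-bit word, dotted and added to
/// an i32 accumulator - one instruction on supporting GPUs.
fn dp4a(a: u32, b: u32, acc: i32) -> i32 {
    let mut sum = acc;
    for lane in 0..4 {
        // Extract the lane byte and sign-extend i8 -> i32.
        let x = (a >> (8 * lane)) as u8 as i8 as i32;
        let y = (b >> (8 * lane)) as u8 as i8 as i32;
        sum += x * y;
    }
    sum
}

fn main() {
    // Pack [1, 2, 3, 4] and [5, 6, 7, 8] into little-endian 32-bit words.
    let a = u32::from_le_bytes([1, 2, 3, 4]);
    let b = u32::from_le_bytes([5, 6, 7, 8]);
    // 1*5 + 2*6 + 3*7 + 4*8 = 70
    println!("{}", dp4a(a, b, 0)); // 70
}
```

For quantized nets this means four INT8 weight-activation products per instruction instead of four separate widen-multiply-adds, which is exactly where the speedup comes from.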
Imagine a robust, small box with a low wattage NPU.
Fit it with a solar panel and pack an LLM in there and you have this little monument to humans that could outlast us.
As chip development progresses toward faster, more efficient NPU architectures, people are starting to notice the massive amount of untapped compute sitting right there in their devices. We're building the tools to make that compute accessible to everyone.
Not everything needs to be corporate - having a ton of fun with the Whisper Turbo demo.
Expect to see real-time Whisper Small+ in the browser next week
Cross-attention caching is live 🔥
Finally below 300ms! I think beating 36ms will be hard, as CPU -> GPU communication will be a larger % of runtime than for larger models.
After I ship self-attention KV caching I'll move on to shipping it 🚢
whisper-turbo 0.10.0 comes out next week with a metric ton of improvements.
Best of all - the new docs site (WIP):
Slowly slowly approaching utility 🛠️
Building ratchet was super fun, and I'm very proud of what I've achieved.
However, working completely solo for 6 months can stunt your creativity & certainly reduces the scope of what you can build.
Long story short - I'm on the job market!
Shipped F16 to whisper-turbo 🏎️
237ms -> 179ms ⬇️
I'm stopping here for a few reasons:
1. Tiny + short sample is the WORST case for GPU.
2. For real-time, encode speed >> decode speed.
3. It is already the fastest Whisper in the browser.
🚨 whisper-turbo 0.5.3 🚨
🪟 Full Windows support
🎼 Extra long samples
🐛 Huge amount of bug fixes
We should be the fastest Whisper implementation on Windows now 🔥
Check it out:
(And we secured the .com)
Now that whisper-turbo is stable I'm excited to move on to other models.
Ratchet intends to explore what cross-platform ML looks like, with a focus on DX.
I'm looking for collaborators - my subpar GPU programming skills are leaving at least a 5x speedup on the table.
Just released whisper-turbo v0.3.0 featuring:
🎵 Audio transcoding to enable transcription of mp3, mp4, m4a, wav and aac files.
🎼 Support for samples longer than 30 seconds (WIP)
⭐️ the repo to follow along as we approach beta
Cooked up a proc macro that lets you embed WGSL in your Rust!
It allows passing values from Rust into the WGSL source, which makes constructing kernels much easier 🔥
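A rough illustration of what such a macro buys you - Rust values spliced into shader source. All names here are hypothetical, and the real proc macro does the splicing at compile time; this sketch emulates the effect with a runtime `format!` just to show the shape of the generated kernel:

```rust
/// Sketch: generate WGSL source with Rust values (workgroup size, vector
/// width) spliced in. A proc macro would do this at compile time and could
/// also validate the shader; format! is just the simplest way to show it.
fn kernel_source(workgroup_size: u32, vec_width: u32) -> String {
    format!(
        r#"
@group(0) @binding(0) var<storage, read_write> x: array<vec{vec_width}<f32>>;

@compute @workgroup_size({workgroup_size})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {{
    // kernel body elided
}}
"#
    )
}

fn main() {
    // Specialize one kernel template for two different configurations.
    let src = kernel_source(256, 4);
    assert!(src.contains("@workgroup_size(256)"));
    assert!(src.contains("array<vec4<f32>>"));
    println!("{src}");
}
```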
Humbling when you realise you've lost your grasp on a foundational piece of knowledge. Worst thing to do in that situation is to avoid it.
Rotary Embeddings made me work my way back up through complex numbers, rotations & sinusoidal embeddings.
@bojone1993 is clearly a genius.
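The chain above (complex numbers -> rotations -> sinusoidal embeddings) lands at a compact formula: RoPE rotates each (even, odd) feature pair by an angle theta_i = pos * base^(-2i/d), i.e. multiplies it by e^(i*theta) as a complex number. A minimal sketch:

```rust
/// Rotary position embedding: rotate each feature pair (x[2i], x[2i+1]) by
/// theta_i = pos * base^(-2i/d). Each pair is a 2-D rotation, equivalent to
/// multiplying the pair, viewed as a complex number, by e^(i * theta).
fn rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0);
    for i in 0..d / 2 {
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        // Standard 2-D rotation of the pair by theta.
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut x = vec![1.0, 0.0, 1.0, 0.0];
    rope(&mut x, 1, 10000.0);
    // Pair 0 rotates by 1 rad; pair 1 by 10000^(-1/2) = 0.01 rad.
    println!("{x:?}");
}
```

Because each pair is only rotated, norms are preserved and the dot product between two rotated vectors depends only on their relative position, which is the property that makes RoPE work in attention.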
Working on something pretty cool.
GGML loader already works.
Compute graph is well defined.
Only thing left is a mini symbolic algebra library and some fast Q4 shaders.