Fleetwood

@fleetwood___

1,668 Followers
779 Following
134 Media
508 Statuses

MLE 🤗

London
Joined January 2013
@fleetwood___
Fleetwood
4 months
Best tiled matmul animation I've found on the internet. Thanks @wentasah
13
215
2K
@fleetwood___
Fleetwood
2 months
Under a week on @gdb's sabbatical before he cracked
Tweet media one
@gdb
Greg Brockman
2 months
I'm taking a sabbatical through end of year. First time to relax since co-founding OpenAI 9 years ago. The mission is far from complete; we still have a safe AGI to build.
594
486
8K
22
20
882
@fleetwood___
Fleetwood
3 months
A 7B parameter LLM consumes 0.7 J/token. A fully charged iPhone, with ~50kJ of energy, can sustain this LLM in conversation for <2 hours at a rate of 10 tok/s, with every 64 tokens draining 0.2% of the battery.
14
33
465
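A quick back-of-envelope check of the tweet's numbers, using only the figures it states (0.7 J/token, a ~50 kJ battery, 10 tok/s):

```python
# Sanity check of the tweet's figures: 0.7 J/token, ~50 kJ battery, 10 tok/s.
energy_per_token_j = 0.7
battery_j = 50_000
rate_tok_s = 10

total_tokens = battery_j / energy_per_token_j      # ~71,400 tokens per charge
runtime_hours = total_tokens / rate_tok_s / 3600   # ~1.98 h, i.e. < 2 hours
print(f"{total_tokens:.0f} tokens, {runtime_hours:.2f} hours")
```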
@fleetwood___
Fleetwood
1 year
🏎️ Introducing Whisper Turbo 🏎️ Engineered from scratch in Rust + WebGPU. Transcribe 20x faster than realtime - all in the browser! Here's why it's a game-changer:
19
67
421
@fleetwood___
Fleetwood
3 months
The web is the best platform for building fun tools. - FFMPEG compiled to WASM can convert ~any input format to WAV for Whisper. - Ratchet runs super fast Whisper entirely on the GPU to generate the transcript. - Runs on any hardware, offline, no install.
6
41
349
@fleetwood___
Fleetwood
1 year
🚨 whisper-turbo v0.8 release 🚨 What's new? All models have slimmed down 🏋️ Tiny: 150M -> 51M Base: 280M -> 97M Small: 650M -> 310M Medium: 1.75G -> 970M Memory leaks == history 🪣 You can now transcribe *very* long samples. Check it out on GH:
11
46
322
@fleetwood___
Fleetwood
9 months
Excited to announce that I have joined Hugging Face 🤗 Looking forward to bringing some of the work I've done over the past 6 months to many more devices!
23
15
234
@fleetwood___
Fleetwood
6 months
🚨 Phi-3 running in the browser 🚨 Hits about 20 tok/s 🏎️ Literally 3 lines of JS. Still some kinks to iron out, coming to Ratchet 0.4.0 soon.
4
37
188
@fleetwood___
Fleetwood
11 months
EMBD is the fastest embedding library available. EMBD: 328.44 ms 🥇 ONNX: 553.31 ms 🥈 GGML+CPU: 812.03 ms 🥉 GGML+GPU: 985.82 ms Candle: 1.0449s One guy vs OSS community vs MSFT
Tweet media one
10
13
164
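The relative slowdowns implied by the benchmark above can be computed directly; the times are taken verbatim from the tweet (Candle's 1.0449 s converted to ms):

```python
# Slowdown of each library relative to EMBD, using the times quoted in the
# tweet (lower is better; all values in milliseconds).
times_ms = {
    "EMBD": 328.44,
    "ONNX": 553.31,
    "GGML+CPU": 812.03,
    "GGML+GPU": 985.82,
    "Candle": 1044.9,
}
baseline = times_ms["EMBD"]
slowdown = {name: t / baseline for name, t in times_ms.items()}
for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x")
```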
@fleetwood___
Fleetwood
6 months
🚨 Phi-3 is now live on a Space! 🚨 Run a 3.8B parameter model offline, entirely in the browser 🏎️ Try it out for yourself:
5
30
144
@fleetwood___
Fleetwood
5 months
Come and work with me! We are hiring for an Apple focused On-Device ML engineer. If you're a strong Swift developer & have worked with CoreML, MLX or Metal we want to hear from you!
2
23
133
@fleetwood___
Fleetwood
1 year
Chrome 113 is here - finally ML models on WebGPU 🎉 To celebrate we've shipped: 🦙 FLAN-Alpaca by @soujanyaporia 📝 Generation modifiers 👁️ A brand new look Try it now: Or build it into your web app instantly: #WebGPU #LLM #Rust
3
15
102
@fleetwood___
Fleetwood
2 years
Summarising the @OpenAI charter in the browser with Rust + WebGPU. 250M parameters but still fast 🚀 The race is on against @Microsoft before they release their WebGPU backend for ONNX. @visheratin @xenovacom @altryne
4
10
100
@fleetwood___
Fleetwood
5 months
Amazing diagram from @vrushankdes Burn this into your mind and remember it every time you're writing code that ends up on a GPU
2
11
87
@fleetwood___
Fleetwood
4 months
Apple's new code completion model for XCode is 2GB 👀
Tweet media one
3
2
85
@fleetwood___
Fleetwood
7 months
🚨 Ratchet reaches alpha! 🚨 With today's release of Distil Whisper Large V3 by @sanchitgandhi99, Ratchet officially enters alpha. Check out this demo running large-v3 (!!) in the browser
7
12
86
@fleetwood___
Fleetwood
4 months
Run CoreML models on the Neural Engine seamlessly. Introducing deCoreML 🍎
6
15
79
@fleetwood___
Fleetwood
8 months
My Rust + WebGPU ML framework is now fully OSS and capable of running Whisper (very slowly for now 🐌) Tons of low hanging fruit, and more models to be implemented! Start contributing:
1
10
79
@fleetwood___
Fleetwood
2 years
Say goodbye to the GPT-4 loading wheel 🙅‍♂️ I've built a lightning-fast document editor with a local LLM in the browser using Rust + WebGPU ML runtime 🚀 Check out my upcoming blog post for all the details. #LLM #rustlang #WebGPU
3
10
76
@fleetwood___
Fleetwood
11 months
Just published the very alpha version of EMBD Supports huge batches of text, pretty much instant on the GPU. Perfect for RAG or search. Benchmarks + polished beta coming ~next week.
1
4
60
@fleetwood___
Fleetwood
6 months
Took Phi2 from 9 tok/s to 27 tok/s in 12 days. 🚨 Web demo coming this week 🚨
@fleetwood___
Fleetwood
6 months
A week of absolute struggle but Phi2 officially runs on Ratchet 🎺 Pretty sluggish right now 🐌 but lots of optimisation to come.
4
6
38
4
7
60
@fleetwood___
Fleetwood
7 months
Long & difficult optimisations, but huge 33% speedup as the payoff. Big release coming 21st March 👀
Tweet media one
Tweet media two
3
5
53
@fleetwood___
Fleetwood
11 months
Very excited for whisper-turbo 0.9.0. - Finally match OAI logit for logit. Essential for speculative decoding 👀 - Single function to call now - `transcribe()` - Massive CI overhaul + end to end tests on Windows, MacOS & Linux. Real time on the hit list next 🎯
4
3
53
@fleetwood___
Fleetwood
1 year
🚀 Introducing Laserbeak: An NPM library to integrate LLMs into your Web/Electron apps instantly! Powered by my Rust + WebGPU ML runtime, you can now run 250M+ param models in the browser! Demo: Repo: #Rust #WebGPU #ML
5
5
50
@fleetwood___
Fleetwood
5 months
Ported the MLX GEMV to WebGPU - 1.6x speedup 🚀 Looking forward to integrating it into Ratchet 🔧
Tweet media one
2
5
48
@fleetwood___
Fleetwood
11 months
🚨 Whisper-Turbo 0.10.0 is out 🚨 Huge release with tons of features: - Full multilingual support, check out the translation of Bong Joon-ho discussing directing Parasite 🇰🇷 - 100% token for token match with OAI 🟰 - Brand new docs site 📃
2
10
47
@fleetwood___
Fleetwood
3 months
The CoreML release for iOS18 & MacOS Sequoia was all about LLMs. Stateful Models + Multifunction are pretty exciting. Check out Mistral 7B running in 4bit at 35tok/s on my machine!
3
3
44
@fleetwood___
Fleetwood
9 months
Shipped Quantised matmuls & inplace operations today: Fully OSS 3B+ models coming to a browser near you soon:
Tweet media one
1
8
42
@fleetwood___
Fleetwood
28 days
@danielhanchen @ZeyuanAllenZhu Same claim in the MobileLLM paper from @AIatMeta
Tweet media one
3
2
50
@fleetwood___
Fleetwood
3 months
Very cool paper from Meta:
0
2
40
@fleetwood___
Fleetwood
6 months
A week of absolute struggle but Phi2 officially runs on Ratchet 🎺 Pretty sluggish right now 🐌 but lots of optimisation to come.
4
6
38
@fleetwood___
Fleetwood
4 months
Ratchet just works in the latest version of Safari Technology Preview - runs distil-whisper-large with ease 🏎️ End of this year all major browsers will have support 📅
2
2
37
@fleetwood___
Fleetwood
1 year
Shipped 0.7.0 of Whisper Turbo. Ridiculously fast, cross-platform - and no bugs! 🐞 Once we have real-time streaming from the 🎙️ people will start building amazing stuff on top of it.
1
2
35
@fleetwood___
Fleetwood
4 months
Quantized MoonDream comes to Ratchet 🔥 Big thanks to Tim G for shipping this!
4
4
34
@fleetwood___
Fleetwood
1 year
@simonw Haven't quite shipped diarization yet, but I can offer 0 install and the best GUI: the browser
3
2
33
@fleetwood___
Fleetwood
6 months
Forward pass 84ms -> 58ms ⏰ Serious performance boost from my custom GEMV kernel. Looking forward to packaging this up nicely and getting it out 🥳
2
1
34
@fleetwood___
Fleetwood
1 year
Exciting news! The new LaMini 🦙 models by @AlhamFikri & co are seriously impressive - outperforming LLaMA 7B 📈 I've added support so it runs entirely in-browser! Test it out on the playground 🛝 now and share your thoughts: #LLM #Rust #WebGPU #AI
5
4
33
@fleetwood___
Fleetwood
5 months
Me in 2022 Google in 2024 🐌
@FanaHOVA
Alessio Fanelli
5 months
WebAssembly is aliveeeee 🔥 Gemini Nano will be included in Chrome itself with WebGPU + Wasm.
Tweet media one
5
38
217
1
2
33
@fleetwood___
Fleetwood
5 months
Spent the past 2 days hacking away at F16 support - the test shader now runs! This means f16 will land in 🔥🦊 & will allow us to accelerate Ratchet. Good for the Rust 🤝 ML ecosystem!
Tweet media one
2
1
31
@fleetwood___
Fleetwood
1 year
200ms to go to match GGML Decoder! 🔥 Encoder is now ~2.7x faster 🏎️ Memory management is mostly solved, just KV caching the self-attention and some shader optimizations left!
Tweet media one
3
1
28
@fleetwood___
Fleetwood
3 months
Switched over from Iterm2 -> WezTerm today. Took 5 minutes, performance is much smoother. Performance is a feature! Thanks @wezfurlong
1
1
27
@fleetwood___
Fleetwood
7 months
Ratchet is now officially under the HF namespace! Looking forward to the next stage for Ratchet 🤗
0
1
29
@fleetwood___
Fleetwood
6 months
After a single day of optimising, Phi2 already looks pretty usable 🏎️ Just a few custom GEMV kernels away from another big boost.
0
1
28
@fleetwood___
Fleetwood
3 months
@austinvhuang The world needs more introductory work on ML compilers! I like these from @chipro & @petewarden :
2
3
27
@fleetwood___
Fleetwood
1 month
SAM 2 🤝 CoreML Check out Segment Anything 2 from @AIatMeta running on the Neural Engine via SAM 2 Studio - our native MacOS app to make segmenting images easy! Really enjoyed building this with @pcuenq & @cyrilzakka ⚒️
1
1
26
@fleetwood___
Fleetwood
1 year
Turbo intends to be a drop-in replacement for the OpenAI API. No more monthly costs, just a single `npm install whisper-turbo` 💻
1
0
24
@fleetwood___
Fleetwood
9 months
Very grateful to a few folks: - @xenovacom for working with me throughout the process. - Mathieu Poumeyrol who built Tract which set me on this road. - Connor Fitzgerald for his patience schooling me on GPUs & wgpu. - EleutherAI off-topic, you guys are awesome!
1
0
21
@fleetwood___
Fleetwood
1 year
Given Whisper's small size, even large-V2 can be run on consumer hardware faster than real time ⚡️ With a bunch of added benefits like: 1. Real-time streaming 2. ~Zero latency 3. 100% private
1
0
22
@fleetwood___
Fleetwood
1 year
Took a few days but I've now implemented quantised models in the browser! ~3x memory reduction 📉 and near instant loading times ⏱️ Thanks to @CarsonPoole for the pointers! Try out the models:
2
0
22
@fleetwood___
Fleetwood
8 months
10 days apart... Optimisation will never get old 🏎️
@fleetwood___
Fleetwood
8 months
My Rust + WebGPU ML framework is now fully OSS and capable of running Whisper (very slowly for now 🐌) Tons of low hanging fruit, and more models to be implemented! Start contributing:
1
10
79
3
1
23
@fleetwood___
Fleetwood
7 months
Any sufficiently advanced ML framework is indistinguishable from a compiler.
2
1
23
@fleetwood___
Fleetwood
4 months
3 weeks to move Ratchet from AOT -> JIT. Now the real exciting stuff starts.
Tweet media one
2
2
22
@fleetwood___
Fleetwood
1 year
Decoder is 2x faster than before, but still 20x slower than GGML. Encoder is 2x faster than GGML. Decoder is slow for 2 reasons: 1. No KV caching. 2. Memory management. Once buffer reuse ships I expect a 10x speedup 🏎️
Tweet media one
2
0
22
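The KV-caching point above can be sketched in a few lines; all names and shapes here are illustrative, not Ratchet's actual code:

```python
import numpy as np

# Why KV caching matters for decoding: without a cache every step recomputes
# keys/values for the whole prefix; with one, each step projects only the
# newest token and appends it. Dimensions and weights are illustrative.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new):
    k_cache.append(x_new @ Wk)            # O(1) projection work per step
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (x_new @ Wq) @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max())  # stable softmax over cached keys
    attn /= attn.sum()
    return attn @ V                       # attend over all cached positions

out = None
for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```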
@fleetwood___
Fleetwood
3 months
@RoyShilkrot Worst of both worlds! This is why we are seeing both Google & Apple converging to ~same size on device LLM (3.25B + adapters)
3
0
20
@fleetwood___
Fleetwood
3 months
Blowing my mind that the Meta Smart glasses are run entirely off a 154mAh battery. Literally 1.5 OOM less than your phone.
Tweet media one
Tweet media two
1
1
19
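The "1.5 OOM" claim roughly checks out; the ~4,500 mAh phone capacity below is my assumption, not a figure from the tweet:

```python
import math

# Orders of magnitude between the glasses' 154 mAh cell and an assumed
# ~4,500 mAh phone battery (the phone capacity is an assumption).
glasses_mah = 154
phone_mah = 4_500
oom = math.log10(phone_mah / glasses_mah)
print(f"{oom:.2f} orders of magnitude")  # ~1.47, i.e. roughly 1.5 OOM
```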
@fleetwood___
Fleetwood
4 months
Only @Apple could ship on-device ML to this standard. Impossible without complete vertical control.
1
4
18
@fleetwood___
Fleetwood
1 year
Next release of whisper-turbo takes it from a buggy toy to something pretty useful. - No more memory leaks. - 2x+ file size reduction. - And as always, speed improvements. Super excited for people to start building on top of it!
3
0
19
@fleetwood___
Fleetwood
5 months
This comment on HN stood out to me. NNs seem to be trending towards dynamism: Static Shapes (ResNet etc) -> Dynamic Shapes (Transformers seq_len) -> Mixture of Experts -> Mixture of Depths -> ???
Tweet media one
1
1
19
@fleetwood___
Fleetwood
9 months
Whisper Turbo on Android 🔥 No one is ready for when WebGPU lands on iOS (already available in preview).
@AdjectiveAlli
Allison
9 months
Whisper turbo transcribes on mobile android with webGPU pretty quickly. Watch out for the LLMs coming soon.
Tweet media one
2
4
31
0
0
18
@fleetwood___
Fleetwood
1 year
WebGPU Whisper decoder is officially online 🗣️ I think I've left about 10x perf on the table. "Make It Work, Make It Right, Make It Fast"
Tweet media one
2
0
18
@fleetwood___
Fleetwood
3 months
This is pretty interesting, because even the iPhone 6S shipped with a 703 Wh/L battery, which is still pretty much SOTA for lithium ion (~800 Wh/L), despite being released 8 years ago.
0
0
17
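Those two density figures make the stagnation concrete:

```python
# ~703 Wh/L (iPhone 6S era) to ~800 Wh/L today: total and annualized gain
# over the ~8 intervening years, using only the figures from the tweet.
gain = 800 / 703
total_pct = (gain - 1) * 100               # ~13.8% total improvement
annual_pct = (gain ** (1 / 8) - 1) * 100   # under 2% per year, compounded
print(f"{total_pct:.1f}% total, {annual_pct:.2f}%/yr")
```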
@fleetwood___
Fleetwood
1 year
Before and after 🔥 Should see an immediate 2x speedup once this is integrated into whisper-turbo. Follow along here
Tweet media one
0
0
18
@fleetwood___
Fleetwood
1 year
Pretty cool how easy it is to export a Next site -> @huggingface space. Check out whisper-turbo on HF - now with mic support! I'll be shipping Distil-Whisper by @sanchitgandhi99 to the space next week!
1
2
17
@fleetwood___
Fleetwood
5 months
Met more intelligent & ambitious people in 1 week in SF than 2 years in London 🇺🇸🦅 Grateful for everyone out here @viccpoes @pebble_bed 🤗
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
0
16
@fleetwood___
Fleetwood
1 year
Google is pushing for DP4A (INT8 dot product of 4 elements & accumulate) to land soon on WebGPU. This has the sole purpose of seriously speeding up quantized neural nets. Wonder what they're working on? 👀
0
0
17
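In scalar form, the operation DP4A performs is just this (a sketch of its semantics, not WebGPU/WGSL syntax):

```python
# DP4A semantics: dot product of four 8-bit lanes plus a 32-bit accumulator.
# This single instruction is the inner loop of an int8-quantized matmul.
def dp4a(a4, b4, acc):
    assert len(a4) == len(b4) == 4
    return acc + sum(int(x) * int(y) for x, y in zip(a4, b4))

result = dp4a([1, -2, 3, 4], [5, 6, -7, 8], 10)
print(result)  # 5 - 12 - 21 + 32 + 10 = 14
```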
@fleetwood___
Fleetwood
3 months
🚨 New blog post 🚨 A first principles analysis of smartglasses:
2
2
16
@fleetwood___
Fleetwood
3 months
Imagine a robust, small box with a low wattage NPU. Fit it with a solar panel and pack an LLM in there and you have this little monument to humans that could outlast us.
Tweet media one
3
1
16
@fleetwood___
Fleetwood
5 months
Agree 100% - check out this graph I made in 2022 looking at Apple mobile GPUs! Never fight a graph that looks like this.
Tweet media one
@zachnagengast
Zach Nagengast
5 months
As chip development progresses into faster, more efficient NPU architectures, people are starting to notice the massive amount of untapped compute sitting right there on their devices. We're building the tools to make that compute accessible to everyone 💪
0
3
28
1
2
16
@fleetwood___
Fleetwood
1 year
Not everything needs to be corporate - having a ton of fun with the Whisper Turbo demo. Expect to see real-time Whisper Small+ in the browser next week 👀
Tweet media one
2
0
15
@fleetwood___
Fleetwood
1 year
Cross-attention caching is live 🔥 Finally below 300ms! I think beating 36ms will be hard, as CPU -> GPU communication will be a larger % of runtime than for larger models. After I ship self-attention KV caching I'll move on to shipping it 🚢
Tweet media one
1
0
15
@fleetwood___
Fleetwood
1 year
Whisper encoder officially runs on my framework 🗣️ Already outperforms GGML CPU on tiny with unoptimised kernels.
Tweet media one
3
1
15
@fleetwood___
Fleetwood
1 year
Decoder 2x faster again today. Encoder is now 2.5x faster than GGML. Half of memory management solved - still a way to go to beat 36ms.
Tweet media one
1
0
15
@fleetwood___
Fleetwood
4 months
@karpathy @natfriedman Check out Bill Gates timesharing in '68! Same transition to the edge will happen.
Tweet media one
0
1
15
@fleetwood___
Fleetwood
10 months
Put together a simple animation demonstrating GPU reductions for my upcoming blog post. You can grok this in 2 seconds compared to a wall of text.
1
0
14
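For reference, the pattern such an animation illustrates is the pairwise tree reduction: n values combined in log2(n) parallel steps instead of n sequential ones. A minimal sketch:

```python
# Pairwise tree reduction: each pass halves the number of active values,
# mirroring a GPU running one thread per pair of elements.
def tree_reduce(values):
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:          # pad odd lengths with the identity (0)
            vals.append(0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

print(tree_reduce(range(8)))  # 28, reached in log2(8) = 3 passes
```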
@fleetwood___
Fleetwood
4 months
60x in 6 years...
Tweet media one
2
5
14
@fleetwood___
Fleetwood
11 months
whisper-turbo 0.10.0 comes out next week with a metric ton of improvements. Best of all - the new docs site (WIP): Slowly slowly approaching utility 🛠️
1
1
14
@fleetwood___
Fleetwood
10 months
Wrote a sweet kernel bencher for WebGPU and spent the last few days optimizing a LayerNorm kernel. Blog post explaining the process coming soon 📖
Tweet media one
1
0
14
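For context, this is the reference computation any optimized LayerNorm kernel has to reproduce (a plain NumPy sketch, not the WebGPU kernel itself):

```python
import numpy as np

# Reference LayerNorm: per-row mean and variance, normalize, scale, shift.
# An optimized kernel mainly reorganizes these two reductions for the GPU.
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.default_rng(0).standard_normal((4, 16)).astype(np.float32)
y = layer_norm(x, np.ones(16, np.float32), np.zeros(16, np.float32))
```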
@fleetwood___
Fleetwood
1 year
🏁 Who would win? 🏁 @ggerganov's Whisper.cpp running at 4x speed OR @fleetwood___ Whisper Turbo in real time Let's find out!
3
0
14
@fleetwood___
Fleetwood
3 months
Thanks to @HugoDuprez for creating the app. Check out the repo:
0
1
14
@fleetwood___
Fleetwood
4 months
deCoreML gives you insights into why a backend was selected for a particular operation! Try it out now: pip install decoreml Check out the repo:
2
2
13
@fleetwood___
Fleetwood
6 months
Every engineer should be forced to write a vectorised GEMM in a shitty templating language... for character building.
Tweet media one
3
0
14
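The starting point for that exercise is the plain triple-loop GEMM; everything painful about vectorising it in a shader templating language is rearranging this loop nest:

```python
# Naive GEMM, C = A @ B: the scalar baseline a vectorised shader restructures.
# The p-before-j loop order reuses A[i][p] across a whole row of B.
def gemm(A, B):
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C

print(gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```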
@fleetwood___
Fleetwood
11 months
Building ratchet was super fun, and I'm very proud of what I've achieved. However, working completely solo for 6 months can stunt your creativity & certainly reduces the scope of what you can build. Long story short - I'm on the job market!
2
3
13
@fleetwood___
Fleetwood
8 months
Ratchet alpha demo releases this week.
Tweet media one
0
0
13
@fleetwood___
Fleetwood
9 months
Biggest lesson from 2023 It's agency, not intellect, that's in short supply.
0
1
13
@fleetwood___
Fleetwood
3 months
Dropping this next week 📝
Tweet media one
Tweet media two
1
1
13
@fleetwood___
Fleetwood
1 year
Shipped F16 to whisper-turbo 🏎️ 237ms -> 179ms ⬇️ I'm stopping here for a few reasons: 1. Tiny + short sample is the WORST case for GPU. 2. For real-time, encode speed >> decode speed. 3. It is already the fastest Whisper in the browser.
Tweet media one
1
0
12
@fleetwood___
Fleetwood
7 months
Wrote a fun little CLI for Ratchet to make experimenting with local models easy! Looking forward to supporting more models soon 🔜
1
1
12
@fleetwood___
Fleetwood
1 year
Plateaued while I implement some deep structural changes required for KV caching. Once KV caching lands, should hit below 150ms 🔥
Tweet media one
0
0
13
@fleetwood___
Fleetwood
1 year
🚨 whisper-turbo 0.5.3 🚨 🪟 Full Windows support 🎼 Extra long samples 🐞 Huge amount of bug fixes We should be the fastest Whisper implementation on Windows now 🔥 Check it out: (And we secured the .com 🔐)
2
0
12
@fleetwood___
Fleetwood
11 months
Now that whisper-turbo is stable I'm excited to move on to other models. Ratchet intends to explore what cross-platform ML looks like, with a focus on DX. I'm looking for collaborators - my subpar GPU programming skills are leaving minimum 5x speed on the table.
1
0
13
@fleetwood___
Fleetwood
1 year
Just released whisper-turbo v0.3.0 featuring: 🎵 Audio transcoding to enable transcription of mp3, mp4, m4a, wav and aac files. 🎼 Support for samples longer than 30 seconds (WIP) ⭐️ the repo to follow along as we approach beta
1
2
12
@fleetwood___
Fleetwood
5 months
Cooked up a proc macro that allows you to embed WGSL in your Rust! Allows passing values from Rust into the WGSL source, makes constructing kernels much easier 🔥
Tweet media one
1
2
12
@fleetwood___
Fleetwood
6 months
Humbling when you realise you've lost your grasp on a foundational piece of knowledge. Worst thing to do in that situation is to avoid it. Rotary Embeddings made me work my way back up through complex numbers, rotations & sinusoidal embeddings. @bojone1993 is clearly a genius.
0
0
11
@fleetwood___
Fleetwood
1 year
Working on something pretty cool. GGML loader already works. Compute graph is well defined. Only thing left is a mini symbolic algebra library and some fast Q4 shaders.
Tweet media one
Tweet media two
1
0
11
@fleetwood___
Fleetwood
10 months
New blog post showing how to iteratively arrive at the "fastest" LayerNorm kernel!
0
3
11