I'm taking a sabbatical through the end of the year. First time to relax since co-founding OpenAI 9 years ago. The mission is far from complete; we still have a safe AGI to build.
A 7B parameter LLM consumes ~0.7 J/token. A fully charged iPhone, with ~50 kJ of energy, can sustain this LLM in conversation for <2 hours at a rate of 10 tok/s, with every 64 tokens draining ~0.1% of the battery.
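The arithmetic above is easy to sanity-check. A quick sketch, using only the figures from the text (0.7 J/token, 50 kJ battery, 10 tok/s):

```rust
// Back-of-envelope check of the iPhone LLM energy figures.
fn power_watts(joules_per_token: f64, tokens_per_second: f64) -> f64 {
    // 0.7 J/token * 10 tok/s = 7 W of draw.
    joules_per_token * tokens_per_second
}

fn runtime_hours(battery_joules: f64, watts: f64) -> f64 {
    // 50 kJ / 7 W ≈ 7143 s ≈ 1.98 h, i.e. just under 2 hours.
    battery_joules / watts / 3600.0
}

fn battery_pct_per_n_tokens(n: f64, joules_per_token: f64, battery_joules: f64) -> f64 {
    // 64 tokens * 0.7 J = 44.8 J ≈ 0.09% of a 50 kJ battery.
    n * joules_per_token / battery_joules * 100.0
}

fn main() {
    let watts = power_watts(0.7, 10.0);
    println!("draw: {watts} W");
    println!("runtime: {:.2} h", runtime_hours(50_000.0, watts));
    println!(
        "64 tokens: {:.3}% of battery",
        battery_pct_per_n_tokens(64.0, 0.7, 50_000.0)
    );
}
```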
🏎️ Introducing Whisper Turbo 🏎️
Engineered from scratch in Rust + WebGPU. Transcribe 20x faster than realtime - all in the browser!
Here's why it's a game-changer:
The web is the best platform for building fun tools.
- FFMPEG compiled to WASM can convert ~any input format to WAV for Whisper.
- Ratchet runs super fast Whisper entirely on the GPU to generate the transcript.
- Runs on any hardware, offline, no install.
🚨 whisper-turbo v0.8 release 🚨
What's new?
All models have slimmed down 🏋️
Tiny: 150M -> 51M
Base: 280M -> 97M
Small: 650M -> 310M
Medium: 1.75G -> 970M
Memory leaks == history 🪣
You can now transcribe *very* long samples.
Check it out on GH:
Excited to announce that I have joined Hugging Face 🤗
Looking forward to bringing some of the work I've done over the past 6 months to many more devices!
🚨 Phi-3 running in the browser 🚨
Hits about 20 tok/s 🏎️ Literally 3 lines of JS.
Still some kinks to iron out, coming to Ratchet 0.4.0 soon.
EMBD is the fastest embedding library available.
EMBD: 328.44 ms 🥇
ONNX: 553.31 ms 🥈
GGML+CPU: 812.03 ms 🥉
GGML+GPU: 985.82 ms
Candle: 1044.90 ms
One guy vs OSS community vs MSFT
Come and work with me!
We are hiring for an Apple-focused on-device ML engineer.
If you're a strong Swift developer & have worked with CoreML, MLX or Metal we want to hear from you!
Chrome 113 is here - finally ML models on WebGPU 🎉
To celebrate we've shipped:
- FLAN-Alpaca by @soujanyaporia
- Generation modifiers
- A brand new look
Try it now:
Or build it into your web app instantly:
#WebGPU
#LLM
#Rust
🚨 Ratchet reaches alpha! 🚨
With today's release of Distil Whisper Large V3 by @sanchitgandhi99, Ratchet officially enters alpha.
Check out this demo running large-v3 (!!) in the browser
My Rust + WebGPU ML framework is now fully OSS and capable of running Whisper (very slowly for now 🐌)
Tons of low hanging fruit, and more models to be implemented!
Start contributing:
Say goodbye to the GPT-4 loading wheel.
I've built a lightning-fast document editor with a local LLM in the browser using a Rust + WebGPU ML runtime.
Check out my upcoming blog post for all the details.
#LLM
#rustlang
#WebGPU
Just published the very alpha version of EMBD
Supports huge batches of text, pretty much instant on the GPU.
Perfect for RAG or search.
Benchmarks + polished beta coming ~next week.
Very excited for whisper-turbo 0.9.0.
- Finally match OAI logit for logit. Essential for speculative decoding.
- A single function to call now - `transcribe()`
- Massive CI overhaul + end-to-end tests on Windows, macOS & Linux.
Real time is on the hit list next 🎯
Introducing Laserbeak: an NPM library to integrate LLMs into your Web/Electron apps instantly!
Powered by my Rust + WebGPU ML runtime, you can now run 250M+ param models in the browser!
Demo:
Repo:
#Rust
#WebGPU
#ML
🚨 Whisper-Turbo 0.10.0 is out 🚨
Huge release with tons of features:
- Full multilingual support - check out the translation of Bong Joon-ho discussing directing Parasite 🇰🇷
- 100% token-for-token match with OAI
- Brand new docs site
The CoreML release for iOS 18 & macOS Sequoia was all about LLMs.
Stateful Models + Multifunction are pretty exciting.
Check out Mistral 7B running in 4-bit at 35 tok/s on my machine!
Ratchet just works in the latest version of Safari Technology Preview - runs distil-whisper-large with ease 🏎️
By the end of this year, all major browsers will have support.
Shipped 0.7.0 of Whisper Turbo.
Ridiculously fast, cross-platform - and no bugs!
Once we have real-time streaming from the 🎙️, people will start building amazing stuff on top of it.
Forward pass 84ms -> 58ms ⏰
Serious performance boost from my custom GEMV kernel.
Looking forward to packaging this up nicely and getting it out 🥳
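For context, GEMV (matrix-vector multiply) dominates single-token decoding, which is why a custom kernel pays off here. A scalar Rust sketch of the operation such a kernel parallelizes - illustrative only, not the actual WebGPU implementation:

```rust
/// y = A * x for a row-major (m x n) matrix A. On the GPU this is the op a
/// GEMV kernel computes, typically with one workgroup per output row.
fn gemv(a: &[f32], x: &[f32], m: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * n);
    assert_eq!(x.len(), n);
    (0..m)
        .map(|row| {
            // Dot product of one matrix row with the input vector.
            a[row * n..(row + 1) * n]
                .iter()
                .zip(x)
                .map(|(w, v)| w * v)
                .sum::<f32>()
        })
        .collect()
}

fn main() {
    // 2x3 matrix times a length-3 vector.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0, 0.0, -1.0];
    println!("{:?}", gemv(&a, &x, 2, 3)); // [-2.0, -2.0]
}
```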
Exciting news! The new LaMini 🦙 models by @AlhamFikri & co are seriously impressive - outperforming LLaMA 7B.
I've added support so they run entirely in-browser!
Test them out on the playground now and share your thoughts:
#LLM
#Rust
#WebGPU
#AI
Spent the past 2 days hacking away at F16 support - the test shader now runs!
This means f16 will land in 🔥🦀 & will allow us to accelerate Ratchet.
Good for the Rust 🤝 ML ecosystem!
200ms to go to match the GGML decoder! 🔥
Encoder is now ~2.7x faster 🏎️
Memory management is mostly solved, just KV caching the self-attention and some shader optimizations left!
SAM 2 🤝 CoreML
Check out Segment Anything 2 from @AIatMeta running on the Neural Engine via SAM 2 Studio - our native macOS app to make segmenting images easy!
Really enjoyed building this with @pcuenq & @cyrilzakka ✌️
Very grateful to a few folks:
- @xenovacom for working with me throughout the process.
- Mathieu Poumeyrol who built Tract which set me on this road.
- Connor Fitzgerald for his patience schooling me on GPUs & wgpu.
- EleutherAI off-topic, you guys are awesome!
Given Whisper's small size, even large-v2 can be run on consumer hardware faster than real time ⚡️
With a bunch of added benefits like:
1. Real-time streaming
2. ~Zero latency
3. 100% private
Took a few days, but I've now implemented quantised models in the browser!
~3x memory reduction and near-instant loading times ⏱️
Thanks to @CarsonPoole for the pointers!
Try out the models:
Decoder is 2x faster than before, but still 20x slower than GGML.
Encoder is 2x faster than GGML.
Decoder is slow for 2 reasons:
1. No KV caching.
2. Memory management.
Once buffer reuse ships I expect a 10x speedup 🏎️
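The KV-caching fix in reason 1 boils down to this: instead of recomputing every position's keys and values on each decode step, append one step's worth and reuse the rest. A minimal sketch of the idea (hypothetical shapes, not Ratchet's actual API):

```rust
/// Minimal per-layer KV cache: each decode step appends one key vector and
/// one value vector rather than recomputing attention inputs for the whole
/// prefix - turning O(seq_len) work per step into O(1) appends.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per generated position
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Called once per decode step with the current token's K/V projections.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn seq_len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        // In a real decoder these come from the model's K/V projections.
        cache.append(vec![step as f32; 4], vec![step as f32; 4]);
    }
    println!("cached positions: {}", cache.seq_len()); // 3
}
```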
@Vjeux Whisper Turbo is the fastest Whisper implementation in the browser:
Real time is still on the roadmap; it's quite a lot of development work to do it right.
Next release of whisper-turbo takes it from a buggy toy to something pretty useful.
- No more memory leaks.
- 2x+ file size reduction.
- And as always, speed improvements.
Super excited for people to start building on top of it!
This comment on HN stood out to me.
NNs seem to be trending towards dynamism:
Static Shapes (ResNet etc) -> Dynamic Shapes (Transformers seq_len) -> Mixture of Experts -> Mixture of Depths -> ???
This is pretty interesting, because even the iPhone 6S shipped with a 703 Wh/L battery:
That's still pretty much SOTA for lithium-ion (~800 Wh/L), despite the phone being released 8 years ago.
Pretty cool how easy it is to export a Next site -> @huggingface space.
Check out whisper-turbo on HF - now with mic support!
I'll be shipping Distil-Whisper by @sanchitgandhi99 to the space next week!
Google is pushing for DP4A (INT8 dot product of 4 elements & accumulate) to land soon in WebGPU.
This has the sole purpose of seriously speeding up quantized neural nets.
Wonder what they're working on?
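The instruction's semantics are simple to state: unpack four signed 8-bit lanes from each 32-bit word, dot them, and add to a 32-bit accumulator. A scalar Rust model of what the hardware does in a single instruction:

```rust
/// Scalar model of DP4A: four i8 lanes per 32-bit word, dotted and added to
/// an i32 accumulator - one instruction on supporting GPUs.
fn dp4a(a: u32, b: u32, acc: i32) -> i32 {
    let mut sum = acc;
    for lane in 0..4 {
        // Extract the lane byte and sign-extend i8 -> i32.
        let x = (a >> (8 * lane)) as u8 as i8 as i32;
        let y = (b >> (8 * lane)) as u8 as i8 as i32;
        sum += x * y;
    }
    sum
}

fn main() {
    // Pack [1, 2, 3, 4] and [5, 6, 7, 8] into little-endian 32-bit words.
    let a = u32::from_le_bytes([1, 2, 3, 4]);
    let b = u32::from_le_bytes([5, 6, 7, 8]);
    // 1*5 + 2*6 + 3*7 + 4*8 = 70
    println!("{}", dp4a(a, b, 0)); // 70
}
```

For quantized nets this means four INT8 weight-activation products per instruction instead of four separate widen-multiply-adds, which is exactly where the speedup comes from.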
Imagine a robust, small box with a low wattage NPU.
Fit it with a solar panel and pack an LLM in there and you have this little monument to humans that could outlast us.
As chip development progresses toward faster, more efficient NPU architectures, people are starting to notice the massive amount of untapped compute sitting right there in their devices. We're building the tools to make that compute accessible to everyone.
Not everything needs to be corporate - having a ton of fun with the Whisper Turbo demo.
Expect to see real-time Whisper Small+ in the browser next week
Cross-attention caching is live 🔥
Finally below 300ms! I think beating 36ms will be hard, as CPU -> GPU communication will be a larger % of runtime than for larger models.
After I ship self-attention KV caching I'll move on to shipping it 🚢
whisper-turbo 0.10.0 comes out next week with a metric ton of improvements.
Best of all - the new docs site (WIP):
Slowly slowly approaching utility 🛠️
Building ratchet was super fun, and I'm very proud of what I've achieved.
However, working completely solo for 6 months can stunt your creativity & certainly reduces the scope of what you can build.
Long story short - I'm on the job market!
Shipped F16 to whisper-turbo 🏎️
237ms -> 179ms ⬇️
I'm stopping here for a few reasons:
1. Tiny + short sample is the WORST case for GPU.
2. For real-time, encode speed >> decode speed.
3. It is already the fastest Whisper in the browser.
🚨 whisper-turbo 0.5.3 🚨
🪟 Full Windows support
🎼 Extra long samples
🐛 Huge amount of bug fixes
We should be the fastest Whisper implementation on Windows now 🔥
Check it out:
(And we secured the .com)
Now that whisper-turbo is stable I'm excited to move on to other models.
Ratchet intends to explore what cross-platform ML looks like, with a focus on DX.
I'm looking for collaborators - my subpar GPU programming skills are leaving at least a 5x speedup on the table.
Just released whisper-turbo v0.3.0 featuring:
🎵 Audio transcoding to enable transcription of mp3, mp4, m4a, wav and aac files.
🎼 Support for samples longer than 30 seconds (WIP)
⭐️ the repo to follow along as we approach beta
Cooked up a proc macro that lets you embed WGSL in your Rust!
It allows passing values from Rust into the WGSL source, which makes constructing kernels much easier 🔥
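A rough illustration of what such a macro buys you - Rust values spliced into shader source. All names here are hypothetical, and the real proc macro does the splicing at compile time; this sketch emulates the effect with a runtime `format!` just to show the shape of the generated kernel:

```rust
/// Sketch: generate WGSL source with Rust values (workgroup size, vector
/// width) spliced in. A proc macro would do this at compile time and could
/// also validate the shader; format! is just the simplest way to show it.
fn kernel_source(workgroup_size: u32, vec_width: u32) -> String {
    format!(
        r#"
@group(0) @binding(0) var<storage, read_write> x: array<vec{vec_width}<f32>>;

@compute @workgroup_size({workgroup_size})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {{
    // kernel body elided
}}
"#
    )
}

fn main() {
    // Specialize one kernel template for two different configurations.
    let src = kernel_source(256, 4);
    assert!(src.contains("@workgroup_size(256)"));
    assert!(src.contains("array<vec4<f32>>"));
    println!("{src}");
}
```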
Humbling when you realise you've lost your grasp on a foundational piece of knowledge. Worst thing to do in that situation is to avoid it.
Rotary Embeddings made me work my way back up through complex numbers, rotations & sinusoidal embeddings.
@bojone1993 is clearly a genius.
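The chain above (complex numbers -> rotations -> sinusoidal embeddings) lands at a compact formula: RoPE rotates each (even, odd) feature pair by an angle theta_i = pos * base^(-2i/d), i.e. multiplies it by e^(i*theta) as a complex number. A minimal sketch:

```rust
/// Rotary position embedding: rotate each feature pair (x[2i], x[2i+1]) by
/// theta_i = pos * base^(-2i/d). Each pair is a 2-D rotation, equivalent to
/// multiplying the pair, viewed as a complex number, by e^(i * theta).
fn rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0);
    for i in 0..d / 2 {
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        // Standard 2-D rotation of the pair by theta.
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut x = vec![1.0, 0.0, 1.0, 0.0];
    rope(&mut x, 1, 10000.0);
    // Pair 0 rotates by 1 rad; pair 1 by 10000^(-1/2) = 0.01 rad.
    println!("{x:?}");
}
```

Because each pair is only rotated, norms are preserved and the dot product between two rotated vectors depends only on their relative position, which is the property that makes RoPE work in attention.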
Working on something pretty cool.
GGML loader already works.
Compute graph is well defined.
Only thing left is a mini symbolic algebra library and some fast Q4 shaders.