mike64_t @mike64_t Twitter profile

Pinned Tweet

mike64_t

2 months

It is done. LibreCUDA can now launch CUDA kernels without relying on the proprietary CUDA runtime / driver api. It does so by communicating directly with the hardware via the ioctl "rm api" and Nvidia's QMD MMIO command queue structure.

9

41

288

Last Seen Profiles

@NL_Resist_Trump

@jluka9

@FoxToons99

@_agustlopez

@HochBranden

@Abdelcards

@RSF25Xhzc4L7S

@ZertoxFTN

@Minallisn

@sumichanmeripo

@grofofsremtrope

@Ward3Matthew

@TaehyungPeru

@Colette3357

@congthoaix

@ChristianvN

@night__yoru

@llotuszz

@AvigailBur53895

@stwmaniax

@ibubohay2

@JasminH4950

@NearthiuKFt

@Yowsa987

@winwinforlyfe

@animalworld995

@RaiPratishtha

@PettyBJustice

@orangana27014

@hyo_phill

@pjmattson

@collins_ma30323

@tatatatc19

@Johannes_PhD

@PT_Muhannad

@reffino_lu22138

mike64_t

@mike64_t

5 months

It always amazes me how this is not the first thing taught in a CS class. It always makes people have Eureka moments whenever I show this to people...

122

190

3K

mike64_t

@mike64_t

2 years

I can't explain how amazing @karpathy 's lectures are. Andrej's lectures are detailed enough that I could not only follow along, but write my own tensor processing + autograd engine in Java+Kotlin & C++ from scratch. And best of all, it's 2x faster than PyTorch! SIMD for the win!

28

138

2K

mike64_t

@mike64_t

11 months

Me at 19 [right now]

IamIronLAN

@IamIronLAN

11 months

me at 19

9

0

82

17

12

526

mike64_t

@mike64_t

10 months

So, neural networks are REALLY robust to errors, as it turns out. Like, I just discovered a bug in my kernel caching logic, where I was computing matmul with completely wrong strides, and it STILL LEARNED. Like... WHAT WAS COMPUTED HERE IS NOT A DERIVATIVE BY ANY MEANS

15

13

415

mike64_t

@mike64_t

2 years

Wake up babe, new lecture from @karpathy just dropped

8

24

243

mike64_t

@mike64_t

4 months

Tangentially related to Leopold Aschenbrenner's episode on Dwarkesh

4

12

152

mike64_t

@mike64_t

11 months

So, it turns out on a 4090, when you do a float32 matmul with tfloat32 compute, you get ~85TFLOPs. HOWEVER, when you build a custom kernel that loads operands in float16, performs the compute in tfloat32 and downcasts it again, you get 173 TFLOPs, which is only slightly less than

4

150

mike64_t

@mike64_t

20 days

You might be leaving performance on the table by not forcing your GPU fan speed to 100% Here's a simple fix:

3

4

132

mike64_t

@mike64_t

1 month

LLVM is really powerful. 250 lines to create a basic but quite functional replacement for NVCC. Most work is replacing the CUDA stdlib, which also isn't that hard. NVPTX generates similar code to NVCC if you force it to, but LLVM will hapily reject optimizations such as loop

2

14

125

mike64_t

@mike64_t

5 months

@fastnfair @ethananam It's not about assembly it's about what a computer actually does. A toy ISA you made up on the spot I would argue is also easier to write than random CISC ISA assembly™

3

0

99

mike64_t

@mike64_t

2 months

@xlr8harder tinygrad code is EXTREMELY efficient to the point that a mortal reader doesn't even grasp a slither of its genius. Eg. qmd_struct_t in runtime/nvops.py Just a normal type, right? WRONG its a type with dynamic attributes generated to set bit-field regions in a logical

5

2

98

mike64_t

@mike64_t

28 days

tritonc - your standalone, python-free, command line Triton compiler (what have I done...)

11

10

96

mike64_t

@mike64_t

5 months

Mom can we get scaling laws? We have scaling laws at home. Scaling laws at home:

4

7

91

mike64_t

@mike64_t

10 months

Makes me wonder what kind of tricks you can do in the name of TFLOP throughput, increasingly becoming more mathematically questionable, while still somehow preserving the ability learn, which may or may not be necessarily a strict gradient descent.

4

2

81

mike64_t

@mike64_t

5 months

(also not me calling the program counter the stack pointer because stupidity)

0

77

mike64_t

@mike64_t

10 days

Starting to see the pattern how Nvidia SASS instructions are encoded. Once you understand one instruction, it's smooth sailing.

2

3

74

mike64_t

@mike64_t

2 months

@justalexoki Every generation needs to learn the just barely functional technology. Two generations ago that was cars. If you didn't know how to do basic car repair, you were screwed. I don't know how to repair cars... But I do know how to code... However, I sort of refuse to believe

9

0

71

mike64_t

@mike64_t

2 months

llm.c optimization progress: Stock: 142.5k tokens/s/GPU Prev.: 157.5k tokens/s/GPU Current: 180.1k tokens/s/GPU

mike64_t

@mike64_t

2 months

Day one of optimizing llm.c for the 4090 Baseline: 142.5k tokens/s/GPU New: 157.5k tokens/s/GPU

2

0

21

5

3

67

mike64_t

@mike64_t

5 months

In case you are wondering at the LOAD 200, 0 JUMP 0 part of the program, this basically just exploits the fact that the memory is initialized with zeros and we can get a zero by just loading some address that hasn't been touched yet. Jumping to that value, will reset the program

2

0

51

mike64_t

@mike64_t

10 months

@mynamebedan Who doesn't?

4

1

52

mike64_t

@mike64_t

5 months

@JuneSYi A python script with an array we pretend is RAM that acts like a CPU to read instructions from that RAM which we then execute to write results back into it. It's still very close to the turing machine, but it also resembles actual computers. It makes you see that this weird head

0

1

50

mike64_t

@mike64_t

19 days

Can GPT-O1-preview write a performant matrix multiplication kernel? The answer is no. I almost used up my quota trying. It constantly keeps ignoring the most blatant problems and it has to be explicitly told to focus on them, else this train of thought is going nowhere. It has no

4

2

51

mike64_t

@mike64_t

30 days

You can now launch a Cutlass kernel from LibreCuda. Have fun making your gpu go brrrr

0

4

51

mike64_t

@mike64_t

3 months

Wellp... That's not a 4090... This was sold by Amazon itself btw... Are 4090s being stolen and replaced straight from the warehouses?

7

0

50

mike64_t

@mike64_t

5 months

@thebadcode The meta aspect is a good point. Maybe not THE first thing, but it has to be before you start wondering what the hell the operating system is doing.

4

2

44

mike64_t

@mike64_t

2 years

@anri_m_lombard @karpathy Andrej Karpathy's Zero to Hero series on YouTube

2

1

44

mike64_t

@mike64_t

5 months

@TBagsMusings here you go:

cpu.py

GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

0

2

38

mike64_t

@mike64_t

5 months

@realvantran You can abstract, that's not the point. You can even implement C in that instruction set if you want. The point is that you are shown the simple powerful principle from which you can explain it all. I know that some people do not like having to mentally enumerate the vast number

1

35

mike64_t

@mike64_t

6 months

Lots of debugging later you can now finally implement GPT-like models in my DL library. Just compiler generated Triton kernels, so no flash attention (ish thing) just yet. 146 out of 320 TFLOP/s on a 4090 isn't something I'm even close to happy with, but it's a start... (1/3)

4

3

33

mike64_t

@mike64_t

27 days

@RaghuGanti @AIatMeta @IBMResearch @hsu_byron You would still need the cuda API to launch the Triton kernel - which is at the end of the day just a cuda kernel, just not compiled by nvcc. But there is LibreCUDA which actually talks to the driver directly without any proprietary software. So expect for ptxas, which still

1

0

31

mike64_t

@mike64_t

25 days

LibreCUDA now supports CUDA Events so you can finally benchmark kernels with GPU-side time again

1

29

mike64_t

@mike64_t

9 days

@francoisfleuret It's not about elegance, its about power. This idea that everything has to be "elegant" is just one reason modern software is going to shit. How about aiming for optimality as opposed to elegance for once.

8

1

28

mike64_t

@mike64_t

2 years

@edward_the6 Basically useless. Detects human input as fake, and generated input as real basically at random. Perplexity is not a very good metric to detect whether AI actually generated a piece of text. This is an unwinnable battle.

1

0

25

mike64_t

@mike64_t

10 months

@doodlestein Yes, but what reason do you have to attribute these characteristics to matrix multiplication? I already posted an Interpretation of what I think this bug effectively does in this thread. It's not beyond the ream of interpretability that you have to resort to vague analogs.

2

0

25

mike64_t

@mike64_t

11 months

doing a float16 matmul with float32 accumulator (which is what every major framework including pytorch does). Except that they of course don't use tf32 and you leave valuable precision on the table for precisely no speed gain.

2

0

26

mike64_t

@mike64_t

22 days

If you thought Triton MLIR was cursed, watch the preprocessor I'm building:

1

2

24

mike64_t

@mike64_t

5 months

@0xKeef Everything you see your computer do can be explained if you accept the following: - Your computer has some memory and means to manipulate it. - You can do it meaningful ways to make a computer achieve some task - All you need are the following operations to compute everything

1

2

24

mike64_t

@mike64_t

1 month

LibreCUDA now supports dynamic shared memory. Now you can really launch a Triton kernel.

1

24

mike64_t

@mike64_t

11 months

Yes, tfloat32 is in practice more like a float18, but still, when you basically leave no speed on the table for more precision, you might as well just take it.

0

24

mike64_t

@mike64_t

1 year

I am about to walk into the most abundant, balanced, and wealthy period of my life. I give myself permission to prosper.

2

0

24

mike64_t

@mike64_t

5 months

@fastnfair @ethananam Because you're looking at it wrong

2

0

20

mike64_t

@mike64_t

7 months

Mandatory military service finally over which means I can finally really get back to ML. University starts next week, so fortunately no gap at all. But hey... at least I got a medal now and a rank equivalent to "private"... 🙃

3

1

23

mike64_t

@mike64_t

2 months

Day one of optimizing llm.c for the 4090 Baseline: 142.5k tokens/s/GPU New: 157.5k tokens/s/GPU

2

0

21

mike64_t

@mike64_t

2 years

@GreatFate4 @karpathy You don't have to believe me, it's open source. ./tests/pytorch/makemore_pytorch.py is the pytorch version, ./tests/src/test/java/me/mikex86/scicore/tests/makemore/MakeMoreTrainingTest.kt is my version.

GitHub - mikex86/scicore: A tiny deep learning library written in Java

A tiny deep learning library written in Java. Contribute to mikex86/scicore development by creating an account on GitHub.

github.com

2

4

20

mike64_t

@mike64_t

2 months

@YerielZamora @justalexoki It has become very stable at being an overblown chrome bootloader.

2

0

20

mike64_t

@mike64_t

9 months

Got a job offer from a startup with the goal of "developing safe AGI". Also just heard back that I got accepted into university. Now I have to make a choice... Yes, this job sounds exciting, but I also feel like I don't want to turn off my curiosity just yet and do just one thing

12

1

20

mike64_t

@mike64_t

2 months

@doomslide @xlr8harder because this is the least of your worries

4

0

20

mike64_t

@mike64_t

11 months

Searching for ML jobs in Austria is honestly depressing... I'm slowly starting to realize that if I stay in Austria, I just can't do what I love for a living.

3

0

19

mike64_t

@mike64_t

22 days

GPT2 forward pass implemented in LibreCuda (pls ignore shit MFU for now)

0

1

20

mike64_t

@mike64_t

2 months

2x RTX 4090 multi GPU training with hacked geohot driver go brrrrr

1

19

mike64_t

@mike64_t

9 months

I think with that huge for loop I'm probably fuzzing triton more than OpenAI ever bothered to because I'm discovering extremely weird behavior on a daily basis... Those two kernels should be equivalent, right? Well, one works and the other computes utter garbage...

2

1

18

mike64_t

@mike64_t

11 months

@luciascarlet At this point I think I use Linux, macOs and Windows about equally... Linux on the GPU workstation, Windows on my Main PC, macOS on my Laptop. I hate all of them equally at this point.

2

0

16

mike64_t

@mike64_t

2 years

@yacineMTB Taught myself multi variable calculus because I wanted to write an Autograd engine (which was a success). That still doesn't stop me from getting Ds on my math tests which are glorified natural language formula template insert tasks 🤡 I'm conflicted how I should interpret this.

1

0

15

mike64_t

@mike64_t

28 days

GitHub:

GitHub - mikex86/tritonc: Standalone commandline CLI tool for compiling Triton kernels

Standalone commandline CLI tool for compiling Triton kernels - mikex86/tritonc

github.com

0

1

15

mike64_t

@mike64_t

2 months

@secemp9 already is:

GitHub - mikex86/LibreCuda

Contribute to mikex86/LibreCuda development by creating an account on GitHub.

github.com

1

0

15

mike64_t

@mike64_t

2 years

@CityLab This is why Europe will never truly innovate technology ever again

0

14

mike64_t

@mike64_t

2 months

just repaired my 4090 with a needle because the 600w plug was filled with molten plastic... now I have to use the 4x150w adapter, but my PSU only has 3 left, so I'm stuck at 450w max, which would be its reported TDP, but in practice it spikes way higher. This alone erases ~20k

2

0

14

mike64_t

@mike64_t

11 months

So, do you want to matmul two float16 matmul matrices in TF32 with TF32 accumulation, or maybe compute the dot product of the group in fp32, but accumulate in fp16? You can also load float64's and do the compute in fp32, or tf32 or heck even INT8 if you desire meaningless results

1

0

11

mike64_t

@mike64_t

9 months

I just realized how terrible consumer SSDs are at accepting huge amounts of sequential writes. They fill up their DRAM cache and afterwards they either 1. lock up the system because they need to flush it 2. drop to speeds that an HDD can outperform What the hell

2

1

14

mike64_t

@mike64_t

2 months

Because two 4090s is too much for my last (and replacement) PSU, I've temporarily split the two 4090s among two computers and used it as an excuse to test NCCL all reduce over 10Gbit Ethernet. Adds ~200ms of latency to each step. From 360k total tokens/s down to ~320k tokens/s

3

1

14

mike64_t

@mike64_t

2 months

Basic command queue & synchronization working with no CUDA dependency. Next: upload a SASS ELF binary (aka. cuModuleLoadData)

2

1

13

mike64_t

@mike64_t

10 months

Ladies and gentlemen, SciCore can finally run a basic MNIST MLP compiling fully to specialized Triton kernels. For all my new followers, SciCore is a deep learning library written in Kotlin with the goal of achieving optimal performance without propriatary cuda infrastructure.

2

1

13

mike64_t

@mike64_t

2 years

@PR0GRAMMERHUM0R Fixed shit for real this time. (later) Fixed shit for real this time v2

1

0

12

mike64_t

@mike64_t

5 months

@jGaltSwe No I'm saying that you should see what you know to be a computer come to life from a simple program like this. The idea here is "build a comuter from scratch and accidentally invent assembly"

3

0

13

mike64_t

@mike64_t

11 months

How to grill a threadripper pro + 128gb ram + rtx 4090 with blender: grass 🌿

1

0

12

mike64_t

@mike64_t

6 months

The state of AMD GPUs is truly sad. The GPU is idle... Like power state Idle because it's not used at all. > GPU randomly crashes and cannot be detected by the OS again. > OS cannot do fan control. > GPU temperature rises. > Last moment temp protection ramps fan speed to 100%

1

2

12

mike64_t

@mike64_t

1 month

so... apparently you can query your Nvidia GPU for its Memory manufacturer...

2

0

12

mike64_t

@mike64_t

3 months

In case you are wondering why nobody has ever trained an LLM to predict the next action in a terminal to reconstruct git repositories via commit diff intermediates, the answer is because for ~200 repositories, you will fill 2TB of disk space of zip files with a total compression

0

12

mike64_t

@mike64_t

6 months

While this may be a contrived example, most people would expect a 4 hidden layer MLP with 8192 units and batch size 8192 to hit close to 100% of theoretical matmul throughput. But they don't. Neither with jax, nor with pytorch. (pytorch is unfair because float32 acc -> 175TFLOPS)

1

0

12

mike64_t

@mike64_t

1 year

@__tinygrad__ The RTX 4090 has more than 165.2 FP16 Tensor FLOPS with an FP32 accumulator. You can trivially hit that with a basic Triton kernel.

3

0

11

mike64_t

@mike64_t

1 month

@OpenAgentsInc "cuda should be closed source and only licensed to paying companies" too bad that isn't gonna happen

GitHub - mikex86/LibreCuda

Contribute to mikex86/LibreCuda development by creating an account on GitHub.

github.com

3

2

12

mike64_t

@mike64_t

1 month

@crypt0x_0 CMake exists :) It has a "package manager" (FetchContent) And its target system is everything you'd want from a build system.

1

0

11

mike64_t

@mike64_t

9 days

@francoisfleuret First and foremost it has to map well to the hardware. Pure functional programmers point to their elegance of qsort implementations and forget the 10000 allocations happening to keep the illusion up.

4

0

11

mike64_t

@mike64_t

19 days

When evaluating LLMs and you know the answer to the problem, you always need to be careful about the Clever Hans effect. It's real for LLMs too. A single word can leak the solution. "meaningful" here leaks the fact that maybe you shouldn't initialize your input matrices for an

1

0

11

mike64_t

@mike64_t

2 years

Just implemented a tiny GPT2-like model in my own deep learning library. Hope this is a somewhat exciting conclusion for my highschool diploma thesis (yes that's a thing in Austria), where basically I go from the chain rule to the transformer in ~100 pages.

3

0

11

mike64_t

@mike64_t

4 months

It's interesting how often people dunk on Java. Java is ideal for data crunching. Want to tokenize tons of web text? Great. There's a library for that outperforming even Tiktoken and it's not even calling native code. Need some obscure thing and be reasonably fast? There's a

2

0

11

mike64_t

@mike64_t

2 months

Dig deep enough in driver code and kernels are called shaders again

0

11

mike64_t

@mike64_t

27 days

Calling tritonc from cmake

1

11

mike64_t

@mike64_t

10 months

@Lumialities ah yes the 50mb vram conscious mnist model with 80% accuracy will take over the world

0

11

mike64_t

@mike64_t

7 months

@cis_female What...? Then why do these instructions run on my 4090? This definetely uses the tf32 tensor cores of my 4090. Performance is exactly what is reported in the nvidia whitepaper for TF32 matmul performance with tensor cores. And yes, Triton emits them for sm_89

2

0

10

mike64_t

@mike64_t

1 month

People like to bash on Cmake, but no other build system seems equally flexible. With a bit of cmake you can write the an add_kernel and target_package_kernel function that compile a cuda kernel and package the elf binary in your data section ready for you to you to call

1

0

10

mike64_t

@mike64_t

2 years

@BenTheEgg @karpathy The point here was not to build a production ready system... Before I built this thing I had no clue what the hell a neural network even is 😂 I learn by building stuff. I learn how game engines work by learning low level graphics programming and building one. Same goes for ML.

2

1

10

mike64_t

@mike64_t

8 days

Allow me to shatter your world-view: This is UI is Java Swing.

mike64_t

@mike64_t

8 days

still waiting for vim bros to discover IDEs written in Java Swing with more features than all their 10 mio vim scripts combined

0

3

1

10

mike64_t

@mike64_t

1 year

@ykilcher Impressive, but that must be fixed ;) OA can't possibly be more boring than something that has been RLHFd to death.

1

0

10

mike64_t

@mike64_t

1 month

Ever wanted to launch a kernel on a stream without a strict dependency on the previous without creating a second stream? Turns out the GPU FIFO can happily provide that. So why not add a feature to CUDA when you get the ability to rewrite it. Launch a bunch of kernels in parallel

4

0

10

mike64_t

@mike64_t

4 months

AI: Exists Nobody: Europeans: jawoll now we can wörk even less

2

1

10

mike64_t

@mike64_t

1 year

If you thought installing CUDA was hard, try installing ROCm...

1

0

9

mike64_t

@mike64_t

10 days

You can tell a lot just from understanding the nature of bugs. This weird space before the ; is an artifact of an optionally present argument. If the space is missing, the instruction doesn't have it.

1

0

8

mike64_t

@mike64_t

20 days

want to turn off your jet engine again? here's fan_un_brr.py

0

9

mike64_t

@mike64_t

22 days

Summoning the autotuner from hell to discover obscure constants nobody dared to grid search across before...

1

0

9

mike64_t

@mike64_t

2 months

mike64_t

@mike64_t

2 months

Dig deep enough in driver code and kernels are called shaders again

0

11

1

0

9

mike64_t

@mike64_t

2 months

So... My power supply just started smelling like molten plastic... Seems like two 4090s is too much for a 1200w power supply...

1

0

9

mike64_t

@mike64_t

1 month

Tricks for building hyper-optimized approximations: - Gradient optimize suitable expressions to fit your desired function (eg. in torch), then implement a fast kernel with equivalent operations - Use "high level" PTX intrinsics like tanh.approx.f16 whenever you can - Either

0

2

8

mike64_t

@mike64_t

2 years

@anri_m_lombard @karpathy Thanks! The programming languages are not the big deal here though xD The big aha moment actually was realizing what a DL library like tensorflow/pytorch >REALLY< is. If you figure that out, you know how to write one yourself. And that is what I wanted to prove to myself here.

1

0

8

mike64_t

@mike64_t

11 months

@parthshama1996 The Austrian military? I mean they commissioned this gun's development. Seems like it's good enough for the Australians too. Legend has it someone misread the country of a shipping and the Australians gladly took it ;) Also: it's called the STURMGEWEHR 77

0

1

7

mike64_t

@mike64_t

10 days

@davorVDR There are patterns that span across all instructions i've tested: @(!)P is encoded at bit pos 12 p_neg bit flag at 15, etc. There seem to be "ur" and "non ur" instructions. All "ur" instructions seem to have bit 91 set. Ur instructions can perform offset calculations with

2

0

7

mike64_t

@mike64_t

1 year

Ladies and gentlemen, it is done. 320 TFLOPS for a float16 matmul with float16 accumulator. Cutlass is at 300 TFLOPS for same m,n,k, transposition states and data type. Hardware: RTX 4090 with OC (core +200Mhz, mem +100Mhz)

1

0

8

mike64_t

@mike64_t

1 year

Building a workstation with 2 GPUs (RTX 4090, RX 7800XT). Gotta support both platforms

0

7

mike64_t

@mike64_t

10 months

@doodlestein I agree that evolution definitely arrived at such an architecture, and that a TRAINED neural network might exhibit similar characters especially when using dropout, but it doesn't make sense that something of the form of uniform random values already exhibits this.

1

0

8