mike64_t

@mike64_t

1,722 Followers · 239 Following · 195 Media · 1,808 Statuses

descending the gradient

Joined October 2022
Pinned Tweet
@mike64_t
mike64_t
2 months
It is done. LibreCUDA can now launch CUDA kernels without relying on the proprietary CUDA runtime / driver api. It does so by communicating directly with the hardware via the ioctl "rm api" and Nvidia's QMD MMIO command queue structure.
Tweet media one
Tweet media two
Tweet media three
9
41
288
@mike64_t
mike64_t
5 months
It always amazes me that this is not the first thing taught in a CS class. People have Eureka moments whenever I show it to them...
Tweet media one
Tweet media two
Tweet media three
Tweet media four
122
190
3K
@mike64_t
mike64_t
2 years
I can't explain how amazing @karpathy's lectures are. Andrej's lectures are detailed enough that I could not only follow along, but write my own tensor processing + autograd engine in Java+Kotlin & C++ from scratch. And best of all, it's 2x faster than PyTorch! SIMD for the win!
Tweet media one
Tweet media two
Tweet media three
Tweet media four
28
138
2K
@mike64_t
mike64_t
11 months
Me at 19 [right now]
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@IamIronLAN
IamIronLAN
11 months
me at 19
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
0
82
17
12
526
@mike64_t
mike64_t
10 months
So, neural networks are REALLY robust to errors, as it turns out. Like, I just discovered a bug in my kernel caching logic, where I was computing matmul with completely wrong strides, and it STILL LEARNED. Like... WHAT WAS COMPUTED HERE IS NOT A DERIVATIVE BY ANY MEANS
15
13
415
@mike64_t
mike64_t
2 years
Wake up babe, new lecture from @karpathy just dropped
8
24
243
@mike64_t
mike64_t
4 months
Tangentially related to Leopold Aschenbrenner's episode on Dwarkesh
4
12
152
@mike64_t
mike64_t
11 months
So, it turns out on a 4090, when you do a float32 matmul with tfloat32 compute, you get ~85 TFLOPs. HOWEVER, when you build a custom kernel that loads operands in float16, performs the compute in tfloat32 and downcasts it again, you get 173 TFLOPs, which is only slightly less than
4
4
150
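For context, the gap between those two paths can be reproduced with a few lines of torch. A minimal sketch, not the custom fp16-load/tf32-compute kernel described in the tweet; it only times the stock cuBLAS paths, and the matrix sizes, iteration count and the 2·M·N·K FLOP count are assumptions:

import time, torch

def bench_matmul(dtype, M=8192, N=8192, K=8192, iters=20):
    # allow float32 matmuls to use the TF32 tensor-core path
    torch.backends.cuda.matmul.allow_tf32 = True
    a = torch.randn(M, K, device="cuda", dtype=dtype)
    b = torch.randn(K, N, device="cuda", dtype=dtype)
    for _ in range(3):                       # warmup
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * M * N * K / dt / 1e12         # achieved TFLOP/s

print("float32 operands (tf32 compute):", bench_matmul(torch.float32))
print("float16 operands               :", bench_matmul(torch.float16))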
@mike64_t
mike64_t
20 days
You might be leaving performance on the table by not forcing your GPU fan speed to 100%. Here's a simple fix:
Tweet media one
3
4
132
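The script in the screenshot isn't reproduced here. For reference, on Linux the usual way to pin the fan is through nvidia-settings, which needs a running X session and Coolbits enabled; a minimal sketch (the fan and GPU indices are assumptions, and cards with two fans need both [fan:0] and [fan:1]):

import subprocess

def set_fan_speed(percent: int, gpu: int = 0):
    # switch the GPU to manual fan control, then set the target speed
    subprocess.run(["nvidia-settings",
                    "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
                    "-a", f"[fan:{gpu}]/GPUTargetFanSpeed={percent}"],
                   check=True)

set_fan_speed(100)   # jet engine mode; set GPUFanControlState=0 to hand control back to the driver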
@mike64_t
mike64_t
1 month
LLVM is really powerful. 250 lines to create a basic but quite functional replacement for NVCC. Most of the work is replacing the CUDA stdlib, which also isn't that hard. NVPTX generates similar code to NVCC if you force it to, but LLVM will happily reject optimizations such as loop
Tweet media one
Tweet media two
Tweet media three
Tweet media four
2
14
125
@mike64_t
mike64_t
5 months
@fastnfair @ethananam It's not about assembly, it's about what a computer actually does. A toy ISA you made up on the spot is, I would argue, also easier to write than random CISC ISA assembly™
3
0
99
@mike64_t
mike64_t
2 months
@xlr8harder tinygrad code is EXTREMELY efficient to the point that a mortal reader doesn't even grasp a sliver of its genius. E.g. qmd_struct_t in runtime/nvops.py Just a normal type, right? WRONG, it's a type with dynamic attributes generated to set bit-field regions in a logical
5
2
98
@mike64_t
mike64_t
28 days
tritonc - your standalone, python-free, command line Triton compiler (what have I done...)
11
10
96
@mike64_t
mike64_t
5 months
Mom can we get scaling laws? We have scaling laws at home. Scaling laws at home:
Tweet media one
4
7
91
@mike64_t
mike64_t
10 months
Makes me wonder what kind of tricks you can do in the name of TFLOP throughput, increasingly more mathematically questionable, while still somehow preserving the ability to learn, which may not necessarily be strict gradient descent.
4
2
81
@mike64_t
mike64_t
5 months
(also not me calling the program counter the stack pointer because stupidity)
0
0
77
@mike64_t
mike64_t
10 days
Starting to see the pattern in how Nvidia SASS instructions are encoded. Once you understand one instruction, it's smooth sailing.
Tweet media one
Tweet media two
2
3
74
@mike64_t
mike64_t
2 months
@justalexoki Every generation needs to learn the just barely functional technology. Two generations ago that was cars. If you didn't know how to do basic car repair, you were screwed. I don't know how to repair cars... But I do know how to code... However, I sort of refuse to believe
9
0
71
@mike64_t
mike64_t
2 months
llm.c optimization progress:
Stock: 142.5k tokens/s/GPU
Prev.: 157.5k tokens/s/GPU
Current: 180.1k tokens/s/GPU
Tweet media one
@mike64_t
mike64_t
2 months
Day one of optimizing llm.c for the 4090
Baseline: 142.5k tokens/s/GPU
New: 157.5k tokens/s/GPU
Tweet media one
2
0
21
5
3
67
@mike64_t
mike64_t
5 months
In case you are wondering about the LOAD 200, 0 JUMP 0 part of the program, this basically just exploits the fact that the memory is initialized with zeros and we can get a zero by just loading some address that hasn't been touched yet. Jumping to that value will reset the program
2
0
51
@mike64_t
mike64_t
10 months
@mynamebedan Who doesn't?
Tweet media one
Tweet media two
4
1
52
@mike64_t
mike64_t
5 months
@JuneSYi A Python script with an array we pretend is RAM; the script acts like a CPU, reading instructions from that RAM, executing them, and writing results back into it. It's still very close to the Turing machine, but it also resembles actual computers. It makes you see that this weird head
0
1
50
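A minimal sketch of that idea, with a made-up two-word instruction format rather than the exact toy ISA from the thread: RAM is a zero-initialized Python list, and the "CPU" is a loop that fetches an opcode and an argument from that list, executes them, and writes results back into the same memory.

# RAM: flat and zero-initialized; program and data share the same array
RAM = [0] * 256

# hypothetical opcodes, for illustration only
LOAD, STORE, ADD, JUMP, HALT = 1, 2, 3, 4, 5

def run(program):
    RAM[:len(program)] = program
    pc, acc = 0, 0                        # program counter and a single accumulator
    while RAM[pc] != HALT:
        op, arg = RAM[pc], RAM[pc + 1]
        pc += 2
        if op == LOAD:    acc = RAM[arg]  # read memory into the accumulator
        elif op == STORE: RAM[arg] = acc  # write the accumulator back to memory
        elif op == ADD:   acc += RAM[arg]
        elif op == JUMP:  pc = acc        # load an untouched (still zero) address first and
                                          # this restarts the program, as in the
                                          # LOAD 200, 0 / JUMP 0 trick above
    return acc

run([LOAD, 100, ADD, 101, STORE, 102, HALT])   # mem[102] <- mem[100] + mem[101]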
@mike64_t
mike64_t
19 days
Can GPT-O1-preview write a performant matrix multiplication kernel? The answer is no. I almost used up my quota trying. It keeps ignoring the most blatant problems and has to be explicitly told to focus on them, else this train of thought is going nowhere. It has no
Tweet media one
4
2
51
@mike64_t
mike64_t
30 days
You can now launch a Cutlass kernel from LibreCuda. Have fun making your gpu go brrrr
Tweet media one
Tweet media two
0
4
51
@mike64_t
mike64_t
3 months
Wellp... That's not a 4090... This was sold by Amazon itself btw... Are 4090s being stolen and replaced straight from the warehouses?
Tweet media one
Tweet media two
7
0
50
@mike64_t
mike64_t
5 months
@thebadcode The meta aspect is a good point. Maybe not THE first thing, but it has to be before you start wondering what the hell the operating system is doing.
4
2
44
@mike64_t
mike64_t
2 years
@anri_m_lombard @karpathy Andrej Karpathy's Zero to Hero series on YouTube
2
1
44
@mike64_t
mike64_t
5 months
@realvantran You can abstract, that's not the point. You can even implement C in that instruction set if you want. The point is that you are shown the simple powerful principle from which you can explain it all. I know that some people do not like having to mentally enumerate the vast number
1
1
35
@mike64_t
mike64_t
6 months
Lots of debugging later, you can now finally implement GPT-like models in my DL library. Just compiler-generated Triton kernels, so no flash attention (ish thing) just yet. 146 out of 320 TFLOP/s on a 4090 isn't something I'm even close to happy with, but it's a start... (1/3)
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
3
33
@mike64_t
mike64_t
27 days
@RaghuGanti @AIatMeta @IBMResearch @hsu_byron You would still need the cuda API to launch the Triton kernel - which is at the end of the day just a cuda kernel, just not compiled by nvcc. But there is LibreCUDA which actually talks to the driver directly without any proprietary software. So except for ptxas, which still
1
0
31
@mike64_t
mike64_t
25 days
LibreCUDA now supports CUDA Events so you can finally benchmark kernels with GPU-side time again
Tweet media one
Tweet media two
1
1
29
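LibreCUDA's own function names aren't shown here; for reference, this is the standard CUDA-event timing pattern the feature enables, sketched with torch.cuda.Event as a stand-in:

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()              # timestamp taken on the GPU, not the host
a @ a                       # the kernel(s) being benchmarked
end.record()
torch.cuda.synchronize()    # wait until both events have actually been reached
print(start.elapsed_time(end), "ms")   # GPU-side elapsed time between the events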
@mike64_t
mike64_t
9 days
@francoisfleuret It's not about elegance, it's about power. This idea that everything has to be "elegant" is just one reason modern software is going to shit. How about aiming for optimality as opposed to elegance for once.
8
1
28
@mike64_t
mike64_t
2 years
@edward_the6 Basically useless. Detects human input as fake, and generated input as real basically at random. Perplexity is not a very good metric to detect whether AI actually generated a piece of text. This is an unwinnable battle.
1
0
25
@mike64_t
mike64_t
10 months
@doodlestein Yes, but what reason do you have to attribute these characteristics to matrix multiplication? I already posted an interpretation of what I think this bug effectively does in this thread. It's not so far beyond the realm of interpretability that you have to resort to vague analogies.
2
0
25
@mike64_t
mike64_t
11 months
doing a float16 matmul with float32 accumulator (which is what every major framework including pytorch does). Except that they of course don't use tf32 and you leave valuable precision on the table for precisely no speed gain.
2
0
26
@mike64_t
mike64_t
22 days
If you thought Triton MLIR was cursed, watch the preprocessor I'm building:
Tweet media one
1
2
24
@mike64_t
mike64_t
5 months
@0xKeef Everything you see your computer do can be explained if you accept the following:
- Your computer has some memory and means to manipulate it.
- You can do so in meaningful ways to make a computer achieve some task
- All you need are the following operations to compute everything
1
2
24
@mike64_t
mike64_t
1 month
LibreCUDA now supports dynamic shared memory. Now you can really launch a Triton kernel.
Tweet media one
Tweet media two
Tweet media three
1
1
24
@mike64_t
mike64_t
11 months
Yes, tfloat32 is in practice more like a float18, but still, when the extra precision costs you basically no speed, you might as well just take it.
0
0
24
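For reference, tf32 keeps float32's sign and 8 exponent bits but only the top 10 of the 23 mantissa bits. A rough sketch of what that does to a value (plain truncation here, whereas the hardware conversion rounds):

import struct

def to_tf32(x: float) -> float:
    # float32 layout: 1 sign | 8 exponent | 23 mantissa bits.
    # tf32 keeps only the top 10 mantissa bits, so clear the low 13.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

print(to_tf32(3.14159265))   # fp16-like mantissa precision, fp32 range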
@mike64_t
mike64_t
1 year
I am about to walk into the most abundant, balanced, and wealthy period of my life. I give myself permission to prosper.
Tweet media one
Tweet media two
Tweet media three
2
0
24
@mike64_t
mike64_t
5 months
@fastnfair @ethananam Because you're looking at it wrong
2
0
20
@mike64_t
mike64_t
7 months
Mandatory military service is finally over, which means I can finally get back to ML. University starts next week, so fortunately no gap at all. But hey... at least I got a medal now and a rank equivalent to "private"... 🙃
Tweet media one
3
1
23
@mike64_t
mike64_t
2 months
Day one of optimizing llm.c for the 4090
Baseline: 142.5k tokens/s/GPU
New: 157.5k tokens/s/GPU
Tweet media one
2
0
21
@mike64_t
mike64_t
2 years
@GreatFate4 @karpathy You don't have to believe me, it's open source. ./tests/pytorch/makemore_pytorch.py is the pytorch version, ./tests/src/test/java/me/mikex86/scicore/tests/makemore/MakeMoreTrainingTest.kt is my version.
2
4
20
@mike64_t
mike64_t
2 months
@YerielZamora @justalexoki It has become very stable at being an overblown chrome bootloader.
2
0
20
@mike64_t
mike64_t
9 months
Got a job offer from a startup with the goal of "developing safe AGI". Also just heard back that I got accepted into university. Now I have to make a choice... Yes, this job sounds exciting, but I also feel like I don't want to turn off my curiosity just yet and do just one thing
12
1
20
@mike64_t
mike64_t
2 months
@doomslide @xlr8harder because this is the least of your worries
4
0
20
@mike64_t
mike64_t
11 months
Searching for ML jobs in Austria is honestly depressing... I'm slowly starting to realize that if I stay in Austria, I just can't do what I love for a living.
3
0
19
@mike64_t
mike64_t
22 days
GPT2 forward pass implemented in LibreCuda (pls ignore shit MFU for now)
Tweet media one
Tweet media two
0
1
20
@mike64_t
mike64_t
2 months
2x RTX 4090 multi GPU training with hacked geohot driver go brrrrr
Tweet media one
Tweet media two
Tweet media three
1
1
19
@mike64_t
mike64_t
9 months
I think with that huge for loop I'm probably fuzzing triton more than OpenAI ever bothered to because I'm discovering extremely weird behavior on a daily basis... Those two kernels should be equivalent, right? Well, one works and the other computes utter garbage...
Tweet media one
Tweet media two
Tweet media three
2
1
18
@mike64_t
mike64_t
11 months
@luciascarlet At this point I think I use Linux, macOS and Windows about equally... Linux on the GPU workstation, Windows on my main PC, macOS on my laptop. I hate all of them equally at this point.
2
0
16
@mike64_t
mike64_t
2 years
@yacineMTB Taught myself multivariable calculus because I wanted to write an Autograd engine (which was a success). That still doesn't stop me from getting Ds on my math tests, which are glorified natural-language formula template insertion tasks 🤡 I'm conflicted about how I should interpret this.
1
0
15
@mike64_t
mike64_t
2 years
@CityLab This is why Europe will never truly innovate technology ever again
0
0
14
@mike64_t
mike64_t
2 months
just repaired my 4090 with a needle because the 600w plug was filled with molten plastic... now I have to use the 4x150w adapter, but my PSU only has 3 left, so I'm stuck at 450w max, which would be its reported TDP, but in practice it spikes way higher. This alone erases ~20k
2
0
14
@mike64_t
mike64_t
11 months
So, do you want to matmul two float16 matrices in TF32 with TF32 accumulation, or maybe compute the dot product of the group in fp32, but accumulate in fp16? You can also load float64s and do the compute in fp32, or tf32, or heck, even INT8 if you desire meaningless results
Tweet media one
1
0
11
@mike64_t
mike64_t
9 months
I just realized how terrible consumer SSDs are at accepting huge amounts of sequential writes. They fill up their DRAM cache and afterwards they either
1. lock up the system because they need to flush it
2. drop to speeds that an HDD can outperform
What the hell
2
1
14
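A crude way to watch that write cache fill up: write large chunks in a loop and print the per-chunk throughput. A sketch only; the path, chunk size and total size are placeholders, and a serious measurement would bypass the page cache entirely (O_DIRECT) instead of relying on O_SYNC.

import os, time

CHUNK = 256 * 1024 * 1024                       # 256 MiB per write
chunk = os.urandom(CHUNK)
fd = os.open("/mnt/ssd/testfile", os.O_WRONLY | os.O_CREAT | os.O_SYNC)

for i in range(64):                             # ~16 GiB total, enough to exhaust most caches
    t0 = time.perf_counter()
    os.write(fd, chunk)
    dt = time.perf_counter() - t0
    print(f"chunk {i}: {CHUNK / dt / 1e6:.0f} MB/s")

os.close(fd)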
@mike64_t
mike64_t
2 months
Because two 4090s is too much for my last (and replacement) PSU, I've temporarily split the two 4090s among two computers and used it as an excuse to test NCCL all reduce over 10Gbit Ethernet. Adds ~200ms of latency to each step. From 360k total tokens/s down to ~320k tokens/s
Tweet media one
3
1
14
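A minimal sketch of what such a test looks like with torch.distributed; the interface name, addresses and buffer size are placeholders, and NCCL_SOCKET_IFNAME is what forces NCCL onto the Ethernet NIC when there is no NVLink or InfiniBand:

# run one copy per machine, e.g.:
#   NCCL_SOCKET_IFNAME=eth0 MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 \
#   RANK=0 WORLD_SIZE=2 python allreduce_bench.py
import time, torch, torch.distributed as dist

dist.init_process_group("nccl")       # reads RANK / WORLD_SIZE / MASTER_* from the environment
torch.cuda.set_device(0)
grad = torch.randn(124_000_000, device="cuda")   # roughly a GPT-2 124M gradient buffer

torch.cuda.synchronize()
t0 = time.perf_counter()
dist.all_reduce(grad)                 # sum the buffer across both machines
torch.cuda.synchronize()
print(f"all-reduce took {(time.perf_counter() - t0) * 1e3:.0f} ms")
dist.destroy_process_group()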
@mike64_t
mike64_t
2 months
Basic command queue & synchronization working with no CUDA dependency. Next: upload a SASS ELF binary (aka. cuModuleLoadData)
Tweet media one
Tweet media two
Tweet media three
2
1
13
@mike64_t
mike64_t
10 months
Ladies and gentlemen, SciCore can finally run a basic MNIST MLP compiling fully to specialized Triton kernels. For all my new followers, SciCore is a deep learning library written in Kotlin with the goal of achieving optimal performance without proprietary cuda infrastructure.
Tweet media one
2
1
13
@mike64_t
mike64_t
2 years
@PR0GRAMMERHUM0R Fixed shit for real this time. (later) Fixed shit for real this time v2
1
0
12
@mike64_t
mike64_t
5 months
@jGaltSwe No, I'm saying that you should see what you know to be a computer come to life from a simple program like this. The idea here is "build a computer from scratch and accidentally invent assembly"
3
0
13
@mike64_t
mike64_t
11 months
How to grill a threadripper pro + 128gb ram + rtx 4090 with blender: grass 🌿
Tweet media one
Tweet media two
1
0
12
@mike64_t
mike64_t
6 months
The state of AMD GPUs is truly sad. The GPU is idle... Like power state Idle because it's not used at all.
> GPU randomly crashes and cannot be detected by the OS again.
> OS cannot do fan control.
> GPU temperature rises.
> Last moment temp protection ramps fan speed to 100%
1
2
12
@mike64_t
mike64_t
1 month
so... apparently you can query your Nvidia GPU for its memory manufacturer...
Tweet media one
2
0
12
@mike64_t
mike64_t
3 months
In case you are wondering why nobody has ever trained an LLM to predict the next action in a terminal to reconstruct git repositories via commit diff intermediates, the answer is that for ~200 repositories, you will fill 2TB of disk space with zip files with a total compression
Tweet media one
Tweet media two
0
0
12
@mike64_t
mike64_t
6 months
While this may be a contrived example, most people would expect a 4 hidden layer MLP with 8192 units and batch size 8192 to hit close to 100% of theoretical matmul throughput. But they don't. Neither with jax, nor with pytorch. (pytorch is unfair because float32 acc -> 175TFLOPS)
Tweet media one
1
0
12
@mike64_t
mike64_t
1 year
@__tinygrad__ The RTX 4090 has more than 165.2 FP16 Tensor TFLOPS with an FP32 accumulator. You can trivially hit that with a basic Triton kernel.
Tweet media one
3
0
11
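The kernel in the screenshot isn't reproduced here; a minimal version of the same idea looks roughly like this (no autotuning or bounds masking, and it assumes row-major fp16 matrices with dimensions divisible by the block sizes):

import torch, triton, triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)      # fp32 accumulator
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)                                    # tensor-core mma on fp16 tiles
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = torch.empty_like(a)
grid = (4096 // 128, 4096 // 128)
matmul_kernel[grid](a, b, c, 4096, 4096, 4096, BLOCK_M=128, BLOCK_N=128, BLOCK_K=32)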
@mike64_t
mike64_t
1 month
@OpenAgentsInc "cuda should be closed source and only licensed to paying companies" too bad that isn't gonna happen
3
2
12
@mike64_t
mike64_t
1 month
@crypt0x_0 CMake exists :)
It has a "package manager" (FetchContent)
And its target system is everything you'd want from a build system.
1
0
11
@mike64_t
mike64_t
9 days
@francoisfleuret First and foremost it has to map well to the hardware. Pure functional programmers point to the elegance of their qsort implementations and forget the 10000 allocations happening to keep the illusion up.
4
0
11
@mike64_t
mike64_t
19 days
When evaluating LLMs on a problem you know the answer to, you always need to be careful about the Clever Hans effect. It's real for LLMs too. A single word can leak the solution. "meaningful" here leaks the fact that maybe you shouldn't initialize your input matrices for an
Tweet media one
1
0
11
@mike64_t
mike64_t
2 years
Just implemented a tiny GPT2-like model in my own deep learning library. Hope this is a somewhat exciting conclusion for my high school diploma thesis (yes, that's a thing in Austria), where I basically go from the chain rule to the transformer in ~100 pages.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
0
11
@mike64_t
mike64_t
4 months
It's interesting how often people dunk on Java. Java is ideal for data crunching. Want to tokenize tons of web text? Great. There's a library for that outperforming even Tiktoken and it's not even calling native code. Need some obscure thing and be reasonably fast? There's a
2
0
11
@mike64_t
mike64_t
2 months
Dig deep enough in driver code and kernels are called shaders again
0
0
11
@mike64_t
mike64_t
27 days
Calling tritonc from cmake
1
1
11
@mike64_t
mike64_t
10 months
@Lumialities ah yes the 50mb vram conscious mnist model with 80% accuracy will take over the world
0
0
11
@mike64_t
mike64_t
7 months
@cis_female What...? Then why do these instructions run on my 4090? This definitely uses the tf32 tensor cores of my 4090. Performance is exactly what is reported in the nvidia whitepaper for TF32 matmul performance with tensor cores. And yes, Triton emits them for sm_89
Tweet media one
Tweet media two
2
0
10
@mike64_t
mike64_t
1 month
People like to bash on CMake, but no other build system seems equally flexible. With a bit of CMake you can write an add_kernel and a target_package_kernel function that compile a cuda kernel and package the elf binary in your data section, ready for you to call
Tweet media one
Tweet media two
Tweet media three
1
0
10
@mike64_t
mike64_t
2 years
@BenTheEgg @karpathy The point here was not to build a production ready system... Before I built this thing I had no clue what the hell a neural network even is 😂 I learn by building stuff. I learn how game engines work by learning low level graphics programming and building one. Same goes for ML.
2
1
10
@mike64_t
mike64_t
8 days
Allow me to shatter your world-view: This UI is Java Swing.
Tweet media one
@mike64_t
mike64_t
8 days
still waiting for vim bros to discover IDEs written in Java Swing with more features than all their 10 million vim scripts combined
0
0
3
3
1
10
@mike64_t
mike64_t
1 year
@ykilcher Impressive, but that must be fixed ;) OA can't possibly be more boring than something that has been RLHFd to death.
Tweet media one
1
0
10
@mike64_t
mike64_t
1 month
Ever wanted to launch a kernel on a stream without a strict dependency on the previous one, without creating a second stream? Turns out the GPU FIFO can happily provide that. So why not add a feature to CUDA when you get the ability to rewrite it. Launch a bunch of kernels in parallel
Tweet media one
4
0
10
@mike64_t
mike64_t
4 months
AI: Exists
Nobody:
Europeans: jawoll now we can wörk even less
Tweet media one
2
1
10
@mike64_t
mike64_t
1 year
If you thought installing CUDA was hard, try installing ROCm...
Tweet media one
1
0
9
@mike64_t
mike64_t
10 days
You can tell a lot just from understanding the nature of bugs. This weird space before the ; is an artifact of an optionally present argument. If the space is missing, the instruction doesn't have it.
Tweet media one
1
0
8
@mike64_t
mike64_t
20 days
want to turn off your jet engine again? here's fan_un_brr.py
Tweet media one
0
0
9
@mike64_t
mike64_t
22 days
Summoning the autotuner from hell to discover obscure constants nobody dared to grid search across before...
Tweet media one
Tweet media two
1
0
9
@mike64_t
mike64_t
2 months
Tweet media one
@mike64_t
mike64_t
2 months
Dig deep enough in driver code and kernels are called shaders again
0
0
11
1
0
9
@mike64_t
mike64_t
2 months
So... My power supply just started smelling like molten plastic... Seems like two 4090s is too much for a 1200w power supply...
1
0
9
@mike64_t
mike64_t
1 month
Tricks for building hyper-optimized approximations:
- Gradient optimize suitable expressions to fit your desired function (e.g. in torch), then implement a fast kernel with equivalent operations
- Use "high level" PTX intrinsics like tanh.approx.f16 whenever you can
- Either
0
2
8
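A sketch of the first trick: fit the coefficients of a cheap polynomial to a target function by gradient descent in torch, then port the resulting fused-multiply-add chain into a kernel. The target function (erf), degree and fit range here are arbitrary placeholders:

import torch

target = torch.erf
x = torch.linspace(-3.0, 3.0, 4096)
coeffs = torch.zeros(6, requires_grad=True)      # degree-5 polynomial, highest order first

opt = torch.optim.Adam([coeffs], lr=1e-2)
for step in range(5000):
    # Horner evaluation: the same FMA chain you would later write in the kernel
    y = torch.zeros_like(x)
    for c in coeffs:
        y = y * x + c
    loss = (y - target(x)).abs().max()           # minimize worst-case error on the fit range
    opt.zero_grad()
    loss.backward()
    opt.step()

print(coeffs.detach(), loss.item())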
@mike64_t
mike64_t
2 years
@anri_m_lombard @karpathy Thanks! The programming languages are not the big deal here though xD The big aha moment actually was realizing what a DL library like tensorflow/pytorch >REALLY< is. If you figure that out, you know how to write one yourself. And that is what I wanted to prove to myself here.
1
0
8
@mike64_t
mike64_t
11 months
@parthshama1996 The Austrian military? I mean, they commissioned this gun's development. Seems like it's good enough for the Australians too. Legend has it someone misread the country of a shipment and the Australians gladly took it ;) Also: it's called the STURMGEWEHR 77
0
1
7
@mike64_t
mike64_t
10 days
@davorVDR There are patterns that span across all instructions I've tested: @(!)P is encoded at bit pos 12, p_neg bit flag at 15, etc. There seem to be "ur" and "non ur" instructions. All "ur" instructions seem to have bit 91 set. Ur instructions can perform offset calculations with
2
0
7
@mike64_t
mike64_t
1 year
Ladies and gentlemen, it is done. 320 TFLOPS for a float16 matmul with float16 accumulator. Cutlass is at 300 TFLOPS for the same m, n, k, transposition states and data type. Hardware: RTX 4090 with OC (core +200Mhz, mem +100Mhz)
Tweet media one
Tweet media two
1
0
8
@mike64_t
mike64_t
1 year
Building a workstation with 2 GPUs (RTX 4090, RX 7800XT). Gotta support both platforms
0
0
7
@mike64_t
mike64_t
10 months
@doodlestein I agree that evolution definitely arrived at such an architecture, and that a TRAINED neural network might exhibit similar characteristics, especially when using dropout, but it doesn't make sense that something of the form of uniform random values already exhibits this.
1
0
8