Kuter Dinel @KuterDinel Twitter profile

Pinned Tweet

Kuter Dinel

@KuterDinel

19 days

Here is the NVIDIA RTX 4090 (sm89) instruction set. Please share.

32

240

2K

Kuter Dinel

@KuterDinel

19 days

FYI. I am interested in low-level GPU programming job opportunities.

11

523

Kuter Dinel

@KuterDinel

21 days

I am releasing the Nvidia ISA Solver code here under MIT license.

GitHub - kuterd/nv_isa_solver: Nvidia Instruction Set Specification Generator

Nvidia Instruction Set Specification Generator. Contribute to kuterd/nv_isa_solver development by creating an account on GitHub.

github.com

4

47

433

Kuter Dinel

@KuterDinel

19 days

Here is the RTX 4090 ISA spec. Please retweet. Accidentally deleted the last tweet 🫠

3

62

295

Kuter Dinel

@KuterDinel

4 months

@thecaptain_nemo An animal caught in a trap will gnaw off its own leg to escape. What will you do?

0

2

208

Kuter Dinel

@KuterDinel

16 days

Thanks a lot for the attention. Several companies reached out. Taking a small break from the NVIDIA RE project to consider different options.

2

182

Kuter Dinel

@KuterDinel

20 days

I published the machine readable ISA for NVIDIA Hopper GPUs. We are currently at 1505 instructions. There are still some others that I will add.

2

9

97

Kuter Dinel

@KuterDinel

9 months

@iammemeloper Real programmers use punch cards

0

3

73

Kuter Dinel

@KuterDinel

18 days

In case you missed the previous tweet, here are the ISA docs for NVIDIA Hopper.

3

4

62

Kuter Dinel

@KuterDinel

6 months

@0xjprx Wow, really cool that it will go to pass through when the OS crashes. Is this because the r1 chip handles pass through and compositing ?

1

0

56

Kuter Dinel

@KuterDinel

5 months

@wasphyxiation Thou shalt not make a machine in the likeness of a human mind.

0

52

Kuter Dinel

@KuterDinel

10 months

@coffeebreak_YT @innercitypress Next Letter: Dear Judge Kaplan, Due to Mr. Bankman-Fried's lack of access to League Of Legends he has not been able to concentrate at the level he ordinarily would.

1

0

45

Kuter Dinel

@KuterDinel

23 days

Here is the preview for the Nvidia SASS ISA docs I had been working on for the last month. Please share since I don't have much reach on this platform.

2

10

44

Kuter Dinel

@KuterDinel

10 months

@LiveOverflow Oof this hits too close, I spent way too much time digging through V8's source code trying to find anything. Nothing found yet, .... but one day I will.

1

0

28

Kuter Dinel

@KuterDinel

6 months

@RyanMorey @0xjprx I think the real reason is to minimize pass through latency.

2

0

25

Kuter Dinel

@KuterDinel

9 months

@shxf0072 @OpenAI Not your GPU, not your assistant.

1

0

22

Kuter Dinel

@KuterDinel

9 months

@felix_red_panda If true, this has important implications for on-device inference. We can all have GPT-3 level models running offline in a few years. The future is exciting!

3

0

20

Kuter Dinel

@KuterDinel

19 days

Still need to measure instruction latencies to be able to create a high performance compiler. Which GPU Should I prioritize?

RTX4090

172

Hopper H100

60

3

0

21

Kuter Dinel

@KuterDinel

10 months

@durreadan01 I mean, it can be fun as a gimmick to try. But I don't think there are many active users of these apps. I remember making a similar app in high school when iPhone X came out. Most people download these apps play around a little bit and then move on.

1

0

17

Kuter Dinel

@KuterDinel

10 months

@trunarla This is not specific to JS. Just how floating point numbers work.

2

0

15
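
The floating-point remark in the reply above can be illustrated outside JavaScript. This is a minimal Python sketch (Python is chosen here only as an example of another language using IEEE 754 doubles); the variable names are illustrative.

```python
import math
from decimal import Decimal

# 0.1 and 0.2 have no exact binary representation, so their sum
# is not exactly the double closest to 0.3.
total = 0.1 + 0.2
is_exact = (total == 0.3)

print(is_exact)        # False
print(Decimal(0.1))    # the exact value actually stored for 0.1

# Comparing with a tolerance is the usual workaround.
is_close = math.isclose(total, 0.3)
print(is_close)        # True
```

The same comparison gives the same result in any language that uses binary doubles, which is the point of the reply.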

Kuter Dinel

@KuterDinel

5 months

@natolambert 'Twitter for h100' 😅

0

14

Kuter Dinel

@KuterDinel

6 months

@prerationalist DALL-E is trained on data scraped from the internet. It learns to generate an image for a given caption. You usually don't write "no elephants" caption in a random photo. People who are sarcastic on images containing elephants might. DALL-E is just "interpolating" its dataset.

2

0

13

Kuter Dinel

@KuterDinel

10 months

@RenwaX23 It's more exciting to hack something that wasn't meant to be hacked !

1

0

13

Kuter Dinel

@KuterDinel

4 months

Just discovered the PyTorch developer podcast. Useful if you are interested in PyTorch internals.

Episodes | PyTorch Developer Podcast

The PyTorch Developer Podcast is a place for the PyTorch dev team to do bite sized (10-20 min) topics about all sorts of internal development topics in PyTorch.

pytorch-dev-podcast.simplecast.com

0

13

Kuter Dinel

@KuterDinel

6 months

@heyeaslo fixed it for you

0

13

Kuter Dinel

@KuterDinel

22 days

@cis_female Yes, my method is based on the algorithm in this paper. However, I made many improvements to the method. My ultimate goal is to create a Python DSL where you write near-assembly, low-level code (with simple instruction selection, register allocation, etc.).

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning | Proceedings of...

dl.acm.org

1

0

13

Kuter Dinel

@KuterDinel

10 months

@molly0xFFF I wonder if we will ever see the whole FTX code base.

1

0

11

Kuter Dinel

@KuterDinel

4 months

@Grady_Booch @brent_alvord @OpenAI @Microsoft It took 6 million years to get to the modern human. That's a lot of hyperparameter optimization.

0

11

Kuter Dinel

@KuterDinel

16 days

@sahir2k Epic paper btw. Must read if you want to do anything with SASS

0

12

Kuter Dinel

@KuterDinel

6 months

@liz_love_lace Vision Asahi linux when ?

0

12

Kuter Dinel

@KuterDinel

6 months

@prerationalist This is because of how GPT-4 generates the prompt for DALL-E. I guess there was no verbal logic in the training dataset for generating DALL-E prompts. Should be an easy fix for OA.

4

0

11

Kuter Dinel

@KuterDinel

5 months

@adithyashreshti @sama Probably some sort of attention leak from the prompt, "hamster on its back". Similar things happen with image generation where you ask for a "person with blue eyes" and the model generates a person with blue clothes.

1

0

11

Kuter Dinel

@KuterDinel

10 months

@t3dotgg Created by Fabrice Bellard. He also created QEMU. Oh forgot to mention, he invented a new way to calculate digits of PI.

Fabrice Bellard - Wikipedia

en.m.wikipedia.org

2

0

11

Kuter Dinel

@KuterDinel

6 months

@SethBling This is amazing. Do you use command blocks for the calculations? I remember seeing your BASIC interpreter in Minecraft video. Also, if you are using command blocks, how do you program them?

1

0

10

Kuter Dinel

@KuterDinel

22 days

If you are interested in SASS make sure to take a look at the amazing work done by Nouveau developers

src/nouveau/compiler/nak/ir.rs · main · Mesa / mesa · GitLab

Mesa 3D graphics library

gitlab.freedesktop.org

0

9

Kuter Dinel

@KuterDinel

9 months

@felix_red_panda The comment for retraction is this: `There are some errors in the paper and we need to retract it`. Let's wait if they will publish an updated version.

0

9

Kuter Dinel

@KuterDinel

6 months

@mallocmyheart Checkout this analysis of tensor cores from the perspective of a hardware designer.

Analysis of a Tensor Core

A video analyzing the architectural makeup of an Nvidia Volta Tensor Core.References:Pu, J., et. al. "FPMax: a 106GFLOPS/W at 217GFLOPS/mm2 Single-Precision ...

www.youtube.com

1

9

Kuter Dinel

@KuterDinel

21 days

I have two different ideas for figuring out instruction latencies. Option A) Analyze stall values in pre-compiled binaries. Option B) Create custom SASS sequences and increase the stall count until we get the correct value. I think I am going to try option A first.

1

0

9

Kuter Dinel

@KuterDinel

10 months

@an0n_r0 Wow, doesn't this have huge implications now that we have audio deepfakes? There were already a few instances of audio deepfakes being used for phishing, such as the Retool attack. But if you combine it with caller ID spoofing, most people would fall for it IMO.

0

9

Kuter Dinel

@KuterDinel

10 months

@LiveOverflow My goal in life is to become an important enough person that agencies/threat actors use 0-days on me!

1

9

Kuter Dinel

@KuterDinel

10 days

We will figure out instruction latencies by analyzing delays between data-dependent instructions. We need to look at anti-dependencies as well.

0

9
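
The idea in the tweet above (inferring latencies from the delay between a producer and its first consumer) can be sketched roughly as follows. This is an illustrative sketch, not the author's implementation: the `Inst` record, the `estimate_latencies` helper, and all stall values are hypothetical, and it assumes fixed-latency instructions whose issue spacing is governed only by the stall count. Anti-dependencies and barrier-tracked instructions are not handled.

```python
from typing import NamedTuple

class Inst(NamedTuple):
    opcode: str
    dst: tuple    # registers written
    src: tuple    # registers read
    stall: int    # cycles to wait before the next instruction issues

def estimate_latencies(program):
    """Smallest observed issue-cycle gap between a producer and a consumer."""
    issue_cycle = 0
    last_writer = {}   # register -> (producer opcode, issue cycle)
    min_gap = {}       # producer opcode -> smallest gap seen so far
    for inst in program:
        for reg in inst.src:
            if reg in last_writer:
                opcode, cycle = last_writer[reg]
                gap = issue_cycle - cycle
                min_gap[opcode] = min(gap, min_gap.get(opcode, gap))
        for reg in inst.dst:
            last_writer[reg] = (inst.opcode, issue_cycle)
        issue_cycle += inst.stall
    return min_gap

# Made-up instruction stream; the stall counts are invented, not measured.
program = [
    Inst("IMAD",  ("R0",), ("R1", "R2"), 1),
    Inst("IADD3", ("R3",), ("R4", "R5"), 4),
    Inst("FADD",  ("R6",), ("R0", "R3"), 2),
    Inst("MOV",   ("R7",), ("R6",),      1),
]

latencies = estimate_latencies(program)
print(latencies)
```

Taking the minimum over many binaries gives an upper bound on each latency: the compiler may have scheduled more slack than strictly required, so the true value can only be equal or smaller.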

Kuter Dinel

@KuterDinel

6 months

@fncischen Does the automatic inter puppylery ... ehm interpupillary distance adjustment work on dogs ?

0

8

Kuter Dinel

@KuterDinel

21 days

To build a high-performance compiler targeting SASS directly, we need to know precise instruction latencies and throughput so that we have as few pipeline stalls as possible and avoid data hazards.

0

8

Kuter Dinel

@KuterDinel

10 months

@flyingcircuits Thanks for your response. I hope your data will be safely recovered. Maybe for future missions, it may make sense to have 2 SD cards in raid-1 configuration, ideally 3 with one remaining in the space station as last resort.

1

0

7

Kuter Dinel

@KuterDinel

1 month

Still working on automatically generating documentation for the NVIDIA instruction set architecture. Here are some instructions. Next: Modifier splitting and enumeration.

0

6

Kuter Dinel

@KuterDinel

5 months

@RemiCadene @Tesla Excited to see what kind of robots Hugging Face will build!

1

0

6

Kuter Dinel

@KuterDinel

4 months

@dingboard_ Depth and RGB channels are not perfectly aligned, but here is a 3D visualization (left is Marigold). Depth Anything looks a bit more accurate, but I guess it depends on the application. Made with

1

0

6

Kuter Dinel

@KuterDinel

9 months

@ValdikSS This is why we need end to end encryption.

1

0

6

Kuter Dinel

@KuterDinel

6 months

@CanadaHonk When I was reading V8's source code, I had the same idea. Programmer annotates types in TS -> compilation to JS strips the type info -> V8 needs to deduce/collect stats on types to produce good machine code. And since V8 is speculating, the code might get deoptimized.

1

0

2

Kuter Dinel

@KuterDinel

10 months

@LiveOverflow I was an intern last summer, couldn't get a full-time offer from them since they are not really hiring graduate SWEs in Europe. Also check this out, this guy got root on Google machines by name-squatting PyPI packages.

0

6

Kuter Dinel

@KuterDinel

5 months

@__tinygrad__ Here is what Gemini generated.

2

0

6

Kuter Dinel

@KuterDinel

10 months

@ctjlewis @huggingface They are letting you use a GPU for free, which must be costing them so much. Would love to know what huggingface's burn rate is. Time will show if their strategy will work.

1

0

5

Kuter Dinel

@KuterDinel

25 days

Investigating the Warpgroup MMA (Matrix Multiply Accumulate, aka Tensor Core) instructions Nvidia added with the Hopper architecture. I wonder what `gdesc` is; maybe it's a global descriptor?

1

0

5

Kuter Dinel

@KuterDinel

2 months

Nvidia SASS Control Code Viewer.

Nvidia SASS Control Code Viewer

A viewer for Nvidia SASS instruction control codes (pipeline stall, read/write barriers, reuse flags).

kuterdinel.com

0

5

Kuter Dinel

@KuterDinel

5 months

@SethBling I want to see Angry Birds in Minecraft.

0

5

Kuter Dinel

@KuterDinel

6 months

@mayfer If you are just going to search words, maybe try word2vec? It's old but really good. If you compute `vector("King") - vector("Man") + vector("Woman")`, the closest vector is the one for Queen.

1

0

5
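
The analogy arithmetic from the reply above can be sketched with hand-made vectors. The two-dimensional `toy_vectors` below are invented for illustration (roughly "royalty" and "gender" axes) and are not real word2vec embeddings; `cosine` and `analogy` are hypothetical helper names.

```python
import math

# Hand-made toy embeddings, NOT trained word2vec vectors.
toy_vectors = {
    "king":  [0.9,  0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)

result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result)
```

With trained embeddings the target vector rarely lands exactly on a word, which is why the nearest neighbour by cosine similarity is used rather than an exact match.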

Kuter Dinel

@KuterDinel

10 months

@AutismCapital Will he ask to play League Of Legends during the trial as well ?

1

0

5

Kuter Dinel

@KuterDinel

8 months

@LiveOverflow I was thinking about the same thing ! Even just cleaning up decompiler output would be useful.

0

3

Kuter Dinel

@KuterDinel

21 days

ptxas is basically a small compiler. It tries to select uniform data path instructions if it can prove that the value is going to be the same (uniform) across the warp. AMD RDNA3 has something similar and calls it 'Scalar ALU instructions', I believe.

0

4

Kuter Dinel

@KuterDinel

2 months

@lafaiel I wonder how much transistor budget is spent on decoding x86 instructions to μops.

1

0

2

Kuter Dinel

@KuterDinel

4 months

@Teknium1 I think the ui/ux wasn't done very well. I was a GPT4 user and didn't realize it actually was shipped for a long time.

0

4

Kuter Dinel

@KuterDinel

6 months

@far__el Is showing off nvidia-smi outputs the new trend after showing off MRR? Jokes aside, may I have some of that GPU power, kind sir? Just a hundred teraflops.

0

4

Kuter Dinel

@KuterDinel

3 months

My dream is to build software that everyone in the world will enjoy.

0

1

4

Kuter Dinel

@KuterDinel

6 months

@Mrwhosetheboss As a linux user, I feel tempted by the new Macbooks.

1

0

3

Kuter Dinel

@KuterDinel

10 months

@flyingcircuits Oh no, so sorry to hear about this. Also, I am curious why a SD card is used instead of satellite down link.

1

0

4

Kuter Dinel

@KuterDinel

21 days

There is also this paper that uses custom PTX to measure latencies. But it looks like they confused uniform and regular data instructions in SASS.

1

0

4

Kuter Dinel

@KuterDinel

5 months

@alexkoch_ai The next step is to make the robots self replicate 😅

0

4

Kuter Dinel

@KuterDinel

6 months

Drinking coffee at 2 am, will implement LLM inference in numpy. Grind is 4ever

1

0

2

Kuter Dinel

@KuterDinel

5 months

@atc1441 @lozaning Make it run Doom!

0

3

Kuter Dinel

@KuterDinel

10 months

I am writing a tutorial on building a small JIT compiler that compiles a small subset of C into x64 machine code. Almost finished with it. Here is an example program.

1

0

3

Kuter Dinel

@KuterDinel

6 months

I made a Python Bytecode and AST explorer. Kinda like godbolt compiler explorer, but for Python !

Python Bytecode and AST Explorer

A Python Bytecode and AST explorer, similar to Godbolt Compiler Explorer but for Python.

kuterdinel.com

0

1

3

Kuter Dinel

@KuterDinel

25 days

Looking at the life range output from nvdisasm for the 64x8x16.F32 variant, the instruction reads and writes 4 GPRs per thread. For addressing gdesc 4 UGPRs are used ... interesting.

1

0

3

Kuter Dinel

@KuterDinel

7 months

@mervenoyann @youraimarketer Here is a screenshot of one of the test prompts I used with the model that I LoRA fine-tuned on Turkish Airoboros data. The model correctly answers the question asked ( I apologize for the red underlines. I tried a few other questions like this, and it answered those questions

1

3

Kuter Dinel

@KuterDinel

5 months

@karpathy @obsdmd Have you considered emacs org mode? I don't feel comfortable using a closed source note taking program.

0

3

Kuter Dinel

@KuterDinel

9 months

@trashh_dev It would be more cursed if the query was a string. ``` document.query("select innerText from div where class='trash'"); ```

0

3

Kuter Dinel

@KuterDinel

9 months

@TakoTreba Hey maybe quick actions should have a prefix to avoid confusion. Maybe like `!github` or `/github`

2

0

3

Kuter Dinel

@KuterDinel

24 days

I integrated nvdisasm life range info to the html output. Here is the DFMA (double fused multiply add) instruction. As you see each operand reads/writes 2 registers.

0

3

Kuter Dinel

@KuterDinel

1 month

Almost done with my nvdisasm fuzzer. Here is the recipe to encode the UTMASTG instruction.

1

0

3

Kuter Dinel

@KuterDinel

5 months

Just as an ML model gives incorrect results for out-of-distribution samples, people often make incorrect (often negative) assumptions about things that are `out of the ordinary` for them.

0

1

Kuter Dinel

@KuterDinel

9 months

midnight art coding session.

0

3

Kuter Dinel

@KuterDinel

9 months

@mayfer I believe GPT-3 uses learned token embeddings instead of one-hot. Essentially each word is a vector of n size. It can be interesting to interpolate different word vectors to get weird in-between words. Interpolating positional embeddings has been used for extending LLaMAs

1

0

3

Kuter Dinel

@KuterDinel

10 months

@BenThePearman Very cool, I think you should consider doing something similar with transpose convolution layers as well. Seeing the conv kernel move across the output to generate the image would be cool.

0

3

Kuter Dinel

@KuterDinel

10 months

After figuring out how to encode x64 instructions, arm64 feels like a breeze. Having 24bit offsets for control flow instructions feels a little bit weird though.

0

3

Kuter Dinel

@KuterDinel

1 month

Note that this doesn't include the control code section for the instruction. Will add that later. Most (all?) warp-level instructions require that the read and write barriers are set to 7 (means disabled/unset). For more info on SASS control codes:

Nvidia SASS Control Code Viewer

A viewer for Nvidia SASS instruction control codes (pipeline stall, read/write barriers, reuse flags).

kuterdinel.com

0

1

3
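
The barrier value 7 mentioned in the tweet above can be made concrete with a small pack/unpack sketch. The bit layout used here (control bits starting at bit 105 of a 128-bit instruction, with the field order and widths listed in `FIELDS`) is an assumption based on community reverse-engineering descriptions of Volta-and-later encodings, not an NVIDIA specification, and it is not taken from the tweet. The helper names are hypothetical.

```python
# ASSUMED layout, not an official spec: (field name, width in bits),
# packed from the least significant control bit upward.
CONTROL_SHIFT = 105
FIELDS = (
    ("stall", 4),
    ("yield_flag", 1),
    ("write_barrier", 3),
    ("read_barrier", 3),
    ("wait_mask", 6),
    ("reuse", 4),
)
BARRIER_UNSET = 7  # per the tweet: 7 means the barrier is disabled/unset

def encode_control(**values):
    """Pack control fields into the assumed position of a 128-bit word."""
    word = 0
    offset = CONTROL_SHIFT
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << offset
        offset += width
    return word

def decode_control(instruction):
    """Extract control fields from a 128-bit instruction given as an int."""
    fields = {}
    offset = CONTROL_SHIFT
    for name, width in FIELDS:
        fields[name] = (instruction >> offset) & ((1 << width) - 1)
        offset += width
    return fields

word = encode_control(stall=2, read_barrier=BARRIER_UNSET, write_barrier=BARRIER_UNSET)
decoded = decode_control(word)
print(decoded)
```

If the real layout differs for a given architecture, only `CONTROL_SHIFT` and `FIELDS` would need to change.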

Kuter Dinel

@KuterDinel

7 months

@youraimarketer Hey, there is also this dataset created by @mervenoyann . I am also experimenting with Turkish LLMs. I suspect merve used Airoboros or something similar to it to generate this dataset. Airoboros data works well to make Mistral models speak Turkish.

merve/turkish_instructions · Datasets at Hugging Face

huggingface.co

2

0

3

Kuter Dinel

@KuterDinel

5 months

@shxf0072 I think what is crazier is the model is only trained with video and no action data! But it learns a `latent action model` and the actions are consistent across different generations.

0

1

Kuter Dinel

@KuterDinel

5 months

@AnanthVeluvali Maybe you need to pay for things like food & shelter?

0

2

Kuter Dinel

@KuterDinel

7 months

@grkn Hello. I have a similar story. I started programming at age 11. Since I wasn't a good student, I was only able to get into Düzce University. I watched algorithms lectures on MIT OCW. I solved hundreds of LeetCode problems and did an internship at Google in my final year. Most recently, the layoffs etc.

1

0

2

Kuter Dinel

@KuterDinel

6 months

@johndmcmaster Microchip wafers look so aesthetic. I really want to have a few to frame and hang on a wall. The way they reflect light looks like peacock feathers or the morpho butterfly. I think the underlying physics is the same, nano structures interfering with visible light (structural

0

2

Kuter Dinel

@KuterDinel

10 months

@DJSnM The announcer in the live stream said that the parachute deployed early as a smart decision. Here is the exact moment where this is said:

0

2

Kuter Dinel

@KuterDinel

7 months

@dingboard_ That doesn't sound right, how can you have x million users when it's invite only?

1

0

2

Kuter Dinel

@KuterDinel

6 months

@theapplehub Interesting camera alignment for the non pro iPhone 16. Is it like this to be able to capture "spatial video"?

0

2

Kuter Dinel

@KuterDinel

1 month

Need to run nvdisasm half a million times again.

0

2

Kuter Dinel

@KuterDinel

10 months

@sama I wonder if there are any plans to make GPT models directly process audio embeddings instead of just speech-to-text output, similar to how the image processing (probably) works.

0

Kuter Dinel

@KuterDinel

9 months

@mayfer Thanks. I didn't know that one-hot was used before the embedding process. Mathematically, multiplying a matrix with a one-hot vector is the same as fetching the row where the 'one' is located.

1

0

2
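
The equivalence stated in the reply above can be checked with a small sketch. The `embedding` table, its values, and the helper names below are made up for illustration; plain Python lists are used so the example has no dependencies.

```python
# Toy embedding table: 4 tokens, 3 dimensions each (values are arbitrary).
embedding = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Row vector times matrix: result[j] = sum_i vec[i] * mat[i][j]."""
    return [
        sum(vec[i] * mat[i][j] for i in range(len(mat)))
        for j in range(len(mat[0]))
    ]

token_id = 2
via_matmul = vec_mat(one_hot(token_id, len(embedding)), embedding)
via_lookup = embedding[token_id]

print(via_matmul == via_lookup)  # True
```

Every term in each sum is zero except the one at `token_id`, so the product simply copies that row. This is why embedding layers are usually implemented as a table lookup rather than an actual matrix multiplication.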