Eric Quinnell Profile
Eric Quinnell (@divBy_zero)
937 Followers · 171 Following · 34 Media · 192 Statuses
Tesla Dojo, fmr ARM, Samsung, AMD
Joined December 2022
Pinned Tweet
@divBy_zero
Eric Quinnell
4 days
Hot takes summary:
* ISA does matter (ie var length)
* OoO SMP beats SMT always
* > 2-3GHz is negative perf/watt
* no DMA in transport protocols
* grep/bash > python at text parsing
* OoO brp > OoO predicates, pick one
3
0
18
@divBy_zero
Eric Quinnell
24 days
Berkeley's SLICE Lab invited me back to critique RISC-V's RVC and RVV extensions. Thread has full slide deck. The group is extremely impressive, and I thank them for the return opportunity to offer an alternative viewpoint.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
6
38
218
@divBy_zero
Eric Quinnell
10 days
Tesla Transport Protocol over Ethernet (TTPoE) is now open sourced: spec and an example Linux kernel model.
2
32
137
@divBy_zero
Eric Quinnell
27 days
I’m flattered to have my TTPoE talk covered by such fine internet folks
@ChipsandCheese9
Chips and Cheese
27 days
Hello you fine Internet folks, Continuing our Hot Chips coverage, we are looking at Tesla's TTPoE (Tesla's Transport Protocol over Ethernet) which Tesla has made public along with Tesla joining the Ultra Ethernet Consortium (UEC). Hope y'all enjoy!
3
24
110
3
5
62
@divBy_zero
Eric Quinnell
5 months
x86 will never catch up to the IPC of ARM or where RISC-V is going without abandoning var-length opcodes. I got a lot of questions at work this week about Apple's M4 and 10-wide decoders. Note A9 had 8-wide, this is not a new phenomenon. CPU uarch decode width transformed from
19
11
45
@divBy_zero
Eric Quinnell
22 days
Undiluted copy of my slides from @hotchipsorg
@ajtourville
ALEX
23 days
Tesla slide deck at Hot Chips 2024 Conference, August 27, 2024
DOJO: An Exa-Scale Lossy AI Network using the Tesla Transport Protocol over Ethernet (TTPoE)
Eric Quinnell PhD, Dojo Fabric Lead
Tweet media one
13
32
178
1
3
35
@divBy_zero
Eric Quinnell
12 days
This is what good engineering looks like. Un-sexy (memset) tuning to the machine you’re using. Note alignment to cachelines and datapath. (Emphasis directed to rvv and sve)
Tweet media one
@geeknik
Geeknik`s {{☀️}} Lab
13 days
Behold! The arcane sorcery of Wilco Dijkstra's glibc patch catapults memset performance by 24% on Arm Neoverse-N1 cores, conjuring a digital alchemy that will ripple through Ampere Altra servers like caffeinated unicorns!
1
10
27
3
4
29
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #10: Permutes -- destroyer of chips
It's not the FLOPs, it's the connections limiting your performance. N-to-N element selection is conceptually easy and tremendously difficult to implement well. Modern nodes are "wire dominant" -- i.e. the reduction of geometric
Tweet media one
3
2
28
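A back-of-envelope sketch of why the lesson above calls permutes wire dominant: a full N-to-N crossbar needs an N-input mux per output lane, so the selectable wiring grows roughly as N² times the element width. The function and numbers below are illustrative, not a layout estimate.

```python
# Rough wiring cost of a full N-to-N element permute (an any-to-any crossbar):
# every one of the N output lanes can select any of the N inputs, so the mux
# input wiring scales ~N^2 * element_width. Illustrative only.
def crossbar_wire_bits(n_lanes, elem_bits):
    return n_lanes * n_lanes * elem_bits   # one elem_bits-wide path per (input, output) pair

for n in (8, 16, 32, 64):
    print(n, crossbar_wire_bits(n, 32))    # 2048, 8192, 32768, 131072
```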
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
0
24
@divBy_zero
Eric Quinnell
24 days
Credit @Cardyak with permission on CPU block diagrams
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
0
23
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
0
19
@divBy_zero
Eric Quinnell
17 days
FP32 -> FP16 -> FP8 -> FP4 So when we go FP2, who’s getting voted off the island?
@Lucretiel
Lucretiel 🦀
18 days
Tweet media one
19
792
8K
4
1
19
@divBy_zero
Eric Quinnell
27 days
@ChipsandCheese9 Great summary, you show yet again y’all really understand the low level details and tradeoffs. Thanks for the great writeup of our work
2
0
16
@divBy_zero
Eric Quinnell
3 months
@bogorad222 Yes, was there on the “small cpu” team — Jaguar (originally for netbooks — strange pivot that paid off). Don’t forget XboxOne at the same time; that helped float the boat through till Zen as well. The team was uniquely excellent and survived with sarcasm, fatalism, and hopium.
0
1
16
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #9 -- UOP Caches are the Worst (TM)
Congrats to ARM's Cortex-X4 for losing the MOP cache (like Apple!)
Variable length instructions require UOP/MOP caches to decode/dispatch > 4 instr/cycle. Sorry x86. ARM/RISC-V need not have variable length in big cores -- the
Tweet media one
Tweet media two
Tweet media three
1
3
11
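A toy model of the decode-width argument above, assuming made-up instruction lengths: with a fixed length, the start of every instruction in a fetch block is known immediately, while variable lengths force a serial scan (which is exactly what a uop/mop cache exists to remember).

```python
fetch_block = [4, 2, 6, 3, 8, 2, 4, 1]     # pretend per-instruction lengths in bytes

# Fixed length: boundaries are a pure function of the index, fully parallel.
fixed_offsets = [4 * i for i in range(len(fetch_block))]

# Variable length: a serial prefix scan; instruction k+1's start needs k's length.
var_offsets, pc = [], 0
for length in fetch_block:
    var_offsets.append(pc)
    pc += length

print(fixed_offsets)   # known the moment the block arrives
print(var_offsets)     # only known after walking the block (or predecoding/caching it)
```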
@divBy_zero
Eric Quinnell
4 months
@matthewvenn Chip designers ballpark this power law as an order of magnitude per level of cache. Your chart doesn’t show L2 and L3 layers, but they would bridge that gap. This is also why you should have an order of magnitude more density and storage per level to make up for the cost.
0
1
13
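A back-of-envelope for the order-of-magnitude-per-level rule of thumb in the reply above; the per-access energies and hit rates below are made-up but plausible magnitudes, only meant to show why hit rate keeps the average sane.

```python
levels = [                  # (name, ~energy per access in pJ, hit rate seen at this level)
    ("L1", 1.0, 0.95),
    ("L2", 10.0, 0.80),
    ("L3", 100.0, 0.70),
    ("DRAM", 1000.0, 1.00),
]

avg_pj, reach = 0.0, 1.0
for name, energy, hit_rate in levels:
    avg_pj += reach * hit_rate * energy    # served at this level with this probability
    reach *= 1.0 - hit_rate                # otherwise fall through to the next level
print(round(avg_pj, 2), "pJ average per reference")   # ~5 pJ despite a 1000 pJ worst case
```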
@divBy_zero
Eric Quinnell
27 days
As said at @hotchipsorg, I’m looking forward to @ultraethernet open sourcing Tesla’s TTPoE protocol. Our distributed ethernet network and DumbNICs at scale are a proof point that lossy transport is the right vision for massive scale AI. Can’t wait for NICs to be UltraDumb!
@drjmetz
Dr. J Metz
28 days
Major welcome and congratulations to @Tesla and @elonmusk for joining @ultraethernet . In less than a year we've grown to nearly 100 companies, all devoted to open, scalable, next-generation #Ethernet networks for AI and HPC workloads.
0
9
24
0
4
13
@divBy_zero
Eric Quinnell
2 months
RTL is software — but where all lines execute simultaneously. (My default reminder to impatient engineers wondering why something <100 lines takes so long)
0
0
13
@divBy_zero
Eric Quinnell
6 months
Nobody ever got fired for adding more cache
1
1
13
@divBy_zero
Eric Quinnell
4 months
@FelixCLC_ This meme needs to play every day for the next 5 years until chip marketing gets the message
2
0
9
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #4 -- Fused Multiply-Add (FMA)
(From today's lecture at UT-Austin)
D = (A×B) + C
Invented at IBM for the RS/6000 to accelerate dot-products. The bedrock of GPUs and AI computation, including the newest LLM QKV "attention/transformer" algorithms
Tweet media one
2
2
9
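A minimal sketch of why the fused form matters: one rounding instead of two. The inputs are illustrative, and the exact rational arithmetic below just stands in for the wide internal product an FMA keeps before its single rounding.

```python
from fractions import Fraction

# a*b is exact in the reals, but rounding the product before adding c throws away
# precisely the bits that c then cancels.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0

unfused = a * b + c                                       # two roundings: a*b -> 1.0, sum -> 0.0
fused = float(Fraction(a) * Fraction(b) + Fraction(c))    # one rounding, like an FMA
print(unfused, fused)                                     # 0.0 vs -8.67e-19 (i.e. -2**-60)
```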
@divBy_zero
Eric Quinnell
27 days
@tim_zaman @PTrubey Yes, I mentioned in the talk the video input tensors are very large, and said almost exactly your point where tremendous amounts of the info is redundant. But until NNs can decompress better/more efficiently than dedicated decoders, the beast of video ingress remains
0
0
8
@divBy_zero
Eric Quinnell
4 months
Changes to anything in the tech stack, no matter how trivial, guarantee breaking things you didn’t even know existed. It’s maddening. There’s so much garbage and generational hacking in these machines, it’s a small miracle computers or networks work at all.
1
0
8
@divBy_zero
Eric Quinnell
8 months
#14: Python will overtake C++
I was writing a C++ kernel to measure the duration of a task, and a younger engineer suggested a far more intuitive python equivalent. C++ is pedantic, precise, and will code to "the metal", with an order of magnitude fewer instructions, better
Tweet media one
Tweet media two
0
0
5
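The kind of intuitive Python timing helper the post alludes to is roughly this; the original C++ kernel and the colleague's snippet aren't in the thread, so the code below is just an illustrative stand-in.

```python
import time

def time_it(fn, *args, repeats=5):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)   # best-of-N filters scheduler noise
    return best

print(f"sorted 1M ints: {time_it(sorted, list(range(1_000_000, 0, -1))):.4f} s")
```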
@divBy_zero
Eric Quinnell
1 year
One day you LLM junkies will figure out you’re buying chips for good HBM bw and fabric/IO, not for GPU/FLOPs. Transformers for text can run on RaspberryPis with good HBM/NoCs and you wouldn’t tell the difference
1
2
6
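A rough arithmetic-intensity check on the claim above for decode-time text generation: each new token streams every weight once (a big GEMV), so the FLOPs per byte are tiny and the chip is bandwidth-bound long before the multipliers matter. Model size, bandwidth, and peak FLOPs below are illustrative.

```python
params = 7e9                          # 7B-class model
bytes_per_param = 2                   # fp16 weights
flops_per_token = 2 * params          # one multiply-accumulate per weight
bytes_per_token = params * bytes_per_param

hbm_bw = 2.0e12                       # 2 TB/s, illustrative
peak_flops = 1.0e15                   # 1 PFLOP/s fp16, illustrative

print(round(hbm_bw / bytes_per_token))        # ~143 tokens/s if bandwidth-limited
print(round(peak_flops / flops_per_token))    # ~71429 tokens/s if compute-limited
```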
@divBy_zero
Eric Quinnell
2 months
@rzidane360 Also the first chicken bit to disable when stuff doesn’t work. “We’ll get back to it later”. No, no you won’t.
0
1
6
@divBy_zero
Eric Quinnell
22 days
@Quartr_App @Rainmaker1973 @kenshirriff @thijsvanderspil @nrossolillo @StockSavvyShay @techfund1 @ehrazahmedd @danielnewmanUV @dissectmarkets @ftr_investors I keep seeing this chart every 5-10 years and it keeps being completely wrong. Moore’s law is really just humans solving problems. Process engineers aren’t running out of ideas by any means and will keep making it better and cheaper at scale
0
0
7
@divBy_zero
Eric Quinnell
1 year
@never_released But does it set up the A20 bit? If you’re gonna jank, jank it right
0
0
5
@divBy_zero
Eric Quinnell
6 months
@ghidraninja Absolutely zero. Regardless of endian you have to wait around to calculate FEC and CRC, so you’re gonna wait for all endian anyway to see what bits went missing
0
0
5
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #6 -- Cache Replacement
Cache replacement is not only about maximizing performance with predictive residency, but about reducing high-power I/O accesses. Gold standard and baseline is the RRIP algorithm from @AamerJaleel rather than LRU.
Tweet media one
0
1
4
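A toy SRRIP-style set in Python to show the re-reference-interval idea; the actual RRIP/DRRIP work (Jaleel et al.) adds set dueling and other machinery this sketch skips.

```python
MAX_RRPV = 3    # 2-bit re-reference prediction value per line

class RRIPSet:
    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [MAX_RRPV] * ways

    def access(self, tag):
        if tag in self.tags:                      # hit: predict near-immediate re-reference
            self.rrpv[self.tags.index(tag)] = 0
            return True
        while MAX_RRPV not in self.rrpv:          # miss: age lines until one looks "distant"
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(MAX_RRPV)
        self.tags[victim] = tag
        self.rrpv[victim] = MAX_RRPV - 1          # insert at "long", not "distant", interval
        return False
```

Unlike LRU, a one-shot streaming scan is inserted at a long re-reference interval and ages out before it can flush the reused working set.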
@divBy_zero
Eric Quinnell
9 months
#13: Virtual Channels (VC)
In NOCs, caches, meshes, ethernet, and interconnects, a "Physical Channel" (PC) represents a physical media (i.e. wires, optics) that transmits bits. Different classes of traffic (e.g. snoops/probes vs data transmission) sharing a PC will inevitably
Tweet media one
1
1
5
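A toy head-of-line-blocking demo for the truncated point above: pretend the data class has lost its downstream credits. With a single shared FIFO on the physical channel, a stuck data packet blocks the snoops queued behind it; with per-class virtual channels, the snoop VC keeps using the wires. Entirely illustrative.

```python
from collections import deque

shared_fifo = deque(["DATA", "SNOOP", "SNOOP", "SNOOP"])
vcs = {"DATA": deque(["DATA"]), "SNOOP": deque(["SNOOP", "SNOOP", "SNOOP"])}

def step_shared():
    if shared_fifo and shared_fifo[0] != "DATA":   # data is credit-stalled downstream
        return shared_fifo.popleft()
    return None                                    # head-of-line blocked

def step_vc():
    if vcs["SNOOP"]:                               # snoop VC independently arbitrates for the link
        return vcs["SNOOP"].popleft()
    return None

print([step_shared() for _ in range(4)])   # [None, None, None, None]
print([step_vc() for _ in range(4)])       # ['SNOOP', 'SNOOP', 'SNOOP', None]
```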
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #8 -- Skid Buffers
Data transported across deep pipelines cannot same-cycle start/stop the whole pipe on collision events. The classic solution is a credit/token controlled "Skid" Buffer. When a remote endpoint needs to stall, it has sufficient local storage to
Tweet media one
1
2
4
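A behavioral Python sketch of the credit/token idea in the lesson above (names and sizes are illustrative): the sender only launches while it holds a credit, so anything already in flight when the far end stalls is guaranteed a slot in the skid storage. Credit-return latency is ignored here; real designs size the skid for the full round trip.

```python
from collections import deque

class CreditLink:
    def __init__(self, skid_depth, pipe_latency):
        self.skid_depth = skid_depth
        self.credits = skid_depth                    # sender-side tokens
        self.pipe = deque([None] * pipe_latency)     # models the wire/pipeline delay
        self.skid = deque()                          # receiver-side skid buffer

    def cycle(self, flit=None, receiver_ready=True):
        if flit is not None and self.credits > 0:    # launch only with a credit in hand
            self.credits -= 1
            self.pipe.append(flit)
        else:
            self.pipe.append(None)
        arriving = self.pipe.popleft()
        if arriving is not None:
            self.skid.append(arriving)               # room guaranteed by the credit count
        if receiver_ready and self.skid:
            self.skid.popleft()
            self.credits += 1                        # credit return
        assert len(self.skid) <= self.skid_depth     # never overflows, even mid-stall

link = CreditLink(skid_depth=4, pipe_latency=3)
for n in range(12):
    link.cycle(flit=n, receiver_ready=(n < 5))       # receiver stalls from cycle 5 onward
```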
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #5 -- The A20 bit
In a stunning example of unintended consequences to support legacy code, IBM "hacked" in a gate mask for the 20th physical address bit when using the Intel 80286. Undoing this mistake took 30 years on CPUs alone.
0
1
4
@divBy_zero
Eric Quinnell
1 month
@lamchester This is the price of variable length decode, period. And “90% coverage from mop cache” running spec is cute and bypasses the dispatch bottleneck, but nobody runs spec sized kernels in real life. x86 HAS to pump frequency to match apple ipc, they cannot width their way there.
0
0
4
@divBy_zero
Eric Quinnell
5 months
@sirmo Here’s where I disagree. Two half-sized OoO cores at 80% peak IPC each, in the same footprint as an x86 OoO SMT core: the SMT core gets 1.3X on a tuned SMT app, which now loses to the 1.6X throughput of the two half-size cores. Half-size cores can fully power off at idle independently and not self
0
0
4
@divBy_zero
Eric Quinnell
1 month
“Write a haiku about PXE”

Grok:
PXE boots fast,
Network, server, magic dance,
Workstation awake

ChatGPT:
Boots from network’s call,
Silent images unfold—
PXE guides the way.

Note the difference in syllables for ‘PXE’
1
0
4
@divBy_zero
Eric Quinnell
4 months
@Cardyak 10-wide decode, 8x int ALUs, 4xLD, 2xST will chew through llvm, JS, and python indirect threading code like God intended. Nitpick, decode should be 12-16 to match 64B fetch max burst, and 6x FADD/FMUL is a bit overkill that’s gonna be dark silicon outside benchmarking.
1
0
4
@divBy_zero
Eric Quinnell
2 years
uarch micro-lesson #1
Branch Prediction is a power law, as observed by @djimeneth. Is code, therefore, constrained by natural law? Notes and slides available from the Berkeley lecture
Tweet media one
1
0
4
@divBy_zero
Eric Quinnell
1 year
@Michael_J_Black I disagree completely, the motivation is exactly inverted. One should PhD (or patent or publish) with a passion for *an idea* that can be realized only available along that path. Skills and expertise are granted to advance the field, not improve your vanity. Chasing
0
0
4
@divBy_zero
Eric Quinnell
4 months
printf and breakpoints are part of Heisenberg's uncertainty principle. You only think you fixed that bug bc you’re looking at it, but you didn’t.
1
0
4
@divBy_zero
Eric Quinnell
5 months
@blu51899890 @ghostway_chess @damageboy Oh ya, and that. So much collateral damage from var length, it’s hard to enumerate it all
1
0
3
@divBy_zero
Eric Quinnell
7 months
Damn kids and their pytorch. Back in my day we could object dump and see what was up
1
0
3
@divBy_zero
Eric Quinnell
6 months
@fermatslibrary Lol, did they include their own paper in the “not disruptive” list?
0
0
2
@divBy_zero
Eric Quinnell
2 years
If you can fix a bug in HW or SW, it is always fixed in HW. Not bc one is easier than another. Rather, HW already ECO’d on their laptop in the same bug meeting where SW is still enumerating all their new acronyms and toolchains
0
0
3
@divBy_zero
Eric Quinnell
2 months
@rzidane360 +1 to “stamp out fmas instead of higher freq”. To that point, only thing I’d change on the link is to say “reduce V until it barely functions” — frequency is dependent on voltage, but part of ‘C’ below. It’s V^3 then, not squared.
Tweet media one
0
0
3
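The arithmetic behind the V³ comment, with illustrative numbers: dynamic power goes as C·V²·f, and the frequency you can close timing at tracks V roughly linearly, so dropping voltage pays off cubically.

```python
def dyn_power(c_eff, volts, freq_hz):
    return c_eff * volts**2 * freq_hz       # classic C*V^2*f dynamic power

base = dyn_power(1.0, 1.00, 3.0e9)
lowv = dyn_power(1.0, 0.70, 3.0e9 * 0.70)   # assume achievable f scales with V
print(round(lowv / base, 2))                # 0.34, i.e. roughly 0.7**3
```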
@divBy_zero
Eric Quinnell
17 days
@cmuratori @MarioVerbelen @ThePrimeagen Yes, agreed, and the timing pressure. Countless conversations about how to get away from 4k pages and go 16k at least, but too much infra already in 4k
1
0
3
@divBy_zero
Eric Quinnell
2 months
@FelixCLC_ From their own spec: For embedded I concede. “Energy efficiency for all applications” is total horse 💩, especially self citing the authors’ masters thesis. 25% my ass (brp defeats this instantly)
Tweet media one
0
0
3
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #7 -- Comparing Compute
Comparing FLOPS is analogous to comparing cathedrals by square footage. The system is far too complex for such a trivial metric. All that matters is your wall-power and perf curve on your actual application -- system, software, all of it.
Tweet media one
0
2
3
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff Right, agreed, but you get the voltage loss on the pull up with nmos only. With sense amps it does not matter, the output of interest is the speed of detection rather than saturation level.
1
0
3
@divBy_zero
Eric Quinnell
11 months
Grok is passing first tier verilog hardware interview questions
Tweet media one
0
0
3
@divBy_zero
Eric Quinnell
1 year
@itsclivetime An astute observation and the answer is “sort of”. You would end up overloading other wires to achieve the same per element, much like the ucode in cisc instructions. I.e. you’re now using the mux-down wires from the register/sram entries you temporarily store with to iterate
1
0
3
@divBy_zero
Eric Quinnell
1 year
@jbhuang0604 Disagree, ppl need to break this model of indentured servitude. They’re being paid, while you are not. You create the product which they get honor for, not the other way around. Nobody cares whom you studied under; they’re not the demi-gods you (and they) currently think they
0
0
3
@divBy_zero
Eric Quinnell
1 year
FLOPs/Watt is ruining the HW world BW/Watt is better. Inferences or activations or trained weights/Watt is more better. What is most better?
0
0
3
@divBy_zero
Eric Quinnell
10 months
AI is the most interesting field for computer architects right now. 5-10% avg utilization means it's completely and fundamentally incorrect. What's the right architecture? We don't know yet, but it's not what we have now.
0
0
3
@divBy_zero
Eric Quinnell
2 years
AI rate of advancement is due to arXiv. Patents are a write-only memory with a 2 year publish latency. Peer-review conferences have become the curation of power structures. The free flow of ideas is the true singularity
0
0
3
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
1
0
4
@divBy_zero
Eric Quinnell
1 month
This is profound for 2 reasons:
1) Exact token inputs resulting in two different subtle pronunciations of non-technical terms is mind blowing. There’s weights in there making this differentiation.
2) I probably should take a break from booting stuff in the datacenter
1
0
3
@divBy_zero
Eric Quinnell
1 year
@O42nl Chipotle on Camino Real. Bc out of staters (like me) recognize the names of the ingredients and they don’t put “grass jelly” or other weird crap in your tea
0
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff Not officially, yes. Note ARM’s Cortex-X4 is 10-way superscalar, as is I think the latest apple core. The real trick to this is no variable length instructions
1
0
2
@divBy_zero
Eric Quinnell
7 months
@rzidane360 Great talk, reposting. Friendly reminder that your meat-NN is mostly I/O and interconnect, not cortex.
0
0
2
@divBy_zero
Eric Quinnell
27 days
@dannysaigal @SawyerMerritt @WholeMarsBlog @Tesla Lol, or we didn’t close the door all the way when taking the picture
0
0
2
@divBy_zero
Eric Quinnell
4 months
@2sush If a candidate drew this on the whiteboard in an interview I would cancel the remaining panel and hire them immediately
0
0
1
@divBy_zero
Eric Quinnell
3 months
@yacineMTB I believe that’s covered in 802.3fq, be sure to adhere to all 1000 pages of spec
0
0
2
@divBy_zero
Eric Quinnell
16 days
@YunTaTsai1 @francoisfleuret They just don’t get the objectivity of it all
1
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff All I see in industry is 6T bc cmos digital is extremely forgiving to process deviations, especially since FinFETs. But that’s anecdotal, perhaps some players still 4T?
1
0
2
@divBy_zero
Eric Quinnell
17 days
@vvedavyas @Anno0770 @Cardyak Persistence pays off. Sent pdf
1
0
2
@divBy_zero
Eric Quinnell
3 months
Lossless fabric is the worst
0
0
2
@divBy_zero
Eric Quinnell
1 month
@itsclivetime @united PS — please don’t type your predefined answers. Send us a letter
0
0
2
@divBy_zero
Eric Quinnell
5 months
@never_released @blu51899890 @ghostway_chess @damageboy Agreed, the 2B/4B extension is self harming for perf cores. As long as it remains optional good uarch can avoid the pitfall. I also think RV is in dangerous territory with this one. Yes, arm learned this the hard way with thumb and apple ripped it out even sooner. BTB pollution
2
0
2
@divBy_zero
Eric Quinnell
2 years
uArch uLesson #2 -- Hot Ones
Diagram of hot ones comes from IBM's CELL FPU paper. Radix-4 Booth encoders are still in vogue due to easy power-of-two shifting ×(-2, -1, 0, 1, 2), with the 2's complement +1 added in as "hot ones" in later partial products
Tweet media one
0
0
2
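A small sketch of radix-4 Booth recoding for an 8-bit two's-complement multiplier: digits in {-2, -1, 0, 1, 2} halve the partial-product count, and each negative digit is where a real array folds in the two's-complement +1 as a hot one in a later row (here Python's signed arithmetic hides that detail).

```python
RECODE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
          0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(m, bits=8):
    padded = (m & ((1 << bits) - 1)) << 1        # append the implicit 0 below bit 0
    return [RECODE[(padded >> i) & 0b111]        # overlapping 3-bit windows, stepping by 2
            for i in range(0, bits, 2)]          # digit i carries weight 4**i

def booth_multiply(a, m, bits=8):
    digits = booth_radix4_digits(m, bits)
    return sum(d * (4 ** i) * a for i, d in enumerate(digits))

assert booth_multiply(13, 11) == 13 * 11
assert booth_radix4_digits(11) == [-1, -1, 1, 0]   # 11 = -1 - 4 + 16
```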
@divBy_zero
Eric Quinnell
5 months
@FelixCLC_ 100Gbps of network connectivity?! Slow down there, cowboy. So like…one NIC?
0
0
2
@divBy_zero
Eric Quinnell
1 year
@_akhaliq Trade them in for some Dojos
0
0
2
@divBy_zero
Eric Quinnell
2 months
New idea — we can prevent the AI robot apocalypse by installing VMs as their default OS!
0
0
3
@divBy_zero
Eric Quinnell
27 days
@entropicEm During the talk I pointed out the all reduce is a mere characterization of the network in isolation. You are correct, our SOW all reduce on wafer is insanely fast. Combined with TTPoE and pipelining, our full all reduce in system is ridiculously good
0
0
2
@divBy_zero
Eric Quinnell
1 year
Alternative uArchs are prolific in literature and industry, all finding ways to avoid the tyranny of the addend alignment. But sometimes the original uArch remains the best choice, especially when needing minimum area density and choosing throughput over latency
Tweet media one
0
0
2
@divBy_zero
Eric Quinnell
4 months
@damageboy @TheKanter @jordanschnyc @dylan522p @_fabknowledge_ @asianometry To be fair, SVE is ARM’s ISA stupid tax. Addressing modes is RV’s. So they all have one. Var length punishes most
2
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff The sense amp is usually at the end of the precharged shared bit columns of rows of 6Ts to accelerate the readout. The 4T shared is a false 4T, any “resistors” on modern nodes are most commonly themselves transistors as well
1
0
2
@divBy_zero
Eric Quinnell
2 years
uArch uLesson #3 -- Floating-Point Rounding
IEEE-754 has four rounding requirements, but how to implement them? By special bits around the LSB or ULP, combined with the sign bit: the Guard, Round, Sticky, and Carry bits
Tweet media one
1
0
2
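A sketch of round-to-nearest-even with guard and sticky bits on an integer significand that has to drop its low `shift` bits; the separate round bit and the carry-out/renormalize path of real hardware are folded away for brevity.

```python
def round_rne(sig, shift):
    assert shift >= 1
    kept = sig >> shift
    guard = (sig >> (shift - 1)) & 1                        # first dropped bit
    sticky = 1 if (sig & ((1 << (shift - 1)) - 1)) else 0   # OR of everything below it
    if guard and (sticky or (kept & 1)):                    # more than half, or a tie with odd LSB
        kept += 1
    return kept

assert round_rne(0b10110, 2) == 0b110   # 5.5 is a tie, round to even -> 6
assert round_rne(0b10010, 2) == 0b100   # 4.5 is a tie, round to even -> 4
assert round_rne(0b10011, 2) == 0b101   # 4.75 rounds up -> 5
```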
@divBy_zero
Eric Quinnell
2 months
@rzidane360 @FelixCLC_ “It’s only 2 gate delays!”
0
0
2
@divBy_zero
Eric Quinnell
22 days
@CamelCdr @Cardyak Fair enough, just pointing out it is definitely a different direction. Btw your points make great discussion and I was initially replying to lots — but will give space to all the other techies who clearly showed up. I had my turn :)
1
0
2
@divBy_zero
Eric Quinnell
1 year
@bernhardsson Bc @FFmpeg community just gave up on codecs that get more than 1 IPC
0
0
2
@divBy_zero
Eric Quinnell
2 years
A picture is worth 1,000 words. A video is 60 pictures per second. Does that make visual AI worth 60,000 ChatGPTs?
0
0
2
@divBy_zero
Eric Quinnell
2 years
Every twitter python code sharing example:
import theThing
theThing.doTheThing()
The insight is just incredible.
@clcoding
Python Coding
2 years
Day 143 : Python code for creating a joy plot
Tweet media one
4
88
500
0
0
2
@divBy_zero
Eric Quinnell
8 months
@radshaan “Learning GNU Emacs” and “Mastering Regular Expressions” are not books one internalizes. This collection is mostly classroom books or reference manuals, not stuff you sit with a glass of wine by the fire
0
0
2
@divBy_zero
Eric Quinnell
16 days
@davorVDR So @tenstorrent voted exponent off the island then for FP2! Ok, so when’s FP1 :) Choosing max exponent and normalizing the mantissa is clever, you capture the range +/- the mantissa width and some precision. Is there an equal/opposite block float where mantissa is shared and
1
0
2
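A minimal block-float sketch of the shared-max-exponent scheme being discussed: one exponent for the block, small signed mantissas per element. Rounding overflow, all-zero blocks, and subnormals are ignored; the parameters are illustrative.

```python
import math

def block_quantize(xs, mant_bits=4):
    shared_exp = max(math.frexp(abs(x))[1] for x in xs if x != 0.0)   # max exponent in block
    scale = 2.0 ** (shared_exp - mant_bits)
    return shared_exp, [int(round(x / scale)) for x in xs]            # small signed integers

def block_dequantize(shared_exp, mants, mant_bits=4):
    scale = 2.0 ** (shared_exp - mant_bits)
    return [m * scale for m in mants]

exp, m = block_quantize([0.9, 0.05, -0.3, 0.002])
print(exp, m)                      # 0 [14, 1, -5, 0]
print(block_dequantize(exp, m))    # small values lose precision to the big one
```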
@divBy_zero
Eric Quinnell
9 months
@itsclivetime Agreed. This is closer to the SRT algorithm than the Newton-Raphson method (multiplier based). This will be much smaller in LUTs at the expense of latency. You could modify the above to a higher radix (try radix 16) and make it even faster with few extra gates.
0
0
1
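For contrast with a multiplier-based Newton-Raphson divider, here is a minimal digit-recurrence (restoring) divider, the family SRT belongs to: a compare/subtract per step and one quotient bit per iteration, so it is tiny in logic but pays in latency. Real SRT uses a higher radix with redundant quotient digits and a small lookup table.

```python
def restoring_divide(n, d, bits=16):
    q, r = 0, 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        if r >= d:
            r -= d
            q |= 1 << i                 # quotient digit 1, else leave 0
    return q, r

assert restoring_divide(1000, 7) == (142, 6)
```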
@divBy_zero
Eric Quinnell
5 months
@HaroldAptroot @st01014 Avg cpu freq 4.7GHz. Here’s the blind spot. Run it at 2GHz bc that’s sustainable beyond a benchmark
1
0
1
@divBy_zero
Eric Quinnell
3 months
@YunTaTsai1 All the ones that are 3 years or older
0
0
1
@divBy_zero
Eric Quinnell
2 years
Auto-correlation shows how relevant each previous branch in the kernel history is to predicting the current branch
Tweet media one
1
0
1
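A toy illustration of the autocorrelation idea above, on a made-up branch stream: a loop-exit branch with period 5 plus a little noise. The correlation spikes at lag 5, which is exactly the history position a predictor would want to key on.

```python
import random

random.seed(0)
outcomes = []
for i in range(4096):
    taken = (i % 5 != 0)                 # loop branch: taken 4 of every 5 iterations
    if random.random() < 0.05:
        taken = not taken                # sprinkle in some noise
    outcomes.append(1 if taken else -1)

def autocorr(xs, lag):
    mean = sum(xs) / len(xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(len(xs) - lag))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

for lag in range(1, 8):
    print(lag, round(autocorr(outcomes, lag), 3))   # lag 5 stands out: the loop period
```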
@divBy_zero
Eric Quinnell
2 years
Round to Nearest Even example 2, the corner case
Tweet media one
0
0
1
@divBy_zero
Eric Quinnell
6 months
@glennklockwood FP64 appears in all workloads where you’re in a rush with no time to weed out why someone put it in. Academic papers on quad precision datapaths rely on quad precision
0
0
1
@divBy_zero
Eric Quinnell
8 months
@rpoo Urgency is directly proportional to stress. And directly proportional to results. Knowingly choose that path and all it comes with
0
0
1
@divBy_zero
Eric Quinnell
22 days
@bjorntopel Yes it was recorded, but for Berkeley only. I’d rather not have grad students’ questions or arguments vetted by public forums. My arguments? Go for it, internet
0
0
1
@divBy_zero
Eric Quinnell
5 months
@Prakhar6200 Noted, as well as your citations of no SMT and isa trying to move to fixed width and removing some old junk (A20 bit etc). It’s moving in the right direction after useless adventures into memory transactions and AVX-width-so-big-did-it-throttle-oops wastes of time. But many
1
0
1
@divBy_zero
Eric Quinnell
2 years
@itsclivetime It really likes LeBron
Tweet media one
0
0
1
@divBy_zero
Eric Quinnell
2 months
@RyanEls4 This post may not get a reliable return
0
0
1
@divBy_zero
Eric Quinnell
11 days
@essobi @FFmpeg @NVIDIAHPCDev You had me at free bsd tattoo. But ya, don’t ever stop optimizing for coalesced datapaths. Bc that would be really dumb, and you don’t look dumb
0
0
1
@divBy_zero
Eric Quinnell
8 months
A colleague mentioned to me it is bc losses are differentiable and can be minimized, as opposed to gains that can go to infinity and not be differentiated
0
0
1