Eric Quinnell Profile
Eric Quinnell (@divBy_zero)
937 Followers · 171 Following · 34 Media · 192 Statuses
Tesla Dojo, fmr ARM, Samsung, AMD
Joined December 2022
Pinned Tweet
@divBy_zero
Eric Quinnell
4 days
Hot takes summary:
* ISA does matter (ie var length)
* OoO SMP beats SMT always
* > 2-3GHz is negative perf/watt
* no DMA in transport protocols
* grep/bash > python at text parsing
* OoO brp > OoO predicates, pick one
3
0
18
@divBy_zero
Eric Quinnell
24 days
Berkeley's SLICE Lab invited me back to critique RISC-V's RVC and RVV extensions. Thread has full slide deck. The group is extremely impressive, and I thank them for the return opportunity to offer an alternative viewpoint.
Tweet media one
Tweet media two
Tweet media three
Tweet media four
6
38
218
@divBy_zero
Eric Quinnell
10 days
Tesla Transport Protocol over Ethernet (TTPoE) is now open sourced: spec and an example Linux kernel model.
2
32
137
@divBy_zero
Eric Quinnell
27 days
I’m flattered to have my TTPoE talk covered by such fine internet folks
@ChipsandCheese9
Chips and Cheese
27 days
Hello you fine Internet folks, Continuing our Hot Chips coverage, we are looking at Tesla's TTPoE (Tesla's Transport Protocol over Ethernet) which Tesla has made public along with Tesla joining the Ultra Ethernet Consortium (UEC). Hope y'all enjoy!
3
24
110
3
5
62
@divBy_zero
Eric Quinnell
5 months
x86 will never catch up to the IPC of ARM or where RISC-V is going without abandoning var-length opcodes. I got a lot of questions at work this week about Apple's M4 and 10-wide decoders. Note A9 had 8-wide, this is not a new phenomenon. CPU uarch decode width transformed from
19
11
45
@divBy_zero
Eric Quinnell
22 days
Undiluted copy of my slides from @hotchipsorg
@ajtourville
ALEX
23 days
Tesla slide deck at Hot Chips 2024 Conference, August 27, 2024
DOJO: An Exa-Scale Lossy AI Network using the Tesla Transport Protocol over Ethernet (TTPoE)
Eric Quinnell PhD, Dojo Fabric Lead
Tweet media one
13
32
178
1
3
35
@divBy_zero
Eric Quinnell
12 days
This is what good engineering looks like. Un-sexy (memset) tuning to the machine you’re using. Note alignment to cachelines and datapath. (Emphasis directed to rvv and sve)
Tweet media one
@geeknik
Geeknik`s {{☀️}} Lab
13 days
Behold! The arcane sorcery of Wilco Dijkstra's glibc patch catapults memset performance by 24% on Arm Neoverse-N1 cores, conjuring a digital alchemy that will ripple through Ampere Altra servers like caffeinated unicorns!
1
10
27
3
4
29
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #10: Permutes -- destroyer of chips
It's not the FLOPs, it's the connections limiting your performance. N-to-N element selection is conceptually easy and tremendously difficult to implement well. Modern nodes are "wire dominant" -- i.e. the reduction of geometric
Tweet media one
3
2
28
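A back-of-envelope sketch of why the lesson above calls permutes wire dominant: a full N-to-N crossbar needs an N-input mux per output lane, so the selectable wiring grows roughly as N² times the element width. The function and numbers below are illustrative, not a layout estimate.

```python
# Rough wiring cost of a full N-to-N element permute (an any-to-any crossbar):
# every one of the N output lanes can select any of the N inputs, so the mux
# input wiring scales ~N^2 * element_width. Illustrative only.
def crossbar_wire_bits(n_lanes, elem_bits):
    return n_lanes * n_lanes * elem_bits   # one elem_bits-wide path per (input, output) pair

for n in (8, 16, 32, 64):
    print(n, crossbar_wire_bits(n, 32))    # 2048, 8192, 32768, 131072
```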
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
0
24
@divBy_zero
Eric Quinnell
24 days
Credit @Cardyak with permission on CPU block diagrams
Tweet media one
Tweet media two
Tweet media three
Tweet media four
4
0
23
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
0
19
@divBy_zero
Eric Quinnell
17 days
FP32 -> FP16 -> FP8 -> FP4 So when we go FP2, who’s getting voted off the island?
@Lucretiel
Lucretiel 🦀
18 days
Tweet media one
19
792
8K
4
1
19
@divBy_zero
Eric Quinnell
27 days
@ChipsandCheese9 Great summary, you show yet again y’all really understand the low level details and tradeoffs. Thanks for the great writeup of our work
2
0
16
@divBy_zero
Eric Quinnell
3 months
@bogorad222 Yes, was there on the “small cpu” team — Jaguar (originally for netbooks — strange pivot that paid off). Don’t forget XboxOne at the same time; that helped float the boat through till Zen as well. The team was uniquely excellent and survived with sarcasm, fatalism, and hopium.
0
1
16
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #9 -- UOP Caches are the Worst (TM)
Congrats to ARM's Cortex-X4 for losing the MOP cache (like Apple!)
Variable length instructions require UOP/MOP caches to decode/dispatch > 4 instr/cycle. Sorry x86. ARM/RISC-V need not have variable length in big cores -- the
Tweet media one
Tweet media two
Tweet media three
1
3
11
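A toy model of the decode-width argument above, assuming made-up instruction lengths: with a fixed length, the start of every instruction in a fetch block is known immediately, while variable lengths force a serial scan (which is exactly what a uop/mop cache exists to remember).

```python
fetch_block = [4, 2, 6, 3, 8, 2, 4, 1]     # pretend per-instruction lengths in bytes

# Fixed length: boundaries are a pure function of the index, fully parallel.
fixed_offsets = [4 * i for i in range(len(fetch_block))]

# Variable length: a serial prefix scan; instruction k+1's start needs k's length.
var_offsets, pc = [], 0
for length in fetch_block:
    var_offsets.append(pc)
    pc += length

print(fixed_offsets)   # known the moment the block arrives
print(var_offsets)     # only known after walking the block (or predecoding/caching it)
```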
@divBy_zero
Eric Quinnell
4 months
@matthewvenn Chip designers ballpark this power law as an order of magnitude per level of cache. Your chart doesn’t show L2 and L3 layers, but they would bridge that gap. This is also why you should have an order of magnitude more density and storage per level to make up for the cost.
0
1
13
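A back-of-envelope for the order-of-magnitude-per-level rule of thumb in the reply above; the per-access energies and hit rates below are made-up but plausible magnitudes, only meant to show why hit rate keeps the average sane.

```python
levels = [                  # (name, ~energy per access in pJ, hit rate seen at this level)
    ("L1", 1.0, 0.95),
    ("L2", 10.0, 0.80),
    ("L3", 100.0, 0.70),
    ("DRAM", 1000.0, 1.00),
]

avg_pj, reach = 0.0, 1.0
for name, energy, hit_rate in levels:
    avg_pj += reach * hit_rate * energy    # served at this level with this probability
    reach *= 1.0 - hit_rate                # otherwise fall through to the next level
print(round(avg_pj, 2), "pJ average per reference")   # ~5 pJ despite a 1000 pJ worst case
```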
@divBy_zero
Eric Quinnell
27 days
As said at @hotchipsorg, I’m looking forward to @ultraethernet open sourcing Tesla’s TTPoE protocol. Our distributed ethernet network and DumbNICs at scale are a proof point that lossy transport is the right vision for massive scale AI. Can’t wait for NICs to be UltraDumb!
@drjmetz
Dr. J Metz
28 days
Major welcome and congratulations to @Tesla and @elonmusk for joining @ultraethernet . In less than a year we've grown to nearly 100 companies, all devoted to open, scalable, next-generation #Ethernet networks for AI and HPC workloads.
0
9
24
0
4
13
@divBy_zero
Eric Quinnell
2 months
RTL is software — but where all lines execute simultaneously. (My default reminder to impatient engineers wondering why something <100 lines takes so long)
0
0
13
@divBy_zero
Eric Quinnell
6 months
Nobody ever got fired for adding more cache
1
1
13
@divBy_zero
Eric Quinnell
4 months
@FelixCLC_ This meme needs to play every day for the next 5 years until chip marketing gets the message
2
0
9
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #4 -- Fused Multiply-Add (FMA)
(From today's lecture at UT-Austin)
D = (A×B) + C
Invented at IBM for the RS/6000 to accelerate dot-products. The bedrock of GPUs and AI computation, including the newest LLM QKV "attention/transformer" algorithms
Tweet media one
2
2
9
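A minimal sketch of why the fused form matters: one rounding instead of two. The inputs are illustrative, and the exact rational arithmetic below just stands in for the wide internal product an FMA keeps before its single rounding.

```python
from fractions import Fraction

# a*b is exact in the reals, but rounding the product before adding c throws away
# precisely the bits that c then cancels.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0

unfused = a * b + c                                       # two roundings: a*b -> 1.0, sum -> 0.0
fused = float(Fraction(a) * Fraction(b) + Fraction(c))    # one rounding, like an FMA
print(unfused, fused)                                     # 0.0 vs -8.67e-19 (i.e. -2**-60)
```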
@divBy_zero
Eric Quinnell
27 days
@tim_zaman @PTrubey Yes, I mentioned in the talk the video input tensors are very large, and said almost exactly your point where tremendous amounts of the info is redundant. But until NNs can decompress better/more efficiently than dedicated decoders, the beast of video ingress remains
0
0
8
@divBy_zero
Eric Quinnell
4 months
Changes to anything in the tech stack, no matter how trivial, guarantee breaking things you didn’t even know existed. It’s maddening. There’s so much garbage and generational hacking in these machines, it’s a small miracle computers or networks work at all.
1
0
8
@divBy_zero
Eric Quinnell
8 months
#14: Python will overtake C++
I was writing a C++ kernel to measure the duration of a task, and a younger engineer suggested a far more intuitive python equivalent. C++ is pedantic, precise, and will code to "the metal", with an order of magnitude fewer instructions, better
Tweet media one
Tweet media two
0
0
5
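The kind of intuitive Python timing helper the post alludes to is roughly this; the original C++ kernel and the colleague's snippet aren't in the thread, so the code below is just an illustrative stand-in.

```python
import time

def time_it(fn, *args, repeats=5):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)   # best-of-N filters scheduler noise
    return best

print(f"sorted 1M ints: {time_it(sorted, list(range(1_000_000, 0, -1))):.4f} s")
```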
@divBy_zero
Eric Quinnell
1 year
One day you LLM junkies will figure out you’re buying chips for good HBM bw and fabric/IO, not for GPU/FLOPs. Transformers for text can run on RaspberryPis with good HBM/NoCs and you wouldn’t tell the difference
1
2
6
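A rough arithmetic-intensity check on the claim above for decode-time text generation: each new token streams every weight once (a big GEMV), so the FLOPs per byte are tiny and the chip is bandwidth-bound long before the multipliers matter. Model size, bandwidth, and peak FLOPs below are illustrative.

```python
params = 7e9                          # 7B-class model
bytes_per_param = 2                   # fp16 weights
flops_per_token = 2 * params          # one multiply-accumulate per weight
bytes_per_token = params * bytes_per_param

hbm_bw = 2.0e12                       # 2 TB/s, illustrative
peak_flops = 1.0e15                   # 1 PFLOP/s fp16, illustrative

print(round(hbm_bw / bytes_per_token))        # ~143 tokens/s if bandwidth-limited
print(round(peak_flops / flops_per_token))    # ~71429 tokens/s if compute-limited
```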
@divBy_zero
Eric Quinnell
2 months
@rzidane360 Also the first chicken bit to disable when stuff doesn’t work. “We’ll get back to it later”. No, no you won’t.
0
1
6
@divBy_zero
Eric Quinnell
22 days
@Quartr_App @Rainmaker1973 @kenshirriff @thijsvanderspil @nrossolillo @StockSavvyShay @techfund1 @ehrazahmedd @danielnewmanUV @dissectmarkets @ftr_investors I keep seeing this chart every 5-10 years and it keeps being completely wrong. Moore’s law is really just humans solving problems. Process engineers aren’t running out of ideas by any means and will keep making it better and cheaper at scale
0
0
7
@divBy_zero
Eric Quinnell
1 year
@never_released But does it set up the A20 bit? If you’re gonna jank, jank it right
0
0
5
@divBy_zero
Eric Quinnell
6 months
@ghidraninja Absolutely zero. Regardless of endian you have to wait around to calculate FEC and CRC, so you’re gonna wait for all endian anyway to see what bits went missing
0
0
5
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #6 -- Cache Replacement
Cache replacement is not only about maximizing performance with predictive residency, but about reducing high-power I/O accesses. Gold standard and baseline is the RRIP algorithm from @AamerJaleel rather than LRU.
Tweet media one
0
1
4
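A toy SRRIP-style set in Python to show the re-reference-interval idea; the actual RRIP/DRRIP work (Jaleel et al.) adds set dueling and other machinery this sketch skips.

```python
MAX_RRPV = 3    # 2-bit re-reference prediction value per line

class RRIPSet:
    def __init__(self, ways):
        self.tags = [None] * ways
        self.rrpv = [MAX_RRPV] * ways

    def access(self, tag):
        if tag in self.tags:                      # hit: predict near-immediate re-reference
            self.rrpv[self.tags.index(tag)] = 0
            return True
        while MAX_RRPV not in self.rrpv:          # miss: age lines until one looks "distant"
            self.rrpv = [v + 1 for v in self.rrpv]
        victim = self.rrpv.index(MAX_RRPV)
        self.tags[victim] = tag
        self.rrpv[victim] = MAX_RRPV - 1          # insert at "long", not "distant", interval
        return False
```

Unlike LRU, a one-shot streaming scan is inserted at a long re-reference interval and ages out before it can flush the reused working set.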
@divBy_zero
Eric Quinnell
9 months
#13: Virtual Channels (VC)
In NOCs, caches, meshes, ethernet, and interconnects, a "Physical Channel" (PC) represents a physical media (i.e. wires, optics) that transmits bits. Different classes of traffic (e.g. snoops/probes vs data transmission) sharing a PC will inevitably
Tweet media one
1
1
5
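A toy head-of-line-blocking demo for the truncated point above: pretend the data class has lost its downstream credits. With a single shared FIFO on the physical channel, a stuck data packet blocks the snoops queued behind it; with per-class virtual channels, the snoop VC keeps using the wires. Entirely illustrative.

```python
from collections import deque

shared_fifo = deque(["DATA", "SNOOP", "SNOOP", "SNOOP"])
vcs = {"DATA": deque(["DATA"]), "SNOOP": deque(["SNOOP", "SNOOP", "SNOOP"])}

def step_shared():
    if shared_fifo and shared_fifo[0] != "DATA":   # data is credit-stalled downstream
        return shared_fifo.popleft()
    return None                                    # head-of-line blocked

def step_vc():
    if vcs["SNOOP"]:                               # snoop VC independently arbitrates for the link
        return vcs["SNOOP"].popleft()
    return None

print([step_shared() for _ in range(4)])   # [None, None, None, None]
print([step_vc() for _ in range(4)])       # ['SNOOP', 'SNOOP', 'SNOOP', None]
```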
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #8 -- Skid Buffers
Data transported across deep pipelines cannot same-cycle start/stop the whole pipe on collision events. The classic solution is a credit/token controlled "Skid" Buffer. When a remote endpoint needs to stall, it has sufficient local storage to
Tweet media one
1
2
4
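A behavioral Python sketch of the credit/token idea in the lesson above (names and sizes are illustrative): the sender only launches while it holds a credit, so anything already in flight when the far end stalls is guaranteed a slot in the skid storage. Credit-return latency is ignored here; real designs size the skid for the full round trip.

```python
from collections import deque

class CreditLink:
    def __init__(self, skid_depth, pipe_latency):
        self.skid_depth = skid_depth
        self.credits = skid_depth                    # sender-side tokens
        self.pipe = deque([None] * pipe_latency)     # models the wire/pipeline delay
        self.skid = deque()                          # receiver-side skid buffer

    def cycle(self, flit=None, receiver_ready=True):
        if flit is not None and self.credits > 0:    # launch only with a credit in hand
            self.credits -= 1
            self.pipe.append(flit)
        else:
            self.pipe.append(None)
        arriving = self.pipe.popleft()
        if arriving is not None:
            self.skid.append(arriving)               # room guaranteed by the credit count
        if receiver_ready and self.skid:
            self.skid.popleft()
            self.credits += 1                        # credit return
        assert len(self.skid) <= self.skid_depth     # never overflows, even mid-stall

link = CreditLink(skid_depth=4, pipe_latency=3)
for n in range(12):
    link.cycle(flit=n, receiver_ready=(n < 5))       # receiver stalls from cycle 5 onward
```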
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #5 -- The A20 bit
In a stunning example of unintended consequences to support legacy code, IBM "hacked" in a gate mask for the 20th physical address bit when using the Intel 80286. Undoing this mistake took 30 years on CPUs alone.
0
1
4
@divBy_zero
Eric Quinnell
1 month
@lamchester This is the price of variable length decode, period. And “90% coverage from mop cache” running spec is cute and bypasses the dispatch bottleneck, but nobody runs spec sized kernels in real life. x86 HAS to pump frequency to match apple ipc, they cannot width their way there.
0
0
4
@divBy_zero
Eric Quinnell
5 months
@sirmo Here’s where I disagree. Two half-sized OoO cores at 80% peak IPC each, in the same footprint as an x86 OoO SMT core: the SMT core gets 1.3X on a tuned SMT app, which now loses to the 1.6X throughput of the two half-size cores. Half-size cores can fully power off at idle independently and not self
0
0
4
@divBy_zero
Eric Quinnell
1 month
“Write a haiku about PXE”

Grok:
PXE boots fast,
Network, server, magic dance,
Workstation awake

ChatGPT:
Boots from network’s call,
Silent images unfold—
PXE guides the way.

Note the difference in syllables for ‘PXE’
1
0
4
@divBy_zero
Eric Quinnell
4 months
@Cardyak 10-wide decode, 8x int ALUs, 4xLD, 2xST will chew through llvm, JS, and python indirect threading code like God intended. Nitpick, decode should be 12-16 to match 64B fetch max burst, and 6x FADD/FMUL is a bit overkill that’s gonna be dark silicon outside benchmarking.
1
0
4
@divBy_zero
Eric Quinnell
2 years
uarch micro-lesson #1
Branch Prediction is a power law, as observed by @djimeneth. Is code, therefore, constrained by natural law? Notes and slides available from the Berkeley lecture
Tweet media one
1
0
4
@divBy_zero
Eric Quinnell
1 year
@Michael_J_Black I disagree completely, the motivation is exactly inverted. One should PhD (or patent or publish) with a passion for *an idea* that can be realized only available along that path. Skills and expertise are granted to advance the field, not improve your vanity. Chasing
0
0
4
@divBy_zero
Eric Quinnell
4 months
printf and breakpoints are part of Heisenberg's uncertainty principle. You only think you fixed that bug bc you’re looking at it, but you didn’t.
1
0
4
@divBy_zero
Eric Quinnell
5 months
@blu51899890 @ghostway_chess @damageboy Oh ya, and that. So much collateral damage from var length, it’s hard to enumerate it all
1
0
3
@divBy_zero
Eric Quinnell
7 months
Damn kids and their pytorch. Back in my day we could object dump and see what was up
1
0
3
@divBy_zero
Eric Quinnell
6 months
@fermatslibrary Lol, did they include their own paper in the “not disruptive” list?
0
0
2
@divBy_zero
Eric Quinnell
2 years
If you can fix a bug in HW or SW, it is always fixed in HW. Not bc one is easier than another. Rather, HW already ECO’d on their laptop in the same bug meeting where SW is still enumerating all their new acronyms and toolchains
0
0
3
@divBy_zero
Eric Quinnell
2 months
@rzidane360 +1 to “stamp out fmas instead of higher freq”. To that point, only thing I’d change on the link is to say “reduce V until it barely functions” — frequency is dependent on voltage, but part of ‘C’ below. It’s V^3 then, not squared.
Tweet media one
0
0
3
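The arithmetic behind the V³ comment, with illustrative numbers: dynamic power goes as C·V²·f, and the frequency you can close timing at tracks V roughly linearly, so dropping voltage pays off cubically.

```python
def dyn_power(c_eff, volts, freq_hz):
    return c_eff * volts**2 * freq_hz       # classic C*V^2*f dynamic power

base = dyn_power(1.0, 1.00, 3.0e9)
lowv = dyn_power(1.0, 0.70, 3.0e9 * 0.70)   # assume achievable f scales with V
print(round(lowv / base, 2))                # 0.34, i.e. roughly 0.7**3
```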
@divBy_zero
Eric Quinnell
17 days
@cmuratori @MarioVerbelen @ThePrimeagen Yes, agreed, and the timing pressure. Countless conversations about how to get away from 4k pages and go 16k at least, but too much infra already in 4k
1
0
3
@divBy_zero
Eric Quinnell
2 months
@FelixCLC_ From their own spec: For embedded I concede. “Energy efficiency for all applications” is total horse 💩, especially self citing the authors’ masters thesis. 25% my ass (brp defeats this instantly)
Tweet media one
0
0
3
@divBy_zero
Eric Quinnell
1 year
uArch uLesson #7 -- Comparing Compute
Comparing FLOPS is analogous to comparing cathedrals by square footage. The system is far too complex for such a trivial metric. All that matters is your wall-power and perf curve on your actual application -- system, software, all of it.
Tweet media one
0
2
3
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff Right, agreed, but you get the voltage loss on the pull up with nmos only. With sense amps it does not matter, the output of interest is the speed of detection rather than saturation level.
1
0
3
@divBy_zero
Eric Quinnell
11 months
Grok is passing first tier verilog hardware interview questions
Tweet media one
0
0
3
@divBy_zero
Eric Quinnell
1 year
@itsclivetime An astute observation and the answer is “sort of”. You would end up overloading other wires to achieve the same per element, much like the ucode in cisc instructions. I.e. you’re now using the mux-down wires from the register/sram entries you temporarily store with to iterate
1
0
3
@divBy_zero
Eric Quinnell
1 year
@jbhuang0604 Disagree, ppl need to break this model of indentured servitude. They’re being paid, while you are not. You create the product which they get honor for, not the other way around. Nobody cares whom you studied under; they’re not the demi-gods you (and they) currently think they
0
0
3
@divBy_zero
Eric Quinnell
1 year
FLOPs/Watt is ruining the HW world BW/Watt is better. Inferences or activations or trained weights/Watt is more better. What is most better?
0
0
3
@divBy_zero
Eric Quinnell
10 months
AI is the most interesting field for computer architects right now. 5-10% avg utilization means it's completely and fundamentally incorrect. What's the right architecture? We don't know yet, but it's not what we have now.
0
0
3
@divBy_zero
Eric Quinnell
2 years
AI rate of advancement is due to arXiv. Patents are a write-only memory with a 2 year publish latency. Peer-review conferences have become the curation of power structures. The free flow of ideas is the true singularity
0
0
3
@divBy_zero
Eric Quinnell
24 days
Tweet media one
Tweet media two
Tweet media three
1
0
4
@divBy_zero
Eric Quinnell
1 month
This is profound for 2 reasons:
1) Exact token inputs resulting in two different subtle pronunciations of non-technical terms is mind blowing. There’s weights in there making this differentiation.
2) I probably should take a break from booting stuff in the datacenter
1
0
3
@divBy_zero
Eric Quinnell
1 year
@O42nl Chipotle on Camino Real. Bc out of staters (like me) recognize the names of the ingredients and they don’t put “grass jelly” or other weird crap in your tea
0
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff Not officially, yes. Note ARM’s Cortex-X4 is 10-way superscalar, as is I think the latest apple core. The real trick to this is no variable length instructions
1
0
2
@divBy_zero
Eric Quinnell
7 months
@rzidane360 Great talk, reposting. Friendly reminder that your meat-NN is mostly I/O and interconnect, not cortex.
0
0
2
@divBy_zero
Eric Quinnell
27 days
@dannysaigal @SawyerMerritt @WholeMarsBlog @Tesla Lol, or we didn’t close the door all the way when taking the picture
0
0
2
@divBy_zero
Eric Quinnell
4 months
@2sush If a candidate drew this on the whiteboard in an interview I would cancel the remaining panel and hire them immediately
0
0
1
@divBy_zero
Eric Quinnell
3 months
@yacineMTB I believe that’s covered in 802.3fq, be sure to adhere to all 1000 pages of spec
0
0
2
@divBy_zero
Eric Quinnell
16 days
@YunTaTsai1 @francoisfleuret They just don’t get the objectivity of it all
1
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff All I see in industry is 6T bc cmos digital is extremely forgiving to process deviations, especially since FinFETs. But that’s anecdotal, perhaps some players still 4T?
1
0
2
@divBy_zero
Eric Quinnell
17 days
@vvedavyas @Anno0770 @Cardyak Persistence pays off. Sent pdf
1
0
2
@divBy_zero
Eric Quinnell
3 months
Lossless fabric is the worst
0
0
2
@divBy_zero
Eric Quinnell
1 month
@itsclivetime @united PS — please don’t type your predefined answers. Send us a letter
0
0
2
@divBy_zero
Eric Quinnell
5 months
@never_released @blu51899890 @ghostway_chess @damageboy Agreed, the 2B/4B extension is self harming for perf cores. As long as it remains optional good uarch can avoid the pitfall. I also think RV is in dangerous territory with this one. Yes, arm learned this the hard way with thumb and apple ripped it out even sooner. BTB pollution
2
0
2
@divBy_zero
Eric Quinnell
2 years
uArch uLesson #2 -- Hot Ones
Diagram of hot ones comes from IBM's CELL FPU paper. Radix-4 Booth encoders are still in vogue due to easy power-of-two shifting ×(-2, -1, 0, 1, 2), with the 2's complement +1 added in as "hot ones" in later partial products
Tweet media one
0
0
2
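A small sketch of radix-4 Booth recoding for an 8-bit two's-complement multiplier: digits in {-2, -1, 0, 1, 2} halve the partial-product count, and each negative digit is where a real array folds in the two's-complement +1 as a hot one in a later row (here Python's signed arithmetic hides that detail).

```python
RECODE = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
          0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(m, bits=8):
    padded = (m & ((1 << bits) - 1)) << 1        # append the implicit 0 below bit 0
    return [RECODE[(padded >> i) & 0b111]        # overlapping 3-bit windows, stepping by 2
            for i in range(0, bits, 2)]          # digit i carries weight 4**i

def booth_multiply(a, m, bits=8):
    digits = booth_radix4_digits(m, bits)
    return sum(d * (4 ** i) * a for i, d in enumerate(digits))

assert booth_multiply(13, 11) == 13 * 11
assert booth_radix4_digits(11) == [-1, -1, 1, 0]   # 11 = -1 - 4 + 16
```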
@divBy_zero
Eric Quinnell
5 months
@FelixCLC_ 100Gbps of network connectivity?! Slow down there, cowboy. So like…one NIC?
0
0
2
@divBy_zero
Eric Quinnell
1 year
@_akhaliq Trade them in for some Dojos
0
0
2
@divBy_zero
Eric Quinnell
2 months
New idea — we can prevent the AI robot apocalypse by installing VMs as their default OS!
0
0
3
@divBy_zero
Eric Quinnell
27 days
@entropicEm During the talk I pointed out the all reduce is a mere characterization of the network in isolation. You are correct, our SOW all reduce on wafer is insanely fast. Combined with TTPoE and pipelining, our full all reduce in system is ridiculously good
0
0
2
@divBy_zero
Eric Quinnell
1 year
Alternative uArchs are prolific in literature and industry, all finding ways to avoid the tyranny of the addend alignment. But sometimes the original uArch remains the best choice, especially when needing minimum area density and choosing throughput over latency
Tweet media one
0
0
2
@divBy_zero
Eric Quinnell
4 months
@damageboy @TheKanter @jordanschnyc @dylan522p @_fabknowledge_ @asianometry To be fair, SVE is ARM’s ISA stupid tax. Addressing modes is RV’s. So they all have one. Var length punishes most
2
0
2
@divBy_zero
Eric Quinnell
9 months
@rich_coupe @kenshirriff The sense amp is usually at the end of the precharged shared bit columns of rows of 6Ts to accelerate the readout. The 4T shared is a false 4T, any “resistors” on modern nodes are most commonly themselves transistors as well
1
0
2
@divBy_zero
Eric Quinnell
2 years
uArch uLesson #3 -- Floating-Point Rounding
IEEE-754 has four rounding requirements, but how to implement them? By special bits around the LSB or ULP, combined with the sign bit: the Guard, Round, Sticky, and Carry bits
Tweet media one
1
0
2
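A sketch of round-to-nearest-even with guard and sticky bits on an integer significand that has to drop its low `shift` bits; the separate round bit and the carry-out/renormalize path of real hardware are folded away for brevity.

```python
def round_rne(sig, shift):
    assert shift >= 1
    kept = sig >> shift
    guard = (sig >> (shift - 1)) & 1                        # first dropped bit
    sticky = 1 if (sig & ((1 << (shift - 1)) - 1)) else 0   # OR of everything below it
    if guard and (sticky or (kept & 1)):                    # more than half, or a tie with odd LSB
        kept += 1
    return kept

assert round_rne(0b10110, 2) == 0b110   # 5.5 is a tie, round to even -> 6
assert round_rne(0b10010, 2) == 0b100   # 4.5 is a tie, round to even -> 4
assert round_rne(0b10011, 2) == 0b101   # 4.75 rounds up -> 5
```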
@divBy_zero
Eric Quinnell
2 months
@rzidane360 @FelixCLC_ “It’s only 2 gate delays!”
0
0
2
@divBy_zero
Eric Quinnell
22 days
@CamelCdr @Cardyak Fair enough, just pointing out it is definitely a different direction. Btw your points make great discussion and I was initially replying to lots — but will give space to all the other techies who clearly showed up. I had my turn :)
1
0
2
@divBy_zero
Eric Quinnell
1 year
@bernhardsson Bc @FFmpeg community just gave up on codecs that get more than 1 IPC
0
0
2
@divBy_zero
Eric Quinnell
2 years
A picture is worth 1,000 words. A video is 60 pictures per second. Does that make visual AI worth 60,000 ChatGPTs?
0
0
2
@divBy_zero
Eric Quinnell
2 years
Every twitter python code sharing example:
import theThing
theThing.doTheThing()
The insight is just incredible.
@clcoding
Python Coding
2 years
Day 143 : Python code for creating a joy plot
Tweet media one
4
88
500
0
0
2
@divBy_zero
Eric Quinnell
8 months
@radshaan “Learning GNU Emacs” and “Mastering Regular Expressions” are not books one internalizes. This collection is mostly classroom books or reference manuals, not stuff you sit with a glass of wine by the fire
0
0
2
@divBy_zero
Eric Quinnell
16 days
@davorVDR So @tenstorrent voted exponent off the island then for FP2! Ok, so when’s FP1 :) Choosing max exponent and normalizing the mantissa is clever, you capture the range +/- the mantissa width and some precision. Is there an equal/opposite block float where mantissa is shared and
1
0
2
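A minimal block-float sketch of the shared-max-exponent scheme being discussed: one exponent for the block, small signed mantissas per element. Rounding overflow, all-zero blocks, and subnormals are ignored; the parameters are illustrative.

```python
import math

def block_quantize(xs, mant_bits=4):
    shared_exp = max(math.frexp(abs(x))[1] for x in xs if x != 0.0)   # max exponent in block
    scale = 2.0 ** (shared_exp - mant_bits)
    return shared_exp, [int(round(x / scale)) for x in xs]            # small signed integers

def block_dequantize(shared_exp, mants, mant_bits=4):
    scale = 2.0 ** (shared_exp - mant_bits)
    return [m * scale for m in mants]

exp, m = block_quantize([0.9, 0.05, -0.3, 0.002])
print(exp, m)                      # 0 [14, 1, -5, 0]
print(block_dequantize(exp, m))    # small values lose precision to the big one
```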
@divBy_zero
Eric Quinnell
9 months
@itsclivetime Agreed. This is closer to the SRT algorithm than the Newton-Raphson method (multiplier based). This will be much smaller in LUTs at the expense of latency. You could modify the above to a higher radix (try radix 16) and make it even faster with few extra gates.
0
0
1
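For contrast with a multiplier-based Newton-Raphson divider, here is a minimal digit-recurrence (restoring) divider, the family SRT belongs to: a compare/subtract per step and one quotient bit per iteration, so it is tiny in logic but pays in latency. Real SRT uses a higher radix with redundant quotient digits and a small lookup table.

```python
def restoring_divide(n, d, bits=16):
    q, r = 0, 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # bring down the next dividend bit
        if r >= d:
            r -= d
            q |= 1 << i                 # quotient digit 1, else leave 0
    return q, r

assert restoring_divide(1000, 7) == (142, 6)
```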
@divBy_zero
Eric Quinnell
5 months
@HaroldAptroot @st01014 Avg cpu freq 4.7GHz. Here’s the blind spot. Run it at 2GHz bc that’s sustainable beyond a benchmark
1
0
1
@divBy_zero
Eric Quinnell
3 months
@YunTaTsai1 All the ones that are 3 years or older
0
0
1
@divBy_zero
Eric Quinnell
2 years
Auto-correlation shows how relevant each previous branch in the kernel history is to predicting the current branch
Tweet media one
1
0
1
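A toy illustration of the autocorrelation idea above, on a made-up branch stream: a loop-exit branch with period 5 plus a little noise. The correlation spikes at lag 5, which is exactly the history position a predictor would want to key on.

```python
import random

random.seed(0)
outcomes = []
for i in range(4096):
    taken = (i % 5 != 0)                 # loop branch: taken 4 of every 5 iterations
    if random.random() < 0.05:
        taken = not taken                # sprinkle in some noise
    outcomes.append(1 if taken else -1)

def autocorr(xs, lag):
    mean = sum(xs) / len(xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(len(xs) - lag))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

for lag in range(1, 8):
    print(lag, round(autocorr(outcomes, lag), 3))   # lag 5 stands out: the loop period
```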
@divBy_zero
Eric Quinnell
2 years
Round to Nearest Even example 2, the corner case
Tweet media one
0
0
1
@divBy_zero
Eric Quinnell
6 months
@glennklockwood FP64 appears in all workloads where you’re in a rush with no time to weed out why someone put it in. Academic papers on quad precision datapaths rely on quad precision
0
0
1
@divBy_zero
Eric Quinnell
8 months
@rpoo Urgency is directly proportional to stress. And directly proportional to results. Knowingly choose that path and all it comes with
0
0
1
@divBy_zero
Eric Quinnell
22 days
@bjorntopel Yes it was recorded, but for Berkeley only. I’d rather not have grad students’ questions or arguments vetted by public forums. My arguments? Go for it, internet
0
0
1
@divBy_zero
Eric Quinnell
5 months
@Prakhar6200 Noted, as well as your citations of no SMT and isa trying to move to fixed width and removing some old junk (A20 bit etc). It’s moving in the right direction after useless adventures into memory transactions and AVX-width-so-big-did-it-throttle-oops wastes of time. But many
1
0
1
@divBy_zero
Eric Quinnell
2 years
@itsclivetime It really likes LeBron
Tweet media one
0
0
1
@divBy_zero
Eric Quinnell
2 months
@RyanEls4 This post may not get a reliable return
0
0
1
@divBy_zero
Eric Quinnell
11 days
@essobi @FFmpeg @NVIDIAHPCDev You had me at free bsd tattoo. But ya, don’t ever stop optimizing for coalesced datapaths. Bc that would be really dumb, and you don’t look dumb
0
0
1
@divBy_zero
Eric Quinnell
8 months
A colleague mentioned to me it is bc losses are differentiable and can be minimized, as opposed to gains that can go to infinity and not be differentiated
0
0
1