Abhi Venigalla

@ml_hardware

5,600
Followers
1,315
Following
75
Media
913
Statuses

Researcher @Databricks. Former @MosaicML, @CerebrasSystems. Addicted to all things compute.

San Francisco, CA
Joined October 2018
Pinned Tweet
@ml_hardware
Abhi Venigalla
6 months
We built a new model! 🧱 It's called DBRX 🧱
* mixture of experts
* 16 choose 4 experts
* 36B active, 132B total
* trained on 12T tokens
* built e2e in 2 months
* using 3072xH100
* served up to 150 tok/s on @Databricks
* open weights :)
10
41
375
@ml_hardware
Abhi Venigalla
1 year
Ready for GPU independence weekend? PyTorch 2.0 and LLM Foundry now work out of the box on **AMD GPUs!** We profiled MPT 1B-13B models on AMD MI250 and saw perf within 80% of A100-40GB, which could go up to 94% with better software. It. Just. Works.
22
212
1K
@ml_hardware
Abhi Venigalla
1 year
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
Tweet media one
9
46
421
@ml_hardware
Abhi Venigalla
11 months
Back in June, we at @MosaicML showed that our LLM Foundry training stack runs seamlessly on @AMD MI250 GPUs. Today, I'm happy to share that we've scaled up to 128xMI250, with great multi-node performance!
Tweet media one
14
61
392
@ml_hardware
Abhi Venigalla
6 months
This is literally my new LK-99 🙏🙏🙏
@aaron_defazio
Aaron Defazio
6 months
Update: more experimental results rolling in. Here it is against SGD with both the step-wise and cosine schedules (both baselines heavily tuned, no cheating). This is something special indeed!
Tweet media one
48
62
626
6
27
323
@ml_hardware
Abhi Venigalla
2 years
We're coming for all the models! This week our Vision team profiled Stable Diffusion on @MosaicML Cloud and found that training from scratch costs <$160k, and can be done in under 2 weeks.
12
35
244
@ml_hardware
Abhi Venigalla
2 years
@karpathy The @MosaicML perf team just tried this out and... totally confirmed 🤯 GPT-1.3B MFU went from 49% -> 53%
Tweet media one
5
10
196
@ml_hardware
Abhi Venigalla
6 months
If you have Apple silicon and >70GB of RAM, you can run DBRX on your laptop!! Kudos to @awnihannun :)
7
20
185
@ml_hardware
Abhi Venigalla
1 year
Our Vision team is insane. The original Stable Diffusion reportedly cost $600k... and now we've reproduced it for $50k🤯 and it took <1 week to train! All the training code is open-source! And we make it super fast + easy to customize on your own private data @MosaicML
@jefrankle
Jonathan Frankle
1 year
And now it's <$50k. 🖼️ Announcing @MosaicML's diffusion offering 📷 We replicated Stable Diffusion 2.0, training from scratch with a huge speedup, and we can do it on your data too. Human eval showed the model to be indistinguishable from the original. Blog:
8
29
283
2
15
164
@ml_hardware
Abhi Venigalla
6 months
@francoisfleuret The 30x is real and comes from this technical brief, page 15: How is 30x possible given that GB200 has only a ~2.3x increase in memBW and FLOP/s over H100? It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
8
12
156
@ml_hardware
Abhi Venigalla
1 year
@julien_c Why is the training so slow? Your screenshot shows 25% MFU. Our users on MosaicML get 40%+ for the same workload on H100s. Screenshot MFU = 6 * 30e9 * 600e9 / 500 / 10 / 3600 / 24 / 1e15 = 0.25. Time to train on HF: 10 days. Time to train on MosaicML: **6.25 days**
7
12
139
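A back-of-the-envelope version of that MFU arithmetic, spelled out (a minimal sketch; reading the constants as 30B params, 600B tokens, 500 H100s, 10 days, and ~1e15 peak FLOP/s per H100 is my interpretation of the tweet, not an official breakdown):

```python
# Rough model-FLOPs-utilization (MFU) estimate using the tweet's numbers.
params = 30e9                 # model size
tokens = 600e9                # training tokens
n_gpus = 500                  # assumed H100 count
days = 10                     # wall-clock time from the screenshot
peak_flops_per_gpu = 1e15     # ~1000 TFLOP/s BF16 per H100

total_train_flops = 6 * params * tokens                      # fwd + bwd ≈ 6·N·D
available_flops = n_gpus * days * 24 * 3600 * peak_flops_per_gpu
print(f"MFU ≈ {total_train_flops / available_flops:.2f}")    # ≈ 0.25
```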
@ml_hardware
Abhi Venigalla
10 months
i love you all = ilya
@sama
Sam Altman
10 months
i love you all. today was a weird experience in many ways. but one unexpected one is that it has been sorta like reading your own eulogy while you’re still alive. the outpouring of love is awesome. one takeaway: go tell your friends how great you think they are.
4K
6K
90K
4
4
131
@ml_hardware
Abhi Venigalla
2 years
Been hinting at this blog for a while and it's finally here! The Streaming team at @MosaicML has built an open-source library (`mosaicml-streaming`) for efficiently loading training data from object stores like S3, GCS, OCI, and more.
3
22
130
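For readers who haven't seen the library, a minimal sketch of how it's typically used (assumes the data has already been written to the library's shard format; the bucket path and batch size are placeholders):

```python
# pip install mosaicml-streaming
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # object store holding the pre-written shards
    local="/tmp/streaming-cache",        # local cache for downloaded shards
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=8)

for batch in loader:
    ...  # training step
```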
@ml_hardware
Abhi Venigalla
1 year
One fun tidbit -- yes, with PyTorch you still run `torch.cuda` on AMD systems, and yes, it does work 😆
Tweet media one
2
6
98
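Concretely, on a ROCm build of PyTorch the usual CUDA-flavored calls simply target the AMD GPU (a tiny sketch; nothing AMD-specific appears in the code):

```python
import torch

print(torch.cuda.is_available())            # True on an MI250 box with ROCm PyTorch
print(torch.cuda.get_device_name(0))        # reports the AMD device
x = torch.randn(1024, 1024, device="cuda")  # "cuda" still means "the GPU"
y = x @ x
```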
@ml_hardware
Abhi Venigalla
1 year
@mustafasuleymn For those curious about the benchmark: it's not quite the full GPT-3. The MLPerf LLM benchmark gives you a checkpoint to start from and then you have to hit a target loss. In practice, it looks like the 11min run was over **1.2B** tokens, not the full 300B tokens:
0
14
93
@ml_hardware
Abhi Venigalla
1 year
Alright who wants to try and guess the compute/cost/params for PaLM2-L? No prizes (b/c obv I don't know) but with enough responses we might get a reasonable estimate (which is reward enough 😝) I'll start:
* 6e24 FLOPs
* $22M
* 250B params
paper:
14
8
79
@ml_hardware
Abhi Venigalla
1 year
What about perf? We only had 1 node of 4xMI250, so to compare with our 8xA100 systems we measured per-GPU metrics. With no code changes, perf on MI250 looks really strong! About 80% of A100-40GB. Better FlashAttention for AMD may close the gap (we predict ~94% of A100-40GB)
Tweet media one
2
11
72
@ml_hardware
Abhi Venigalla
1 year
This is all made possible thanks to a software and hardware stack that AMD has been building for years, and is now bearing fruit. Seeing MI250 work so well today brings hope that the MI300x will too when it arrives!
Tweet media one
1
6
69
@ml_hardware
Abhi Venigalla
5 months
@AravSrinivas I don't think there's an improvement in compression, it's just more flops... Like when you plot training flops vs. benchmark accuracy, Llama-3-8b-base is much worse than it could have been with the same compute (compare to llama-2-70b-base, which is iso-flop) because the
4
8
65
@ml_hardware
Abhi Venigalla
1 year
Here's MPT-1B training for 1B tokens on NVIDIA A100 (green) vs. AMD MI250 (red). Can you spot a difference? Both runs use the exact same LLM Foundry code:
Tweet media one
Tweet media two
1
3
64
@ml_hardware
Abhi Venigalla
1 year
Keep an eye on LLM Foundry where we will add pre-built Docker images with ROCm FlashAttention to make the AMD setup process even faster. We'll also be profiling MPT on larger MI250 clusters soon! Lastly, @LisaSu any chance we can get early access to MI300x? 🙏
0
3
59
@ml_hardware
Abhi Venigalla
1 year
Mistral-7B looks pretty fire 🔥 but how many tokens was it trained on? Let's crowdsource some estimates! I'll bet ~7T tokens, since it's beating LLaMa2-13B (2T tokens) and it's a 7B model, which should be harder to scale. Also tokens = 1000x params is a nice round number :)
8
2
57
@ml_hardware
Abhi Venigalla
1 year
soooooo close. Turns out extrapolating eval scores kinda works!
@ml_hardware
Abhi Venigalla
1 year
Alright who wants to try and guess the compute/cost/params for PaLM2-L? No prizes (b/c obv I don't know) but with enough responses we might get a reasonable estimate (which is reward enough 😝) I'll start:
* 6e24 FLOPs
* $22M
* 250B params
paper:
14
8
79
4
3
55
@ml_hardware
Abhi Venigalla
1 year
If we zoom in on the first 100 batches, we get nearly overlapping loss curves. This is crazy given that the runs are on two totally different hardware stacks! StreamingDataset and Composer do a lot of heavy lifting for determinism in the dataloader and train loop.
Tweet media one
1
2
53
@ml_hardware
Abhi Venigalla
1 year
Tired of expensive data egress fees? Check out how @Cloudflare R2 object storage + the @MosaicML platform let you stream data + checkpoints with **zero egress cost** and train models anywhere.
@CloudflareDev
Cloudflare Developers
1 year
Today, we’re excited to show how @MosaicML tools and Cloudflare R2 can be used together to orchestrate training of LLMs across multiple clouds with zero switching costs.
0
4
44
2
12
49
@ml_hardware
Abhi Venigalla
2 years
Excited to kick off our LLM blog series here at @MosaicML! Check out how easy and fast it is to train GPT3-1.3B, and stay tuned as we scale up to bigger and better models :D
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Large Language Models (LLMs) are gaining in popularity, but training these models from scratch can be a huge pain... until now! Our latest LLM blog series uncovers how to reduce the time, cost, and complexity of training these billion-parameter models:
1
40
251
0
7
51
@ml_hardware
Abhi Venigalla
6 months
Tweet media one
2
0
51
@ml_hardware
Abhi Venigalla
2 years
Check out what our collaborators at @StanfordHAI built using @MosaicML Cloud! We took a regular GPT-2.7B, pretrained it from scratch on PubMed data, and hit SOTA on MedQA-USMLE. Took <1 week and cost <$40k to train. If you want to build your own custom models, talk to us!
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam developed by MosaicML and @StanfordHAI . It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models🧵
12
132
518
1
8
46
@ml_hardware
Abhi Venigalla
7 months
What is going on with arc-challenge evals? Lots of great new models report scores in the high 80s-90s in their blogs. But then OSS eval frameworks like the @AiEleuther harness and the @MosaicML gauntlet seem to report lower scores... Clearest example is Mixtral-8x7B: * blog post: 0.858
1
6
51
@ml_hardware
Abhi Venigalla
1 year
Overall, I'm super optimistic about the future for AI hardware. More options means more compute supply, more market pressure on prices, and lower costs for users. If your hardware supports PyTorch 2.0 too ( @HabanaLabs ???) reach out to us and we would love to showcase it!
1
2
48
@ml_hardware
Abhi Venigalla
1 year
@AudioBooksRU At 50% MFU, this is about 545k A100-days. So 10k A100s running for ~2mo, but ofc this is Google so it was probably done on TPU pods
1
1
46
@ml_hardware
Abhi Venigalla
11 months
One ongoing story I'm really excited about is the Triton compiler, which AMD has been investing a lot into. The end result: you can write 1 Triton kernel, and run it at high perf on NVIDIA or AMD GPUs! Here's the current (fwd) perf of a Triton FA-2 kernel on A100 vs. MI250:
Tweet media one
2
7
47
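The "write one kernel, run on either vendor" point is easiest to see with a toy kernel rather than the FlashAttention-2 kernel in the plot; here is a minimal sketch (the standard Triton vector add) that runs unchanged on CUDA or ROCm builds of PyTorch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Same code path whether "cuda" is an NVIDIA GPU or an AMD GPU under ROCm.
x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```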
@ml_hardware
Abhi Venigalla
1 year
no taxation without representation 🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸 no computation without competition
3
4
45
@ml_hardware
Abhi Venigalla
1 year
@sharifshameem In the olden days they trained 125M models for 300B tokens (2400x). This was probably too much haha... The Chinchilla recommendation is ~20x. LLaMa2-7B is 7B params for 2T tokens (286x), and doesn't look saturated! But somewhere between 286x and 2400x it's probably enough :P
Tweet media one
Tweet media two
1
1
43
@ml_hardware
Abhi Venigalla
10 months
Honestly insane that you can get this close <3
Tweet media one
2
0
43
@ml_hardware
Abhi Venigalla
2 years
Our LLM stack got a big upgrade this week thanks to @DohmannJeremy 🎉 In-Context-Learning eval is now super fast and scalable. My favorite part -- it's so fast, we can measure metrics like LAMBADA/PIQA *live* while training, no need to checkpoint/use a separate eval harness.
Tweet media one
1
1
38
@ml_hardware
Abhi Venigalla
10 months
iykyk
Tweet media one
5
0
38
@ml_hardware
Abhi Venigalla
1 year
@andrew_n_carr See below for details: basically it is GPT3-175B trained for **1.2B** tokens, not the full 300B tokens. Would be ~2 days for the full run, still super impressive!
@ml_hardware
Abhi Venigalla
1 year
@mustafasuleymn For those curious about the benchmark: it's not quite the full GPT-3. The MLPerf LLM benchmark gives you a checkpoint to start from and then you have to hit a target loss. In practice, it looks like the 11min run was over **1.2B** tokens, not the full 300B tokens:
0
14
93
1
2
36
@ml_hardware
Abhi Venigalla
11 months
@ylecun @xlr8harder @geoffreyhinton 100x is roughly 1e27 FLOPs. An H100 can do 500e12 FLOP/s (at a conservative 0.25 MFU). 1e27 / 500e12 / 3600 / 24 = 23M H100-days, aka 200k H100s for ~4 months. This will be done within 18mo
4
2
37
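The same arithmetic as a short script, for anyone who wants to tweak the assumptions:

```python
total_flops = 1e27               # the "100x" training run
flops_per_gpu = 500e12           # H100 at a conservative ~0.25 MFU

seconds = total_flops / flops_per_gpu
h100_days = seconds / 3600 / 24
print(f"{h100_days / 1e6:.0f}M H100-days")               # ~23M
print(f"{h100_days / 200_000:.0f} days on 200k H100s")   # ~116 days ≈ 4 months
```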
@ml_hardware
Abhi Venigalla
2 years
@robertnishihara @Meta Sorry but I think the throughput comparison is incorrect. Meta is reporting 147 **Model** TFLOPs/GPU and you are reporting 179 **Hardware** TFLOPs/GPU. In your blog, Ray+JAX+Alpa only achieves 102.4 **Model** TFLOPs/GPU. Which is ~30% slower than Meta + PyTorch + FSDP.
4
1
36
@ml_hardware
Abhi Venigalla
11 months
Thanks to the software+hardware stack that AMD has been building, we didn't have to make any code changes -- we just ran LLM Foundry with the ROCm 5.7 + PyTorch docker image, and everything just works.
Tweet media one
1
2
36
@ml_hardware
Abhi Venigalla
4 months
@cHHillee @finbarrtimbers +++ 1) is the biggest issue. Training transformers = big dense matmuls, and GPUs are already close to the limit for big dense matmuls. There are no 10x's on the table. Not even 2x's imo. If you want a training 10x, you need to *change the workload* away from dense matmuls
3
1
35
@ml_hardware
Abhi Venigalla
1 year
Alright folks you know what time it is 🎉 ML FASHION WEEK 🎉 Day 1: Hopper Cowboys, vintage 2023. Like New. Attainable only through blood sacrifice to H100s
Tweet media one
1
3
35
@ml_hardware
Abhi Venigalla
6 months
@abacaj OTOH, if your desire is to use totally-local ~10B models, then yes the quality is looking pretty saturated to me. I've said it before but when open GPT-4 class models come out later this year, I expect them to be in the 300B-500B param range. At least. Nobody's ready for it...
2
3
33
@ml_hardware
Abhi Venigalla
1 year
Will be at ICML next week Mon-Sun! ✈️ DM me if you want to catch up and talk about LLMs, AI hardware, or efficient training/inference!
0
6
34
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK🎉 Day 2: Scaling Laws, circa 2023. You can buy this one online! Kudos @antimatter15 for the design, and @GoogleDeepMind for the math
Tweet media one
@ml_hardware
Abhi Venigalla
1 year
Alright folks you know what time it is 🎉 ML FASHION WEEK 🎉 Day 1: Hopper Cowboys, vintage 2023. Like New. Attainable only through blood sacrifice to H100s
Tweet media one
1
3
35
1
1
34
@ml_hardware
Abhi Venigalla
2 years
I’m in New Orleans for #NeurIPS this week! Hmu if you want to eat oysters and talk about LLMs 😊
3
3
32
@ml_hardware
Abhi Venigalla
3 years
Excited to share what we do @MosaicML ! We build and compose methods to train better ML models faster. We are very focused on efficiency as a developer sees it: [Quality] / [$, Time] (also: 10mo ago I read a tweet, DM'd a stranger, joined a startup, and I couldn't be happier :)
@NaveenGRao
Naveen Rao
3 years
10 months ago I tweeted that we were getting a new project off the ground…today I’m proud to announce with @hanlintang @jefrankle and @mcarbin that MosaicML comes out of stealth! We are focused on the compute efficiency of neural network training using algorithmic methods.👇
26
19
160
1
4
31
@ml_hardware
Abhi Venigalla
1 year
@jefrankle @togethercompute Made possible thanks to our lightning fast eval framework :)
2
2
30
@ml_hardware
Abhi Venigalla
1 year
"LLMs can't do long context length" 👀👀👀
@NaveenGRao
Naveen Rao
1 year
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇 Model dropping soon to an open-source repo near you. Epilogue: It seemed to me that Gatsby
41
89
675
1
5
30
@ml_hardware
Abhi Venigalla
1 year
@ryan_landay @realGeorgeHotz I think George has been working on consumer AMD GPUs, this effort was totally separate. Basically we found that PyTorch worked out of the box on MI250. Took <1 hr from getting the hardware to training MPT.
1
0
30
@ml_hardware
Abhi Venigalla
1 year
@NCSilberman @soumithchintala @MosaicML @PyTorch We plan to do an MI300x vs. H100 comparison when the time is right. We wanted to compare "same-generation" for now so MI250 vs. A100. We also separately did some H100 profiling in April:
0
3
29
@ml_hardware
Abhi Venigalla
1 year
@realGeorgeHotz any chance we could try this out on a tinybox? If PyTorch 2.0 + ROCm 5.4 installs then I think there's a good chance!
3
2
26
@ml_hardware
Abhi Venigalla
1 year
@bernhardsson Yes but the 2nd-order effect is what @cHHillee wrote excellently here: TL;DR
* yes, you could fit large models on one node
* but training would take centuries
* so you get a cluster of GPUs to compress time
* but you can't scale BS forever without hurting
@cHHillee
Horace He
1 year
@AravSrinivas @_mohansolo @bernhardsson Technically, the dependency actually goes the other way. The more tokens you have, the worse interconnect bandwidth you can get away with. In practice, this statement is true: "The larger/better the model you want to train is, the better your interconnect needs to be." But the
2
2
31
1
0
28
@ml_hardware
Abhi Venigalla
1 year
@bernhardsson I don't think it's much related to model size, as others have commented. It's just that as you train bigger models, you start scaling past a single node, so you *notice* the BW reqs. To train a small GPT-125M efficiently multi-node data parallel, you want a device BS ~8k tok/A100 and
2
1
27
@ml_hardware
Abhi Venigalla
2 years
@karpathy @MosaicML We had one of these moments a few days ago re. microbatch size… most trainers require MBS to be a divisor of global batch size, so if 8 OOMs you have to go down to 4… …but Composer doesn’t have any limitation. So we tried MBS=7 and it works fine and gets you +2% MFU 🤣
1
0
27
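Composer handles this internally; a hand-rolled sketch of the same idea (uneven microbatches in gradient accumulation, with hypothetical names) shows why the microbatch size need not divide the global batch size:

```python
import torch

def accumulate_step(model, loss_fn, optimizer, batch_x, batch_y, microbatch_size=7):
    """Gradient accumulation over microbatches that need not divide the batch size.

    A global batch of 32 with microbatch_size=7 is split into chunks of 7,7,7,7,4;
    weighting each chunk's (mean) loss by its share of the batch keeps the final
    gradient identical to the full-batch gradient.
    """
    optimizer.zero_grad()
    n = batch_x.shape[0]
    for x, y in zip(batch_x.split(microbatch_size), batch_y.split(microbatch_size)):
        loss = loss_fn(model(x), y) * (x.shape[0] / n)
        loss.backward()            # grads accumulate across microbatches
    optimizer.step()
```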
@ml_hardware
Abhi Venigalla
11 months
Overall, I could not be more excited for @AMD 's future in AI hardware. Looking ahead to what we know publicly about MI300X... there might be a new king in town soon! 👑 Follow us @MosaicML and @databricks for more news, and ask your CSPs about MI300X!
Tweet media one
1
4
27
@ml_hardware
Abhi Venigalla
1 year
Holy moly, 9 supercomputers with 64xCS-2 each!!!
@CerebrasSystems
Cerebras
1 year
📣 Today we are announcing Condor Galaxy-1: a 4 exaflop AI supercomputer built in partnership with @G42ai . Powered by 64 Cerebras CS-2 systems, 54M cores, and 82TB of memory – it's the largest AI supercomputer we've ever built. But that's not all: CG-1 is just the start..
Tweet media one
18
85
366
1
0
27
@ml_hardware
Abhi Venigalla
1 year
@HamelHusain Just want to caveat -- device_map puts different layers on different GPUs to spread the memory, but *it does not split the work*, aka the model forward is not sped up at all. The right solution is tensor parallelism, but that's more manual to implement. Torch2 will help solve this
Tweet media one
1
0
26
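For context on the caveat, a sketch of the Hugging Face `device_map="auto"` pattern being discussed (the model name is just an example; requires `accelerate` installed): layers are spread across GPUs to fit in memory, but a forward pass still visits them one GPU at a time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" shards *memory* across visible GPUs (layers 0..k on GPU 0,
# the rest on GPU 1, ...), but each forward pass runs those layers sequentially,
# so latency is not reduced the way tensor parallelism would reduce it.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

inputs = tokenizer("Hello", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```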
@ml_hardware
Abhi Venigalla
6 months
@abacaj imo it's simultaneously:
* building great models takes lots of data and compute and tuning and crazy infra. It IS getting harder
* BUT rather than keep all this infra to ourselves, we are making it a *product*. It's the foundry, not the model
So anyone can come to Databricks
1
5
26
@ml_hardware
Abhi Venigalla
1 year
@LigengZhu @NaveenGRao We'll share more details later this week, but basically, with the right framework and arch it just kinda works. FSDP, activation checkpointing and microbatch_size=1. We were profiling toy models like this one back in October. Surprised no one else trained a full one yet :)
3
2
25
@ml_hardware
Abhi Venigalla
3 years
Lots of useful tips in here for efficient CNN training!
* mixed precision
* NHWC
* stepwise learning rates
* balancing CPU/GPU data aug
* Pillow-SIMD >> Pillow
@DbrxMosaicAI
Databricks Mosaic Research
3 years
New blog post! Take a look at some best practices for efficient CNN training, and find out how you can apply them easily with our Composer library: #EfficientML
1
18
80
0
7
24
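Two of the tips from the tweet above (mixed precision and NHWC / channels-last) in plain PyTorch, as a minimal sketch; the remaining tips are data-pipeline and schedule choices:

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)  # NHWC
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
labels = torch.randint(0, 1000, (32,), device="cuda")

with torch.autocast("cuda", dtype=torch.float16):        # mixed precision
    loss = torch.nn.functional.cross_entropy(model(images), labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```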
@ml_hardware
Abhi Venigalla
1 year
@suchenzang 1)+2), a lot has improved in the past few months. Many providers (@MosaicML included) will rent out 512+ A100s for a few weeks to months. We call it a "hero run". TL;DR renting 1024xA100 with 800+ Gbps for 70 days (LLaMa2-70B) is def possible without a 1yr commit. Pricing TBD ofc
3
0
25
@ml_hardware
Abhi Venigalla
6 months
@agihippo reka weights when?
1
1
23
@ml_hardware
Abhi Venigalla
6 months
@aaron_defazio This is dark magic... how is it so close to the pareto frontier throughout? Does this mean you can stop at any timestep and continue training without rewarming/doing something weird to the LR?
3
1
24
@ml_hardware
Abhi Venigalla
6 months
@simonw @zpete It has to do with KV cache allocation -- you have a finite amount of GPU memory, and when a request comes in you don't know when the generation will end, so you have to preallocate up to max_output_tokens. The bigger that value, the fewer concurrent users you can handle.
4
0
23
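A rough worked example of that trade-off (all model dimensions below are hypothetical, chosen only to show the shape of the math; prompt tokens are ignored for simplicity):

```python
# Memory preallocated for one request's KV cache in a 70B-class dense model, FP16.
n_layers = 80
n_kv_heads = 8            # grouped-query attention
head_dim = 128
bytes_per_value = 2       # fp16
max_output_tokens = 4096

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # K and V
per_request = per_token * max_output_tokens
print(f"{per_request / 2**30:.2f} GiB preallocated per request")     # ~1.25 GiB

kv_budget_gib = 40        # memory left over after weights + activations
print(int(kv_budget_gib * 2**30 // per_request), "concurrent requests")  # ~32
```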
@ml_hardware
Abhi Venigalla
2 years
@NamanGoyal21 On top of this, choosing better architectures + recipes can get you to the same quality much faster. Think Chinchilla + UL2 + MoE + Instruction-finetuning + ... By end of 2023, the cost to train a GPT-3 quality model will be less than $100k. Maybe a lot less.
2
3
24
@ml_hardware
Abhi Venigalla
2 years
@EMostaque Strongly agree. By 2033, it will be closer to 1,000,000x improvement. Even Jensen talks about 1,000,000x in the next decade and that’s only NVIDIA’s contribution! Better algorithms will push us much farther…
0
1
24
@ml_hardware
Abhi Venigalla
1 year
@Dantali84254624 got even closer on the param estimate!
@Dantali84254624
Dantalion
1 year
@abhi_venigalla I'd go a bit bigger: 1e25 flops, 331B parameters, $50m
1
0
11
0
0
23
@ml_hardware
Abhi Venigalla
11 months
First of all, what performance are we seeing today? On single-node (4xMI250), we see up to 165 TFLOP/s/GPU, up from 144 TFLOP/s/GPU in June. This is all thanks to software improvements in ROCm and a new FlashAttention-2 kernel from AMD.
Tweet media one
1
2
22
@ml_hardware
Abhi Venigalla
5 months
@AravSrinivas One last idea -- there are two very distinct types of oss users: 1) ppl at home trying to run a model for themselves, on like a consumer gpu rig or laptop 2) model hosting providers, who run on cloud gpus and offer apis to their customers. My read is type 1) prefers dense but
0
0
21
@ml_hardware
Abhi Venigalla
9 months
Now the story gets even better for LLM inference... First, we find that on LLaMa2-70B with 8-way TP and BF16, the 8xGaudi2 system ~matches the 8xH100 in decoding pace! The Gaudi2 software+hardware stack is able to achieve higher utilization of its memory bandwidth.
Tweet media one
Tweet media two
1
5
21
@ml_hardware
Abhi Venigalla
2 years
@sytelus Honestly? It's because they aren't yet available to rent in a major cloud. If they were, you would hear many more folks trying them out and doing heads-up comparisons with A100s. The native PyTorch support is sweet!
2
0
21
@ml_hardware
Abhi Venigalla
10 months
So who else is in South Padre tonight :D
Tweet media one
2
0
21
@ml_hardware
Abhi Venigalla
1 year
@giffmana If you use our streaming library you’ll get deterministic resumption and won’t have this problem :) even if changing world sizes
1
0
20
@ml_hardware
Abhi Venigalla
9 months
At Databricks GenAI we help customers build+deploy AI, on their own data, as efficiently as possible. Using the right HW for each workload is a big piece of that, so we'll continue to test, validate, and integrate new ML HW. If working on this excites you, DM me or @dskhudia !🙌
2
3
20
@ml_hardware
Abhi Venigalla
11 months
Last thing: a fun throwback if you haven't seen it before :D
@ml_hardware
Abhi Venigalla
1 year
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
Tweet media one
9
46
421
2
0
20
@ml_hardware
Abhi Venigalla
1 year
@realGeorgeHotz Hey @realGeorgeHotz, if you want a baseline for LLM training, I'm like 80% sure our repo will work out of the box on AMD GPUs. We just use PyTorch and FSDP. I don't have any AMD GPUs myself rn but we've had community members report that it works. Re.
2
1
20
@ml_hardware
Abhi Venigalla
11 months
You can find more info on AMD+Triton in their slides from this year's inaugural Triton Dev Conference:
1
1
20
@ml_hardware
Abhi Venigalla
1 year
Want to train even faster + cheaper? Check out our profiling results for LLMs on @nvidia H100! Already 3x faster with our current stack training in FP8
@DbrxMosaicAI
Databricks Mosaic Research
1 year
How good are @nvidia H100s actually? In collaboration with @CoreWeave , we benchmarked A100 vs H100 performance for large language model training. Here's what we found: [1/6]
2
51
226
1
2
19
@ml_hardware
Abhi Venigalla
3 years
@rabois Is that why I saw you @ blue bottle in fidi today 👀
1
2
19
@ml_hardware
Abhi Venigalla
1 year
@cis_female Try increasing layer widths and I bet you it'll hit over 50%!
2
0
17
@ml_hardware
Abhi Venigalla
11 months
And quick shoutout to @cpuhrsch and @cHHillee, who showed just 2 weeks ago at the @PyTorch Conference how to use `torch.compile()` (which uses Triton) to run LLM inference on NVIDIA or AMD using the same PyTorch code! (left=1xA100, right=half of 1xMI250)
Tweet media one
Tweet media two
1
3
17
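For reference, the entry point they demoed is a one-liner; a trivial stand-in sketch (not their benchmark code):

```python
import torch
import torch.nn as nn

# Any nn.Module works; TorchDynamo traces it and Inductor emits Triton kernels,
# so the same Python code targets NVIDIA or AMD GPUs.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="cuda")
y = compiled(x)
```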
@ml_hardware
Abhi Venigalla
1 year
You can even switch compute providers mid-run seamlessly! Deterministic, elastic, fast and zero switching cost.
Tweet media one
1
2
16
@ml_hardware
Abhi Venigalla
1 year
We've been working around the clock to build both the models **and the infrastructure** to make LLM training easy, efficient, and stress-free. Here's the MPT-7B training logbook. It's empty! Over 9.5 days on 440xA100s, there was zero human intervention. No more babysitting!
Tweet media one
Tweet media two
2
0
16
@ml_hardware
Abhi Venigalla
3 years
Awesome CV data-loading speedups! Seems very useful for keeping up with ever-faster GPUs like the A100. Small, fast models like ResNet18 also finally get unblocked. Also pumped to see @MosaicML numbers in the plots!
@aleks_madry
Aleksander Madry
3 years
ImageNet is the new CIFAR! My students made FFCV (), a drop-in data loading library for training models *fast* (e.g., ImageNet in half an hour on 1 GPU, CIFAR in half a minute). FFCV speeds up ~any existing training code (no training tricks needed) (1/3)
Tweet media one
29
390
2K
0
1
17
@ml_hardware
Abhi Venigalla
2 years
Thanks @karlfreund for the shoutout! We’re really excited about our BERT pre-training speedups. We can’t wait to try them out on H100s and break even more records :)
Tweet media one
2
2
17
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK 🎉 Day 3: Today’s all about ML hardware! My personal favorites are the Hot Chips socks (2020) and Dojo tshirt (Tesla AI Day 2022). Crop top is optional
Tweet media one
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK🎉 Day 2: Scaling Laws, circa 2023. You can buy this one online! Kudos @antimatter15 for the design, and @GoogleDeepMind for the math
Tweet media one
1
1
34
2
0
17
@ml_hardware
Abhi Venigalla
1 year
Our *open-source* LLM training stack keeps getting faster! And this is all done with pure FSDP and a custom HF model. No fancy parallelization necessary.
@vitaliychiley
Vitaliy Chiley
1 year
Updated @MosaicML LLM training throughput tables. Here are some highlights: - Best HFU: 73.63%!!! 🚀 13B w/ act ckpt - Best MFU: 62.09%!!! 🔥 3B w/out act ckpt - Train with SeqLen 65k 🤯 Details here: [1/5]
Tweet media one
5
21
146
0
1
17
@ml_hardware
Abhi Venigalla
6 months
Less than a year ago we were still training our first MPT-7B... Now we're building and serving MoEs at scale and just getting started :) DBRX is a demonstration of what our AI foundry can do. If your organization wants to build a custom AI model, come to Databricks and use all
1
0
17