Abhi Venigalla

@ml_hardware

5,600
Followers
1,315
Following
75
Media
913
Statuses

Researcher @Databricks. Former @MosaicML, @CerebrasSystems. Addicted to all things compute.

San Francisco, CA
Joined October 2018
Pinned Tweet
@ml_hardware
Abhi Venigalla
6 months
We built a new model! 🧱 It's called DBRX 🧱
* mixture of experts
* 16 choose 4 experts
* 36B active, 132B total
* trained on 12T tokens
* built e2e in 2 months
* using 3072xH100
* served up to 150 tok/s on @Databricks
* open weights :)
10
41
375
@ml_hardware
Abhi Venigalla
1 year
Ready for GPU independence weekend? PyTorch 2.0 and LLM Foundry now work out of the box on **AMD GPUs!** We profiled MPT 1B-13B models on AMD MI250 and saw perf within 80% of A100-40GB, which could go up to 94% with better software. It. Just. Works.
22
212
1K
@ml_hardware
Abhi Venigalla
1 year
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
Tweet media one
9
46
421
@ml_hardware
Abhi Venigalla
11 months
Back in June, we at @MosaicML showed that our LLM Foundry training stack runs seamlessly on @AMD MI250 GPUs. Today, I'm happy to share that we've scaled up to 128xMI250, with great multi-node performance!
Tweet media one
14
61
392
@ml_hardware
Abhi Venigalla
6 months
This is literally my new LK-99 🙏🙏🙏
@aaron_defazio
Aaron Defazio
6 months
Update: more experimental results rolling in. Here it is against SGD with both the step-wise and cosine schedules (both baselines heavily tuned, no cheating). This is something special indeed!
Tweet media one
48
62
626
6
27
323
@ml_hardware
Abhi Venigalla
2 years
We're coming for all the models! This week our Vision team profiled Stable Diffusion on @MosaicML Cloud and found that training from scratch costs <$160k, and can be done in under 2 weeks.
12
35
244
@ml_hardware
Abhi Venigalla
2 years
@karpathy The @MosaicML perf team just tried this out and... totally confirmed 🤯 GPT-1.3B MFU went from 49% -> 53%
Tweet media one
5
10
196
@ml_hardware
Abhi Venigalla
6 months
If you have Apple silicon and >70GB of RAM, you can run DBRX on your laptop!! Kudos to @awnihannun :)
7
20
185
@ml_hardware
Abhi Venigalla
1 year
Our Vision team is insane. The original Stable Diffusion reportedly cost $600k... and now we've reproduced it for $50k🤯 and it took <1 week to train! All the training code is open-source! And we make it super fast + easy to customize on your own private data @MosaicML
@jefrankle
Jonathan Frankle
1 year
And now it's <$50k. 🖼️ Announcing @MosaicML's diffusion offering 📷 We replicated Stable Diffusion 2.0, training from scratch with a huge speedup, and we can do it on your data too. Human eval showed the model to be indistinguishable from the original. Blog:
8
29
283
2
15
164
@ml_hardware
Abhi Venigalla
6 months
@francoisfleuret The 30x is real and comes from this technical brief, page 15: How is 30x possible given that GB200 has only a ~2.3x increase in memBW and FLOP/s over H100? It involves comparing per-chip generation throughput = output_tokens/s/chip. The two systems compared are
8
12
156
@ml_hardware
Abhi Venigalla
1 year
@julien_c Why is the training so slow? Your screenshot shows 25% MFU. Our users on MosaicML get 40%+ for the same workload on H100s. Screenshot MFU = 6 * 30e9 * 600e9 / 500 / 10 / 3600 / 24 / 1e15 = 0.25. Time to train on HF: 10 days. Time to train on MosaicML: **6.25 days**
7
12
139
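A back-of-the-envelope version of that MFU arithmetic, spelled out (a minimal sketch; reading the constants as 30B params, 600B tokens, 500 H100s, 10 days, and ~1e15 peak FLOP/s per H100 is my interpretation of the tweet, not an official breakdown):

```python
# Rough model-FLOPs-utilization (MFU) estimate using the tweet's numbers.
params = 30e9                 # model size
tokens = 600e9                # training tokens
n_gpus = 500                  # assumed H100 count
days = 10                     # wall-clock time from the screenshot
peak_flops_per_gpu = 1e15     # ~1000 TFLOP/s BF16 per H100

total_train_flops = 6 * params * tokens                      # fwd + bwd ≈ 6·N·D
available_flops = n_gpus * days * 24 * 3600 * peak_flops_per_gpu
print(f"MFU ≈ {total_train_flops / available_flops:.2f}")    # ≈ 0.25
```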
@ml_hardware
Abhi Venigalla
10 months
i love you all = ilya
@sama
Sam Altman
10 months
i love you all. today was a weird experience in many ways. but one unexpected one is that it has been sorta like reading your own eulogy while you’re still alive. the outpouring of love is awesome. one takeaway: go tell your friends how great you think they are.
4K
6K
90K
4
4
131
@ml_hardware
Abhi Venigalla
2 years
Been hinting at this blog for a while and it's finally here! The Streaming team at @MosaicML has built an open-source library (`mosaicml-streaming`) for efficiently loading training data from object stores like S3, GCS, OCI, and more.
3
22
130
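For readers who haven't seen the library, a minimal sketch of how it's typically used (assumes the data has already been written to the library's shard format; the bucket path and batch size are placeholders):

```python
# pip install mosaicml-streaming
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # object store holding the pre-written shards
    local="/tmp/streaming-cache",        # local cache for downloaded shards
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=8)

for batch in loader:
    ...  # training step
```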
@ml_hardware
Abhi Venigalla
1 year
One fun tidbit -- yes, with PyTorch you still run `torch.cuda` on AMD systems, and yes, it does work 😆
Tweet media one
2
6
98
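Concretely, on a ROCm build of PyTorch the usual CUDA-flavored calls simply target the AMD GPU (a tiny sketch; nothing AMD-specific appears in the code):

```python
import torch

print(torch.cuda.is_available())            # True on an MI250 box with ROCm PyTorch
print(torch.cuda.get_device_name(0))        # reports the AMD device
x = torch.randn(1024, 1024, device="cuda")  # "cuda" still means "the GPU"
y = x @ x
```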
@ml_hardware
Abhi Venigalla
1 year
@mustafasuleymn For those curious about the benchmark: it's not quite the full GPT-3. The MLPerf LLM benchmark gives you a checkpoint to start from and then you have to hit a target loss. In practice, it looks like the 11min run was over **1.2B** tokens, not the full 300B tokens:
0
14
93
@ml_hardware
Abhi Venigalla
1 year
Alright who wants to try and guess the compute/cost/params for PaLM2-L? No prizes (b/c obv I don't know) but with enough responses we might get a reasonable estimate (which is reward enough 😝) I'll start:
* 6e24 FLOPs
* $22M
* 250B params
paper:
14
8
79
@ml_hardware
Abhi Venigalla
1 year
What about perf? We only had 1 node of 4xMI250, so to compare with our 8xA100 systems we measured per-GPU metrics. With no code changes, perf on MI250 looks really strong! About 80% of A100-40GB. Better FlashAttention for AMD may close the gap (we predict ~94% of A100-40GB)
Tweet media one
2
11
72
@ml_hardware
Abhi Venigalla
1 year
This is all made possible thanks to a software and hardware stack that AMD has been building for years, and is now bearing fruit. Seeing MI250 work so well today brings hope that the MI300x will too when it arrives!
Tweet media one
1
6
69
@ml_hardware
Abhi Venigalla
5 months
@AravSrinivas I don't think there's an improvement in compression, it's just more flops... Like when you plot training flops vs. benchmark accuracy, Llama-3-8b-base is much worse than it could have been with the same compute (compare to llama-2-70b-base, which is iso-flop) because the
4
8
65
@ml_hardware
Abhi Venigalla
1 year
Here's MPT-1B training for 1B tokens on NVIDIA A100 (green) vs. AMD MI250 (red). Can you spot a difference? Both runs use the exact same LLM Foundry code:
Tweet media one
Tweet media two
1
3
64
@ml_hardware
Abhi Venigalla
1 year
Keep an eye on LLM Foundry where we will add pre-built Docker images with ROCm FlashAttention to make the AMD setup process even faster. We'll also be profiling MPT on larger MI250 clusters soon! Lastly, @LisaSu any chance we can get early access to MI300x? 🙏
0
3
59
@ml_hardware
Abhi Venigalla
1 year
Mistral-7B looks pretty fire 🔥 but how many tokens was it trained on? Let's crowdsource some estimates! I'll bet ~7T tokens, since it's beating LLaMa2-13B (2T tokens) and it's a 7B model, which should be harder to scale. Also tokens = 1000x params is a nice round number :)
8
2
57
@ml_hardware
Abhi Venigalla
1 year
soooooo close. Turns out extrapolating eval scores kinda works!
@ml_hardware
Abhi Venigalla
1 year
Alright who wants to try and guess the compute/cost/params for PaLM2-L? No prizes (b/c obv I don't know) but with enough responses we might get a reasonable estimate (which is reward enough 😝) I'll start:
* 6e24 FLOPs
* $22M
* 250B params
paper:
14
8
79
4
3
55
@ml_hardware
Abhi Venigalla
1 year
If we zoom in on the first 100 batches, we get nearly overlapping loss curves. This is crazy given that the runs are on two totally different hardware stacks! StreamingDataset and Composer do a lot of heavy lifting for determinism in the dataloader and train loop.
Tweet media one
1
2
53
@ml_hardware
Abhi Venigalla
1 year
Tired of expensive data egress fees? Check out how @Cloudflare R2 object storage + the @MosaicML platform let you stream data + checkpoints with **zero egress cost** and train models anywhere.
@CloudflareDev
Cloudflare Developers
1 year
Today, we’re excited to show how @MosaicML tools and Cloudflare R2 can be used together to orchestrate training of LLMs across multiple clouds with zero switching costs.
0
4
44
2
12
49
@ml_hardware
Abhi Venigalla
2 years
Excited to kick off our LLM blog series here at @MosaicML! Check out how easy and fast it is to train GPT3-1.3B, and stay tuned as we scale up to bigger and better models :D
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Large Language Models (LLMs) are gaining in popularity, but training these models from scratch can be a huge pain... until now! Our latest LLM blog series uncovers how to reduce the time, cost, and complexity of training these billion-parameter models:
1
40
251
0
7
51
@ml_hardware
Abhi Venigalla
6 months
Tweet media one
2
0
51
@ml_hardware
Abhi Venigalla
2 years
Check out what our collaborators at @StanfordHAI built using @MosaicML Cloud! We took a regular GPT-2.7B, pretrained it from scratch on PubMed data, and hit SOTA on MedQA-USMLE. Took <1 week and cost <$40k to train. If you want to build your own custom models, talk to us!
@DbrxMosaicAI
Databricks Mosaic Research
2 years
Meet PubMed GPT 🩺 a new SOTA on the US Medical Licensing Exam developed by MosaicML and @StanfordHAI . It's a normal GPT-3B model trained on medical data that bests hand-designed med models and generic models 40x bigger, a sweet spot for foundation models🧵
12
132
518
1
8
46
@ml_hardware
Abhi Venigalla
7 months
What is going on with arc-challenge evals? Lots of great new models report scores in the high 80s-90s in their blogs. But then OSS eval frameworks like the @AiEleuther harness and the @MosaicML gauntlet seem to report lower scores... Clearest example is Mixtral-8x7B: * blog post: 0.858
1
6
51
@ml_hardware
Abhi Venigalla
1 year
Overall, I'm super optimistic about the future for AI hardware. More options means more compute supply, more market pressure on prices, and lower costs for users. If your hardware supports PyTorch 2.0 too ( @HabanaLabs ???) reach out to us and we would love to showcase it!
1
2
48
@ml_hardware
Abhi Venigalla
1 year
@AudioBooksRU At 50% MFU, this is about 545k A100-days. So 10k A100s running for ~2mo, but ofc this is Google so it was probably done on TPU pods
1
1
46
@ml_hardware
Abhi Venigalla
11 months
One ongoing story I'm really excited about is the Triton compiler, which AMD has been investing a lot into. The end result: you can write 1 Triton kernel, and run it at high perf on NVIDIA or AMD GPUs! Here's the current (fwd) perf of a Triton FA-2 kernel on A100 vs. MI250:
Tweet media one
2
7
47
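The "write one kernel, run on either vendor" point is easiest to see with a toy kernel rather than the FlashAttention-2 kernel in the plot; here is a minimal sketch (the standard Triton vector add) that runs unchanged on CUDA or ROCm builds of PyTorch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Same code path whether "cuda" is an NVIDIA GPU or an AMD GPU under ROCm.
x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```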
@ml_hardware
Abhi Venigalla
1 year
no taxation without representation 🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸🇺🇸 no computation without competition
3
4
45
@ml_hardware
Abhi Venigalla
1 year
@sharifshameem In the olden days they trained 125M models for 300B tokens (2400x). This was probably too much haha... The Chinchilla recommendation is ~20x. LLaMa2-7B is 7B params for 2T tokens (286x), and doesn't look saturated! But somewhere between 286x and 2400x it's probably enough :P
Tweet media one
Tweet media two
1
1
43
@ml_hardware
Abhi Venigalla
10 months
Honestly insane that you can get this close <3
Tweet media one
2
0
43
@ml_hardware
Abhi Venigalla
2 years
Our LLM stack got a big upgrade this week thanks to @DohmannJeremy 🎉 In-Context-Learning eval is now super fast and scalable. My favorite part -- it's so fast, we can measure metrics like LAMBADA/PIQA *live* while training, no need to checkpoint/use a separate eval harness.
Tweet media one
1
1
38
@ml_hardware
Abhi Venigalla
10 months
iykyk
Tweet media one
5
0
38
@ml_hardware
Abhi Venigalla
1 year
@andrew_n_carr See below for details: basically it is GPT3-175B trained for **1.2B** tokens, not the full 300B tokens. Would be ~2 days for the full run, still super impressive!
@ml_hardware
Abhi Venigalla
1 year
@mustafasuleymn For those curious about the benchmark: it's not quite the full GPT-3. The MLPerf LLM benchmark gives you a checkpoint to start from and then you have to hit a target loss. In practice, it looks like the 11min run was over **1.2B** tokens, not the full 300B tokens:
0
14
93
1
2
36
@ml_hardware
Abhi Venigalla
11 months
@ylecun @xlr8harder @geoffreyhinton 100x is roughly 1e27 FLOPs. An H100 can do 500e12 FLOP/s (at a conservative 0.25 MFU). 1e27 / 500e12 / 3600 / 24 = 23M H100-days, aka 200k H100s for ~4 months. This will be done within 18mo
4
2
37
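The same arithmetic as a short script, for anyone who wants to tweak the assumptions:

```python
total_flops = 1e27               # the "100x" training run
flops_per_gpu = 500e12           # H100 at a conservative ~0.25 MFU

seconds = total_flops / flops_per_gpu
h100_days = seconds / 3600 / 24
print(f"{h100_days / 1e6:.0f}M H100-days")               # ~23M
print(f"{h100_days / 200_000:.0f} days on 200k H100s")   # ~116 days ≈ 4 months
```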
@ml_hardware
Abhi Venigalla
2 years
@robertnishihara @Meta Sorry but I think the throughput comparison is incorrect. Meta is reporting 147 **Model** TFLOPs/GPU and you are reporting 179 **Hardware** TFLOPs/GPU. In your blog, Ray+JAX+Alpa only achieves 102.4 **Model** TFLOPs/GPU. Which is ~30% slower than Meta + PyTorch + FSDP.
4
1
36
@ml_hardware
Abhi Venigalla
11 months
Thanks to the software+hardware stack that AMD has been building, we didn't have to make any code changes -- we just ran LLM Foundry with the ROCm 5.7 + PyTorch docker image, and everything just works.
Tweet media one
1
2
36
@ml_hardware
Abhi Venigalla
4 months
@cHHillee @finbarrtimbers +++ 1) is the biggest issue. Training transformers = big dense matmuls, and GPUs are already close to the limit for big dense matmuls. There are no 10x's on the table. Not even 2x's imo. If you want a training 10x, you need to *change the workload* away from dense matmuls
3
1
35
@ml_hardware
Abhi Venigalla
1 year
Alright folks you know what time it is 🎉 ML FASHION WEEK 🎉 Day 1: Hopper Cowboys, vintage 2023. Like New. Attainable only through blood sacrifice to H100s
Tweet media one
1
3
35
@ml_hardware
Abhi Venigalla
6 months
@abacaj OTOH, if your desire is to use totally-local ~10B models, then yes the quality is looking pretty saturated to me. I've said it before but when open GPT-4 class models come out later this year, I expect them to be in the 300B-500B param range. At least. Nobody's ready for it...
2
3
33
@ml_hardware
Abhi Venigalla
1 year
Will be at ICML next week Mon-Sun! ✈️ DM me if you want to catch up and talk about LLMs, AI hardware, or efficient training/inference!
0
6
34
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK🎉 Day 2: Scaling Laws, circa 2023. You can buy this one online! Kudos @antimatter15 for the design, and @GoogleDeepMind for the math
Tweet media one
@ml_hardware
Abhi Venigalla
1 year
Alright folks you know what time it is 🎉 ML FASHION WEEK 🎉 Day 1: Hopper Cowboys, vintage 2023. Like New. Attainable only through blood sacrifice to H100s
Tweet media one
1
3
35
1
1
34
@ml_hardware
Abhi Venigalla
2 years
I’m in New Orleans for #NeurIPS this week! Hmu if you want to eat oysters and talk about LLMs 😊
3
3
32
@ml_hardware
Abhi Venigalla
3 years
Excited to share what we do @MosaicML ! We build and compose methods to train better ML models faster. We are very focused on efficiency as a developer sees it: [Quality] / [$, Time] (also: 10mo ago I read a tweet, DM'd a stranger, joined a startup, and I couldn't be happier :)
@NaveenGRao
Naveen Rao
3 years
10 months ago I tweeted that we were getting a new project off the ground…today I’m proud to announce with @hanlintang @jefrankle and @mcarbin that MosaicML comes out of stealth! We are focused on the compute efficiency of neural network training using algorithmic methods.👇
26
19
160
1
4
31
@ml_hardware
Abhi Venigalla
1 year
@jefrankle @togethercompute Made possible thanks to our lightning fast eval framework :)
2
2
30
@ml_hardware
Abhi Venigalla
1 year
"LLMs can't do long context length" 👀👀👀
@NaveenGRao
Naveen Rao
1 year
🤯🤯 LLM trained with 64K+ context length! What could you do with that? Prompted our model with the ENTIRE contents of "The Great Gatsby" and asked it to write the epilogue. Snippet 👇 Model dropping soon to an open-source repo near you. Epilogue: It seemed to me that Gatsby
41
89
675
1
5
30
@ml_hardware
Abhi Venigalla
1 year
@ryan_landay @realGeorgeHotz I think George has been working on consumer AMD GPUs, this effort was totally separate. Basically we found that PyTorch worked out of the box on MI250. Took <1 hr from getting the hardware to training MPT.
1
0
30
@ml_hardware
Abhi Venigalla
1 year
@NCSilberman @soumithchintala @MosaicML @PyTorch We plan to do an MI300x vs. H100 comparison when the time is right. We wanted to compare "same-generation" for now so MI250 vs. A100. We also separately did some H100 profiling in April:
0
3
29
@ml_hardware
Abhi Venigalla
1 year
@realGeorgeHotz any chance we could try this out on a tinybox? If PyTorch 2.0 + ROCm 5.4 installs then I think there's a good chance!
3
2
26
@ml_hardware
Abhi Venigalla
1 year
@bernhardsson Yes but the 2nd-order effect is what @cHHillee wrote excellently here: TL;DR
* yes, you could fit large models on one node
* but training would take centuries
* so you get a cluster of GPUs to compress time
* but you can't scale BS forever without hurting
@cHHillee
Horace He
1 year
@AravSrinivas @_mohansolo @bernhardsson Technically, the dependency actually goes the other way. The more tokens you have, the worse interconnect bandwidth you can get away with. In practice, this statement is true: "The larger/better the model you want to train is, the better your interconnect needs to be." But the
2
2
31
1
0
28
@ml_hardware
Abhi Venigalla
1 year
@bernhardsson I don't think it's much related to model size, as others have commented. It's just that as you train bigger models, you start scaling past a single node, so you *notice* the BW reqs. To train a small GPT-125M efficiently multi-node data parallel, you want a device BS ~8k tok/A100 and
2
1
27
@ml_hardware
Abhi Venigalla
2 years
@karpathy @MosaicML We had one of these moments a few days ago re. microbatch size… most trainers require MBS to be a divisor of global batch size, so if 8 OOMs you have to go down to 4… …but Composer doesn’t have any limitation. So we tried MBS=7 and it works fine and gets you +2% MFU 🤣
1
0
27
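Composer handles this internally; a hand-rolled sketch of the same idea (uneven microbatches in gradient accumulation, with hypothetical names) shows why the microbatch size need not divide the global batch size:

```python
import torch

def accumulate_step(model, loss_fn, optimizer, batch_x, batch_y, microbatch_size=7):
    """Gradient accumulation over microbatches that need not divide the batch size.

    A global batch of 32 with microbatch_size=7 is split into chunks of 7,7,7,7,4;
    weighting each chunk's (mean) loss by its share of the batch keeps the final
    gradient identical to the full-batch gradient.
    """
    optimizer.zero_grad()
    n = batch_x.shape[0]
    for x, y in zip(batch_x.split(microbatch_size), batch_y.split(microbatch_size)):
        loss = loss_fn(model(x), y) * (x.shape[0] / n)
        loss.backward()            # grads accumulate across microbatches
    optimizer.step()
```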
@ml_hardware
Abhi Venigalla
11 months
Overall, I could not be more excited for @AMD 's future in AI hardware. Looking ahead to what we know publicly about MI300X... there might be a new king in town soon! 👑 Follow us @MosaicML and @databricks for more news, and ask your CSPs about MI300X!
Tweet media one
1
4
27
@ml_hardware
Abhi Venigalla
1 year
Holy moly, 9 supercomputers with 64xCS-2 each!!!
@CerebrasSystems
Cerebras
1 year
📣 Today we are announcing Condor Galaxy-1: a 4 exaflop AI supercomputer built in partnership with @G42ai . Powered by 64 Cerebras CS-2 systems, 54M cores, and 82TB of memory – it's the largest AI supercomputer we've ever built. But that's not all: CG-1 is just the start..
Tweet media one
18
85
366
1
0
27
@ml_hardware
Abhi Venigalla
1 year
@HamelHusain Just want to caveat -- device_map puts different layers on different GPUs to spread the memory, but *it does not split the work*, aka the model forward is not sped up at all. The right solution is tensor parallelism, but that's more manual to implement. Torch2 will help solve this
Tweet media one
1
0
26
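For context on the caveat, a sketch of the Hugging Face `device_map="auto"` pattern being discussed (the model name is just an example; requires `accelerate` installed): layers are spread across GPUs to fit in memory, but a forward pass still visits them one GPU at a time.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" shards *memory* across visible GPUs (layers 0..k on GPU 0,
# the rest on GPU 1, ...), but each forward pass runs those layers sequentially,
# so latency is not reduced the way tensor parallelism would reduce it.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

inputs = tokenizer("Hello", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```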
@ml_hardware
Abhi Venigalla
6 months
@abacaj imo it's simultaneously:
* building great models takes lots of data and compute and tuning and crazy infra. It IS getting harder
* BUT rather than keep all this infra to ourselves, we are making it a *product*. It's the foundry, not the model
So anyone can come to Databricks
1
5
26
@ml_hardware
Abhi Venigalla
1 year
@LigengZhu @NaveenGRao We'll share more details later this week, but basically, with the right framework and arch it just kinda works. FSDP, activation checkpointing and microbatch_size=1. We were profiling toy models like this one back in October. Surprised no one else trained a full one yet :)
3
2
25
@ml_hardware
Abhi Venigalla
3 years
Lots of useful tips in here for efficient CNN training!
* mixed precision
* NHWC
* stepwise learning rates
* balancing CPU/GPU data aug
* Pillow-SIMD >> Pillow
@DbrxMosaicAI
Databricks Mosaic Research
3 years
New blog post! Take a look at some best practices for efficient CNN training, and find out how you can apply them easily with our Composer library: #EfficientML
1
18
80
0
7
24
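Two of the tips from the tweet above (mixed precision and NHWC / channels-last) in plain PyTorch, as a minimal sketch; the remaining tips are data-pipeline and schedule choices:

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)  # NHWC
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
labels = torch.randint(0, 1000, (32,), device="cuda")

with torch.autocast("cuda", dtype=torch.float16):        # mixed precision
    loss = torch.nn.functional.cross_entropy(model(images), labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```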
@ml_hardware
Abhi Venigalla
1 year
@suchenzang 1)+2), a lot has improved in the past few months. Many providers (@MosaicML included) will rent out 512+ A100s for a few weeks to months. We call it a "hero run". TL;DR renting 1024xA100 with 800+ Gbps for 70 days (LLaMa2-70B) is def possible without a 1yr commit. Pricing TBD ofc
3
0
25
@ml_hardware
Abhi Venigalla
6 months
@agihippo reka weights when?
1
1
23
@ml_hardware
Abhi Venigalla
6 months
@aaron_defazio This is dark magic... how is it so close to the pareto frontier throughout? Does this mean you can stop at any timestep and continue training without rewarming/doing something weird to the LR?
3
1
24
@ml_hardware
Abhi Venigalla
6 months
@simonw @zpete It has to do with KV cache allocation -- you have a finite amount of GPU memory, and when a request comes in you don't know when the generation will end, so you have to preallocate up to max_output_tokens. The bigger that value, the fewer concurrent users you can handle.
4
0
23
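A rough worked example of that trade-off (all model dimensions below are hypothetical, chosen only to show the shape of the math; prompt tokens are ignored for simplicity):

```python
# Memory preallocated for one request's KV cache in a 70B-class dense model, FP16.
n_layers = 80
n_kv_heads = 8            # grouped-query attention
head_dim = 128
bytes_per_value = 2       # fp16
max_output_tokens = 4096

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # K and V
per_request = per_token * max_output_tokens
print(f"{per_request / 2**30:.2f} GiB preallocated per request")     # ~1.25 GiB

kv_budget_gib = 40        # memory left over after weights + activations
print(int(kv_budget_gib * 2**30 // per_request), "concurrent requests")  # ~32
```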
@ml_hardware
Abhi Venigalla
2 years
@NamanGoyal21 On top of this, choosing better architectures + recipes can get you to the same quality much faster. Think Chinchilla + UL2 + MoE + Instruction-finetuning + ... By end of 2023, the cost to train a GPT-3 quality model will be less than $100k. Maybe a lot less.
2
3
24
@ml_hardware
Abhi Venigalla
2 years
@EMostaque Strongly agree. By 2033, it will be closer to 1,000,000x improvement. Even Jensen talks about 1,000,000x in the next decade and that’s only NVIDIA’s contribution! Better algorithms will push us much farther…
0
1
24
@ml_hardware
Abhi Venigalla
1 year
@Dantali84254624 got even closer on the param estimate!
@Dantali84254624
Dantalion
1 year
@abhi_venigalla I'd go a bit bigger: 1e25 flops, 331B parameters, $50m
1
0
11
0
0
23
@ml_hardware
Abhi Venigalla
11 months
First of all, what performance are we seeing today? On single-node (4xMI250), we see up to 165 TFLOP/s/GPU, up from 144 TFLOP/s/GPU in June. This is all thanks to software improvements in ROCm and a new FlashAttention-2 kernel from AMD.
Tweet media one
1
2
22
@ml_hardware
Abhi Venigalla
5 months
@AravSrinivas One last idea -- there are two very distinct types of oss users: 1) ppl at home trying to run a model for themselves, on like a consumer gpu rig or laptop 2) model hosting providers, who run on cloud gpus and offer apis to their customers. My read is type 1) prefers dense but
0
0
21
@ml_hardware
Abhi Venigalla
9 months
Now the story gets even better for LLM inference... First, we find that on LLaMa2-70B with 8-way TP and BF16, the 8xGaudi2 system ~matches the 8xH100 in decoding pace! The Gaudi2 software+hardware stack is able to achieve higher utilization of its memory bandwidth.
Tweet media one
Tweet media two
1
5
21
@ml_hardware
Abhi Venigalla
2 years
@sytelus Honestly? It's because they aren't yet available to rent in a major cloud. If they were, you would hear many more folks trying them out and doing heads-up comparisons with A100s. The native PyTorch support is sweet!
2
0
21
@ml_hardware
Abhi Venigalla
10 months
So who else is in South Padre tonight :D
Tweet media one
2
0
21
@ml_hardware
Abhi Venigalla
1 year
@giffmana If you use our streaming library you’ll get deterministic resumption and won’t have this problem :) even if changing world sizes
1
0
20
@ml_hardware
Abhi Venigalla
9 months
At Databricks GenAI we help customers build+deploy AI, on their own data, as efficiently as possible. Using the right HW for each workload is a big piece of that, so we'll continue to test, validate, and integrate new ML HW. If working on this excites you, DM me or @dskhudia !🙌
2
3
20
@ml_hardware
Abhi Venigalla
11 months
Last thing: a fun throwback if you haven't seen it before :D
@ml_hardware
Abhi Venigalla
1 year
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
Tweet media one
9
46
421
2
0
20
@ml_hardware
Abhi Venigalla
1 year
@realGeorgeHotz Hey @realGeorgeHotz, if you want a baseline for LLM training, I'm like 80% sure our repo will work out of the box on AMD GPUs. We just use PyTorch and FSDP. I don't have any AMD GPUs myself rn but we've had community members report that it works. Re.
2
1
20
@ml_hardware
Abhi Venigalla
11 months
You can find more info on AMD+Triton in their slides from this year's inaugural Triton Dev Conference:
1
1
20
@ml_hardware
Abhi Venigalla
1 year
Want to train even faster + cheaper? Check out our profiling results for LLMs on @nvidia H100! Already 3x faster with our current stack training in FP8
@DbrxMosaicAI
Databricks Mosaic Research
1 year
How good are @nvidia H100s actually? In collaboration with @CoreWeave , we benchmarked A100 vs H100 performance for large language model training. Here's what we found: [1/6]
2
51
226
1
2
19
@ml_hardware
Abhi Venigalla
3 years
@rabois Is that why I saw you @ blue bottle in fidi today 👀
1
2
19
@ml_hardware
Abhi Venigalla
1 year
@cis_female Try increasing layer widths and I bet you it'll hit over 50%!
2
0
17
@ml_hardware
Abhi Venigalla
11 months
And quick shoutout to @cpuhrsch and @cHHillee, who showed just 2 weeks ago at the @PyTorch Conference how to use `torch.compile()` (which uses Triton) to run LLM inference on NVIDIA or AMD using the same PyTorch code! (left=1xA100, right=half of 1xMI250)
Tweet media one
Tweet media two
1
3
17
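For reference, the entry point they demoed is a one-liner; a trivial stand-in sketch (not their benchmark code):

```python
import torch
import torch.nn as nn

# Any nn.Module works; TorchDynamo traces it and Inductor emits Triton kernels,
# so the same Python code targets NVIDIA or AMD GPUs.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="cuda")
y = compiled(x)
```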
@ml_hardware
Abhi Venigalla
1 year
You can even switch compute providers mid-run seamlessly! Deterministic, elastic, fast and zero switching cost.
Tweet media one
1
2
16
@ml_hardware
Abhi Venigalla
1 year
We've been working around the clock to build both the models **and the infrastructure** to make LLM training easy, efficient, and stress-free. Here's the MPT-7B training logbook. It's empty! Over 9.5 days on 440xA100s, there was zero human intervention. No more babysitting!
Tweet media one
Tweet media two
2
0
16
@ml_hardware
Abhi Venigalla
3 years
Awesome CV data-loading speedups! Seems very useful for keeping up with ever-faster GPUs like the A100. Small, fast models like ResNet18 also finally get unblocked. Also pumped to see @MosaicML numbers in the plots!
@aleks_madry
Aleksander Madry
3 years
ImageNet is the new CIFAR! My students made FFCV (), a drop-in data loading library for training models *fast* (e.g., ImageNet in half an hour on 1 GPU, CIFAR in half a minute). FFCV speeds up ~any existing training code (no training tricks needed) (1/3)
Tweet media one
29
390
2K
0
1
17
@ml_hardware
Abhi Venigalla
2 years
Thanks @karlfreund for the shoutout! We’re really excited about our BERT pre-training speedups. We can’t wait to try them out on H100s and break even more records :)
Tweet media one
2
2
17
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK 🎉 Day 3: Today’s all about ML hardware! My personal favorites are the Hot Chips socks (2020) and Dojo tshirt (Tesla AI Day 2022). Crop top is optional
Tweet media one
@ml_hardware
Abhi Venigalla
1 year
🎉 ML FASHION WEEK🎉 Day 2: Scaling Laws, circa 2023. You can buy this one online! Kudos @antimatter15 for the design, and @GoogleDeepMind for the math
Tweet media one
1
1
34
2
0
17
@ml_hardware
Abhi Venigalla
1 year
Our *open-source* LLM training stack keeps getting faster! And this is all done with pure FSDP and a custom HF model. No fancy parallelization necessary.
@vitaliychiley
Vitaliy Chiley
1 year
Updated @MosaicML LLM training throughput tables. Here are some highlights: - Best HFU: 73.63%!!! 🚀 13B w/ act ckpt - Best MFU: 62.09%!!! 🔥 3B w/out act ckpt - Train with SeqLen 65k 🤯 Details here: [1/5]
Tweet media one
5
21
146
0
1
17
@ml_hardware
Abhi Venigalla
6 months
Less than a year ago we were still training our first MPT-7B... Now we're building and serving MoEs at scale and just getting started :) DBRX is a demonstration of what our AI foundry can do. If your organization wants to build a custom AI model, come to Databricks and use all
1
0
17