Tim Dettmers

@Tim_Dettmers

31,803
Followers
902
Following
126
Media
3,213
Statuses

Research Scientist @allen_ai and incoming professor @CarnegieMellon. I blog about deep learning and PhD life at .

Seattle, WA
Joined October 2012
Pinned Tweet
@Tim_Dettmers
Tim Dettmers
2 months
After 7 months on the job market, I am happy to announce:
- I joined @allen_ai
- Professor at @CarnegieMellon from Fall 2025
- New bitsandbytes maintainer @Titus_vK
My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵
152
85
2K
@Tim_Dettmers
Tim Dettmers
1 year
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark: Paper: Code+Demo: Samples: Colab:
Tweet media one
90
948
4K
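For context, a minimal sketch of what 4-bit QLoRA finetuning looks like with the Hugging Face stack (transformers, peft, bitsandbytes). The model name and LoRA hyperparameters below are placeholders, not necessarily the paper's exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder; any causal LM works

# Load the frozen base model in 4-bit NF4 with bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```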
@Tim_Dettmers
Tim Dettmers
3 years
I am excited to share my latest work: 8-bit optimizers – a replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance📈, no hyperparam tuning needed 🔢. 🧵/n Paper: Library: Video:
Tweet media one
18
283
1K
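A minimal usage sketch of the drop-in swap described above, assuming a CUDA machine with bitsandbytes installed; the toy model and hyperparameters are placeholders.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# Swap torch.optim.Adam for the 8-bit version; the call signature is the same
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()       # optimizer states are stored in 8-bit, saving most of the memory
optimizer.zero_grad()
```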
@Tim_Dettmers
Tim Dettmers
1 year
@karpathy Super excited to push this even further: - Next week: bitsandbytes 4-bit closed beta that allows you to finetune 30B/65B LLaMA models on a single 24/48 GB GPU (no degradation vs full fine-tuning in 16-bit) - Two weeks: Full release of code, paper, and a collection of 65B models
39
193
1K
@Tim_Dettmers
Tim Dettmers
1 year
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ: 🧵
Tweet media one
36
302
1K
@Tim_Dettmers
Tim Dettmers
2 years
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More: Paper: Software: Emergence:
Tweet media one
17
250
1K
@Tim_Dettmers
Tim Dettmers
5 years
How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
16
311
1K
@Tim_Dettmers
Tim Dettmers
2 years
We release the public beta for bnb-int8🟪 for all @huggingface 🤗models, which allows for Int8 inference without performance degradation up to scales of 176B params 📈. You can run OPT-175B/BLOOM-176B easily on a single machine 🖥️. You can try it here: 1/n
Tweet media one
27
228
923
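A minimal sketch of what the Int8 integration looks like from the Hugging Face side; the model name is a placeholder, and depending on your transformers version the 8-bit flag may need to go through a `BitsAndBytesConfig` instead of the `load_in_8bit` shortcut.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # placeholder; the release demos OPT-175B/BLOOM-176B

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Weights are quantized to int8 at load time; outlier features are handled in 16-bit internally
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

inputs = tokenizer("The new bitsandbytes release", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```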
@Tim_Dettmers
Tim Dettmers
4 years
Updated GPU recommendations for the new Ampere RTX 30 series are live! Performance benchmarks, architecture details, Q&A of frequently asked questions, and detailed explanations of how GPUs and Tensor Cores work for those that want to learn more:
Tweet media one
31
252
890
@Tim_Dettmers
Tim Dettmers
2 years
In the RTX 40 post, I introduce a GPU recommendation chart and discuss the new Tensor Memory Accelerator (TMA) and FP8 computation. Overall, RTX 40s are faster for inference and shine through their FP8 performance but are inefficient for 16-bit training.
Tweet media one
38
171
886
@Tim_Dettmers
Tim Dettmers
1 year
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up:
25
158
868
@Tim_Dettmers
Tim Dettmers
1 year
The result of long days of CUDA optimizations: the new bitsandbytes release includes 4-bit inference, which is up to 4.2x faster than 16-bit inference (bsz=1). Full HF integration for all models. No code change needed. Bnb is growing rapidly, just shy of 1M installs/month🧵
Tweet media one
24
144
865
@Tim_Dettmers
Tim Dettmers
2 years
Finished RTX 4090 modeling ...not good 😐. If you have an RTX 3090, probably best to wait 4 years for chiplets and consumer HBM. This is what dead Moore's law looks like. You can only scale cost/perf with features, but you can only add Tensor Cores once. We are stuck. More soon!
26
56
709
@Tim_Dettmers
Tim Dettmers
1 year
Just a reminder that the default hyperparameters of LoRA perform poorly. You need to attach LoRA modules to all layers for it to perform as well as full fine-tuning. Once you do that, we find there is no difference between LoRA and full fine-tuning.
@Shahules786
Shahul Es
1 year
LoRA is not a drop-in replacement for Full Finetuning. Even though it reduces the compute requirements by 3x it comes with certain limitations. The data preparation needed for both is also different. 🔑 - LoRA requires much more data to converge compared to full FT. This can be
Tweet media one
21
61
419
17
113
673
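Roughly what "attach LoRA modules to all layers" means in PEFT terms; the `target_modules` names below are for LLaMA-style architectures and are an assumption, so check the layer names of your own model.

```python
from peft import LoraConfig

# Default configs often target only the attention query/value projections.
# Targeting every linear layer (attention + MLP) is what closes the gap to full fine-tuning.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)
```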
@Tim_Dettmers
Tim Dettmers
2 years
I got excited about a paper, implemented stuff, and then saw they cheated: (1) copied baseline results from another paper, (2) did much more hyperparameter tuning on their own method, (3) got accepted to EMNLP. The results look good, but their method is crap! Why waste people's time like this?
39
38
629
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my full deep learning hardware for the latest recommendations and advice. I reframed the blog post to help you avoid the most costly mistakes when you are building a deep learning machine.
12
158
515
@Tim_Dettmers
Tim Dettmers
12 days
Open-source models beating closed models will become more and more common. Scaling has diminishing returns. The best solution will not have the largest scale but the best approach or data. Especially with test-time compute, you do not need the best model to have the best solution.
@allen_ai
Ai2
12 days
Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it
50
272
1K
12
81
514
@Tim_Dettmers
Tim Dettmers
4 years
I am curious why people are not talking more about the OpenAI scaling law papers. For me, they seem very significant. What I heard so far: "Too complicated. I don't understand and I don't care", "NLP is not physics". Other criticism? Any insights why people ignore it?
23
68
454
@Tim_Dettmers
Tim Dettmers
4 years
New GPUs have arrived, and they come with GDDR6X! You can expect a ~45% speed increase with the RTX 3090 vs the RTX 2080 Ti. The 3-slot width is a problem though, as is the fan design. 4x RTX 2080 Ti >> 2x RTX 3090. 24GB mem is great, but the RTX 3080 with 10GB is not very useful.
20
40
428
@Tim_Dettmers
Tim Dettmers
1 year
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw Paper: Colab:
Tweet media one
4
99
432
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my GPU recommendation blog post! I included the RTX Titan and GTX 1660 Ti in my analysis. The analysis now separates word RNNs from char RNNs/Transformers. I also recommend TPUs for larger transformers/CNNs. This and more in the update:
6
128
428
@Tim_Dettmers
Tim Dettmers
7 years
Out of nowhere: far better translator than Google. Begs the question: Can Google be overtaken in search too? #dlearn
Tweet media one
20
231
387
@Tim_Dettmers
Tim Dettmers
1 year
Looking at the comments, some people missed the Guanaco-33B demo because it was added later: Big thanks to @huggingface for sponsoring this demo! The second thing I noticed was that people were a bit lost on how to use the adapters. So here is a tutorial 🧵
@Tim_Dettmers
Tim Dettmers
1 year
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark: Paper: Code+Demo: Samples: Colab:
Tweet media one
90
948
4K
11
73
392
@Tim_Dettmers
Tim Dettmers
25 days
Surprisingly many details here (for OpenAI-level secrecy) on how they built the model.
@OpenAI
OpenAI
25 days
Some of our researchers behind OpenAI o1 🍓
232
841
7K
12
27
397
@Tim_Dettmers
Tim Dettmers
2 years
This is the main driving assumption of my research and it is still holding up after 10 years: Humans are not special, scale is. The other main fact (sparsity): Humans are not special, but primates are. Only primates and birds have neuron counts that are not proportional to their body size.
@SilverVVulpes
Siberian fox🔸
2 years
People claimed the human brain was special relative to other primates in the size of the temporal lobes, involved in functions such as language. Newer data once again shows that no, the human brain is just a scaled up primate brain
Tweet media one
Tweet media two
64
211
1K
15
43
384
@Tim_Dettmers
Tim Dettmers
4 years
Turns out a lot of open-domain QA datasets have test set leakage. If you control for it, model performance drops by a mean absolute of 63%. Yikes! If we missed this for such a long time, I wonder if there are problems with other NLP datasets too.
4
97
374
@Tim_Dettmers
Tim Dettmers
8 months
@karpathy I have also seen this before. I think it's the psychology of material coming all at once that can be overwhelming for newcomers. If one builds things up bit by bit, there is not this overwhelming feeling of "this is too much; I am not good enough to learn this".
6
6
361
@Tim_Dettmers
Tim Dettmers
2 years
We ran +35,000 zero-shot experiments for our work on k-bit Inference Scaling Laws📈. A 30B 8-bit and 60B 4-bit LLM have the same model bits/inference latency, but different zero-shot accuracy. What is the best trade-off? The answer is clear: 4-bit is best
Tweet media one
6
66
348
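A quick back-of-envelope check of the "same model bits" comparison in the tweet (weights only, ignoring quantization constants):

```python
# Total bits stored for each configuration
bits_30b_8bit = 30e9 * 8   # = 2.4e11 bits
bits_60b_4bit = 60e9 * 4   # = 2.4e11 bits
print(bits_30b_8bit == bits_60b_4bit)  # True: same memory/latency budget, different zero-shot accuracy
```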
@Tim_Dettmers
Tim Dettmers
4 years
A friend asked me for a reference for why we did not increase the frequency further in CPUs and why parallelism was necessary to increase performance. This puts it quite bluntly (from ).
Tweet media one
4
65
343
@Tim_Dettmers
Tim Dettmers
2 years
Just finished the final update for the RTX 40 GPU blog post: - Performance/$ now includes Total Cost of Ownership in cost estimate (computer + 5y electricity) - Discussion Async copy vs. TMA - Small update on FP8 training - Font and figure improvements
Tweet media one
12
58
342
@Tim_Dettmers
Tim Dettmers
6 months
This is excellent work — a big step forward in quantization! It enables full 4-bit matmuls, which can speed up large batch inference by a lot. Anyone deploying LLMs at scale will soon use this or similar techniques.
@AshkboosSaleh
Saleh Ashkboos
6 months
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: Code:
Tweet media one
7
67
313
5
58
342
@Tim_Dettmers
Tim Dettmers
7 months
This model can automatically debug CUDA version errors. AGI achieved ✅😂
@cognition_labs
Cognition
7 months
3/4 Devin can train and fine tune its own AI models.
19
142
2K
5
27
336
@Tim_Dettmers
Tim Dettmers
6 years
This work presents strong and rigorous evidence that we should abandon RNNs and move on to using convolutions for sequence modeling. I have had similar experiences in other domains such as graph embeddings and knowledge compression. Definitely an important read!
7
77
328
@Tim_Dettmers
Tim Dettmers
1 year
The latest release of bitsandbytes has an improved CUDA setup and A100 4-bit inference. I thought that A40 and A100 GPUs were close enough, and optimized for A40s, but they are very different. A100 performance is now 40% faster with a small hit for other GPUs.
Tweet media one
9
54
322
@Tim_Dettmers
Tim Dettmers
6 years
I updated my guide with new GPU recommendations: RTX 2080 most cost-efficient choice. GTX 1080/1070 (+Ti) cards remain very good choices, especially as prices drop. Some discussion on TPUs/AWS — can be good in some cases.
Tweet media one
9
125
324
@Tim_Dettmers
Tim Dettmers
1 year
Did some optimizations for Ada/Ampere/Turing for 4-bit inference (bsz=1, arbitrary datatype e.g. NF4). It is now 3.71x, 3.13x, and 1.72x speedup vs 16-bit. The expected max would be 3.55x if NVIDIA kernels were 100% efficient. Will be released on Monday (no code change needed).
9
34
313
@Tim_Dettmers
Tim Dettmers
1 year
Continued pretraining with QLoRA is just around the corner! A second pretraining of models like Falcon-40B in 4-bit would be super-efficient.
@guitaricet
Vlad Lialin
1 year
Parameter-efficient fine-tuning revolutionized the accessibility of LLM fine-tuning, but can they also revolutionize pre-training? We present ReLoRA — the first PEFT method that can be used for training from scratch! 🔥🔥
Tweet media one
14
217
884
9
42
305
@Tim_Dettmers
Tim Dettmers
1 year
I never had time to do the proper bitsandbytes 4-bit release. The 0.39 release includes the 4-bit quantization variants and CUDA kernels, paged optimizers, Lion, as well as an important bugfix for a memory leak in 8-bit training/inference.
Tweet media one
7
34
298
@Tim_Dettmers
Tim Dettmers
9 months
@typedfemale Yes, it is a big problem. I really want to create a class for machine learning systems that also has an emphasis on CUDA programming for deep learning. So many people were interested in this. I will probably get on this once I finish the faculty application process.
7
7
303
@Tim_Dettmers
Tim Dettmers
10 months
Today, I will give a talk about "The making of QLoRA" at the LLM Efficiency Challenge at 2:30pm, Room 356. I will also talk a bit about how I go about doing research, running experiments and figuring out "what works".
13
26
302
@Tim_Dettmers
Tim Dettmers
1 year
A major bug in 8-bit optimizers that could cause some instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via `pip install -U bitsandbytes`. Now 8-bit optimizer should again reproduce 32-bit optimizer performance.
12
46
291
@Tim_Dettmers
Tim Dettmers
3 years
I have been working on 8-bit optimizers, and I am looking for testers for the initial release to test installation and ease of use. Uses up to 63% less GPU memory, faster/stabler training while maintaining performance. Currently, 8-bit Adam and 8-bit Momentum are supported. 1/5
Tweet media one
4
52
282
@Tim_Dettmers
Tim Dettmers
5 years
My new work with @LukeZettlemoyer on accelerated training of sparse networks from random weights to dense performance levels — no retraining required! Paper: Blog post: Code:
Tweet media one
2
81
277
@Tim_Dettmers
Tim Dettmers
11 months
Bitsandbytes now supports 4-bit store/load of any model. Load in 4-bit via `from_pretrained(name, ..., load_in_4bit=True, device_map='auto')`, then save/push the model to the hub. Get the newest bnb: `pip install -U bitsandbytes`. Implemented by Ruslan Svirschevski (gh: poedator).
4
38
274
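A minimal sketch of that save/load round trip, assuming recent transformers/bitsandbytes versions; the model name and output paths are placeholders.

```python
from transformers import AutoModelForCausalLM

model_id = "huggyllama/llama-7b"  # placeholder

# Load with 4-bit quantization, then save/push the already-quantized weights
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
model.save_pretrained("llama-7b-4bit")              # local 4-bit checkpoint
# model.push_to_hub("your-username/llama-7b-4bit")  # or push directly to the Hub
```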
@Tim_Dettmers
Tim Dettmers
2 years
We now have Int8 backprop support for all GPUs for bitsandbytes! Now available via 'pip install bitsandbytes'. This was a contribution from @sasha_borzunov . We will release Int8 fine-tuning for all @huggingface models soon — stay tuned!
6
40
270
@Tim_Dettmers
Tim Dettmers
2 years
The GPU blog post update is 90% done now. I think tomorrow morning, we will have an update! 🚀
7
5
260
@Tim_Dettmers
Tim Dettmers
9 months
We made a QLoRA promo video for @UWITNews . It is a very nice summary of the motivation behind QLoRA and what the environment was like to develop this research. @uwcse is a perfect place for doing such research! Article: Youtube:
3
44
256
@Tim_Dettmers
Tim Dettmers
1 year
@HamelHusain If you wait for another two weeks, we have something nice for you ;) With the right methods you can fine-tune a 30B model on that GPU. A 30B policy with 30B value function also works for RLHF.
16
15
251
@Tim_Dettmers
Tim Dettmers
9 months
The 0.42.0 bitsandbytes release adds 4-bit serialization, so you can save/load 4-bit weights directly. Otherwise, there are lots of bug fixes. Thank you, contributors! The next goal is Apple/AMD/Intel and Windows integration. We now have 1.5M installs per month.
Tweet media one
10
35
251
@Tim_Dettmers
Tim Dettmers
8 months
@willie_agnew Literally curing cancer. I talked to a biologist who used my methods in conjunction with open models to develop new methods for drug discovery. They developed drugs for previously incurable pediatric cancers. These are real wet lab in vitro results — it just works.
16
12
251
@Tim_Dettmers
Tim Dettmers
9 months
An excellent end-to-end guide for finetuning. It has all the details from data prep to deployment. If you want to finetune, this is a great resource to get started.
@_philschmid
Philipp Schmid
9 months
What's the best way to fine-tune open LLMs in 2024? Look no further! 👀 I am excited to share “How to Fine-Tune LLMs in 2024 with Hugging Face” using the latest research techniques, including Flash Attention, Q-LoRA, @OpenAI dataset formats (messages), ChatML, Packing, all built
20
171
852
3
45
243
@Tim_Dettmers
Tim Dettmers
5 years
This is really great work! For layer 5 pyramidal neurons: A dendritic branch = MLP with 1 layer, 4 units; the entire neuron = MLP with 7 layers, 128 units each. One bio neuron > most MNIST models. We have about 85bn neurons in total and >1tn dendrites — that is a lot of compute!
@DavidBeniaguev
David Beniaguev
5 years
A story of a Cortical Neuron as a Deep Artificial Neural Net: 1) Neurons in the brain are bombarded with massive synaptic input distributed across a large tree like structure - its dendritic tree. During this bombardment, the tree goes wild preprint:
23
659
2K
7
66
241
@Tim_Dettmers
Tim Dettmers
8 years
I updated my GPU advice blog post with the GTX 1080 Ti; also cleaned it so it is easier to find relevant information
2
98
240
@Tim_Dettmers
Tim Dettmers
2 years
FP8 training works well and has large benefits. It has steep networking requirements to achieve good utilization but there are solutions to that too (we will release one in the next days). It's a big shift and everyone with RTX 40s / H100 GPUs should look into FP8 training.
@NamanGoyal21
Naman Goyal
2 years
It's crazy that, at 60% model FLOPS (FP8) utilization on H100s, the original GPT-3 configuration can be trained in 3 days on 1024 H100s and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM released 9 months back.
13
35
390
7
20
226
@Tim_Dettmers
Tim Dettmers
3 years
I am a huge fan of einsum notation. Here is a multi-layer transformer in a couple lines of code (without norms though). I think it's simple to read, but whenever I show this to somebody in excitement they do not like it. I am curious: How is that for you? Easy to read or not?
Tweet media one
30
22
236
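The code image from this tweet is not preserved here; below is a small sketch in the same spirit (not the original code): multi-head self-attention written entirely with einsum, without norms, masking, or dropout.

```python
import torch

def einsum_attention(x, Wq, Wk, Wv, Wo):
    """x: [batch, seq, d_model]; Wq/Wk/Wv: [heads, d_model, d_head]; Wo: [heads, d_head, d_model]."""
    q = torch.einsum("bsd,hde->bhse", x, Wq)
    k = torch.einsum("bsd,hde->bhse", x, Wk)
    v = torch.einsum("bsd,hde->bhse", x, Wv)
    scores = torch.einsum("bhse,bhte->bhst", q, k) / q.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bhst,bhte->bhse", attn, v)
    return x + torch.einsum("bhse,hed->bsd", out, Wo)  # residual connection

batch, seq, d_model, heads, d_head = 2, 16, 64, 4, 16
x = torch.randn(batch, seq, d_model)
Wq, Wk, Wv = (torch.randn(heads, d_model, d_head) for _ in range(3))
Wo = torch.randn(heads, d_head, d_model)
print(einsum_attention(x, Wq, Wk, Wv, Wo).shape)  # torch.Size([2, 16, 64])
```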
@Tim_Dettmers
Tim Dettmers
1 year
I think it will take another day or two for the full integration, but the kernels (batch size 1) are ready.
Tweet media one
5
15
228
@Tim_Dettmers
Tim Dettmers
1 year
If you are merging adapters with QLoRA 4-bit weights, please use the gist below for merging. This will increase the performance of the QLoRA model. I think I have seen a PR on PEFT, so this will soon come to PEFT by default, but for now, it's better to merge it in this way.
@chris_hayduk1
Chris Hayduk
1 year
Just put together a gist for merging QLoRA with the quantized model weights as mentioned by @Tim_Dettmers @Teknium1 @erhartford Since I know you guys were looking into it. Should be able to quantize the whole thing after this without issue
5
13
92
2
26
221
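The gist itself is not reproduced here; a rough sketch of the underlying idea is to merge the LoRA delta into the dequantized 4-bit weights (the weights the adapter was actually trained against) rather than into the original 16-bit checkpoint. `dequantize_4bit` and the attribute names below may differ across bitsandbytes/peft versions, so treat this as an assumption-laden illustration rather than the gist's code.

```python
import torch
import bitsandbytes.functional as F

def merge_lora_into_4bit_linear(linear4bit, lora_A, lora_B, scaling):
    """Merge a LoRA update into the weights the 4-bit model actually used.

    linear4bit: a bnb Linear4bit layer; lora_A: [r, in], lora_B: [out, r].
    Returns a dense fp16 weight: dequant(W_4bit) + scaling * (B @ A).
    """
    # Dequantize the stored 4-bit weight back to fp16; merging here is more faithful
    # than merging into the original fp16 checkpoint the quantization started from.
    w = F.dequantize_4bit(linear4bit.weight.data, linear4bit.weight.quant_state).to(torch.float16)
    return w + scaling * (lora_B.to(torch.float16) @ lora_A.to(torch.float16))
```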
@Tim_Dettmers
Tim Dettmers
1 year
I forgot how much better Guanaco-65B is compared to 33B. You can try here via Petals (globally distributed inference): With Petals, you can also run a 65B model in a colab or locally on a small GPU at ~5 tokens/sec (see below).
@m_ryabinin
Max Ryabinin
1 year
@fernavid @Tim_Dettmers We've updated one just today: this notebook shows you: - how to run a 65B model from Colab, - how to plug in adapters between its layers, - and how to write custom generation methods (you can't do this with an API)
3
42
246
2
36
219
@Tim_Dettmers
Tim Dettmers
2 years
bitsandbytes is on track to surpass half a million pip installs this month! Upcoming features: - LLM.int8() support for all GPUs - Int8 backward for fine-tuning - Fast 4-bit float (FP4) kernels for inference Always looking for more people to get involved. There is lots to do!
6
15
213
@Tim_Dettmers
Tim Dettmers
1 year
The 0.38.0 release of bitsandbytes introduces: - 8-bit Lion which is 8x more memory efficient than standard Adam - Serialization of 8-bit layers now allows storing/loading 8-bit models to/from the HF Hub We are now at half a million installs per month!
Tweet media one
6
39
205
@Tim_Dettmers
Tim Dettmers
4 years
I have my first draft right now: 7,000 words. It will be quite a comprehensive post. If you have any more GPU-related questions for me, right now is the last chance for them to be added. I will freeze the draft tonight, rewrite tomorrow, and then publish it on Monday.
15
13
203
@Tim_Dettmers
Tim Dettmers
6 years
I am very excited and proud to announce that I will join UW as a PhD student this fall. I will work with Yejin Choi on common sense knowledge and reasoning. I believe that with common sense, intelligent machines will be able to benefit everyone equally.
13
3
203
@Tim_Dettmers
Tim Dettmers
4 years
Just working on an update to my GPU recommendation blog post. Will cover the new GPUs and focus on the cost-effectiveness of 2/4/8 GPU systems. Any other things that you would like to see discussed?
19
11
197
@Tim_Dettmers
Tim Dettmers
11 months
This model flew under the radar. It has the highest MMLU score of any open-source model. I have not tried it myself but I am curious how it compares to other models when evaluated across a broad range of tasks. Can somebody give it a try?
@01AI_Yi
Yi-01.AI
11 months
Our team at @01AI_Yi is very proud to introduce the release of the Yi-34B model, now on top of the @huggingface pretrained LLM leaderboard! A Yi-6B is also available. You are welcome to give it a try and build fantastic projects!
6
25
163
12
12
198
@Tim_Dettmers
Tim Dettmers
3 years
An important but elusive quality to learn in a PhD is research style. It is valuable to be aware of this before you start a PhD. Among other updates, I added an extensive discussion on research style to my "choosing a grad school" blog post. Enjoy!
4
36
187
@Tim_Dettmers
Tim Dettmers
4 years
Making good progress on the updated GPU recommendation blog post. Have almost all data crunched and it seems that I will have pretty accurate estimates of performance. Will probably publish it Friday morning. If you have any more questions that I should include, let me know!
12
8
186
@Tim_Dettmers
Tim Dettmers
1 year
Guanaco-33B holds up well. Controlled for the memory footprint, it's the best model. Since it was trained in 4-bit, it uses as much memory as a regular 7B model. The memory needed during fine-tuning is 17x less, so a 7B model is much more expensive to fine-tune than Guanaco 33B.
@lmarena_ai
lmarena.ai (formerly lmsys.org)
1 year
We are excited to announce the first major release of the Chatbot Arena conversation dataset! - 33K conversations with pairwise human preferences - 20 SOTA models such as GPT-4, Claude, and LLaMA-based Vicuna - From 13K unique IPs in the wild - An additional 3K expert-level
Tweet media one
Tweet media two
14
176
724
4
24
187
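A rough back-of-envelope for the memory-footprint claim (weights only, ignoring quantization constants and the KV cache):

```python
gb = lambda params, bits: params * bits / 8 / 1e9

print(gb(33e9, 4))   # ~16.5 GB: Guanaco-33B weights stored in 4-bit
print(gb(7e9, 16))   # ~14.0 GB: a regular 7B model in 16-bit, roughly the same ballpark
```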
@Tim_Dettmers
Tim Dettmers
1 year
One thing that I care about in bitsandbytes is to provide _broad_ accessibility to LLMs. GPUs up to 9 years old are supported by 4-bit inference in bitsandbytes and you will see good speedups.
@Tim_Dettmers Wow! This just gave Volta cards a new lease on life: Testing with 4xV100S and a 30B~ model. Got a 3.2x speedup! 7-8 tokens per second is very usable for an interactive chat experience.
Tweet media one
0
1
19
4
16
186
@Tim_Dettmers
Tim Dettmers
7 years
Deep learning hardware limbo is the battle between @Nvidia vs @AMD vs @IntelNervana for the throne of deep learning hardware. Learn who might win and why #dlearn #nlproc #ai
6
82
180
@Tim_Dettmers
Tim Dettmers
3 years
Going to write another GPU blog post update in the coming days. Are there any GPU questions that you would like to have answered? Will include popular Q&A in the blog post.
30
10
184
@Tim_Dettmers
Tim Dettmers
1 year
Below, some problems with QLoRA are highlighted (I should not have been so smirky 😅). I want to acknowledge some of these issues but also resolve others. We integrated our QLoRA codebase with 5 other open-source codebases before release, and it seems we created some issues along the way 🧵
@Tim_Dettmers
Tim Dettmers
1 year
No cons :)
8
7
86
1
28
182
@Tim_Dettmers
Tim Dettmers
6 years
A really nice blog post by @agrinh about recent progress in GANs and variational autoencoders. Gives a short overview about GANs and their problems and then dives deep into the newest methods from ICML2018.
0
47
182
@Tim_Dettmers
Tim Dettmers
5 years
After talking to many students about their grad school experience I compiled this blog post on "How to pick your grad school". I discuss all the important factors and details from contrasting but complementary perspectives. I hope it will be helpful!
6
46
178
@Tim_Dettmers
Tim Dettmers
4 years
We have confirmation that Tensor Cores in RTX 30 GPUs will be limited to make Quadro / Tesla cards more attractive for deep learning. This is the same as in the RTX 20 series. I will update my performance figures later today and will post an update.
7
29
171
@Tim_Dettmers
Tim Dettmers
2 months
@CarnegieMellon won me over. It is an amazing place. Highly collaborative, very collegial, close-knit, with excellent students and great support. Looking forward to my time there! I will take 2-3 PhD students for Fall 2025. Please apply to the CMU PhD program to work with me.
7
9
178
@Tim_Dettmers
Tim Dettmers
3 years
This is pretty significant for custom CUDA code. Even with years of CUDA experience, it is very difficult to write peak performance matrix multiplication code. CUTLASS is great, but it seems Triton has better performance, is more customizable, and you can write code in Python.
@OpenAI
OpenAI
3 years
We’re releasing Triton 1.0, an open-source Python-like programming language for writing efficient GPU code. OpenAI researchers with no GPU programming experience have used Triton to produce kernels that are 2x faster than their PyTorch equivalents.
117
1K
6K
4
19
175
@Tim_Dettmers
Tim Dettmers
2 years
I am currently preparing a new GPU blog post updated for the RTX 4090 etc. I am collecting some Q&A questions. If you have any questions that you would like me to answer in the blog post, please leave them here as a comment.
26
9
170
@Tim_Dettmers
Tim Dettmers
10 months
@srush_nlp @4evaBehindSOTA Regular transformers are notoriously difficult to sparsify; this is even true for the FFN layers in MoE transformers. But MoE layers are very different. You can also quantize them to 1 bit without any problem, but sparsification gives you better memory benefits than 1-bit quantization.
5
22
169
@Tim_Dettmers
Tim Dettmers
1 year
Just pushed a major CUDA-related update to pip for bnb. I need feedback because it's so difficult to test CUDA envs. It will either fix 90% of all CUDA issues, or fix 90% of issues and create many new ones 🫠. Please let me know if it works for you. I am ready to hotfix things.
7
15
168
@Tim_Dettmers
Tim Dettmers
2 months
The six months on the academic job market were brutal but also very successful. More than 125 individual interviews across 17 universities leading to 15 job offers. It was a unique experience for which I am very grateful. I will write up my learnings and insights soon.
4
0
168
@Tim_Dettmers
Tim Dettmers
6 years
It seems that the first data suggest that RTX 2080 Ti deep learning performance is very close to Titan V performance. Also key facts: Tensor Cores are programmable and NVLink can be used for data (+50 GB/s). NVLink makes PCIe lanes obsolete for parallelism.
8
42
167
@Tim_Dettmers
Tim Dettmers
1 year
Catch my posters today: SWARM parallelism (fault tolerant globally distributed): 11am, slot 217 k-bit Inference Scaling Laws (foundation of QLoRA and SpQR): 2pm, slot 824
0
34
166
@Tim_Dettmers
Tim Dettmers
2 years
Bitsandbytes now supports CUDA 12.1. Upcoming features are 8-bit Lion (contributed by lucidrains) and serialization of 8-bit layers (contributed by @m_ryabinin ) with which you can upload 8-bit models to @huggingface . Bitsandbytes is growing rapidly at 150k pip installs a week!
4
33
163
@Tim_Dettmers
Tim Dettmers
2 years
An update to our k-bit inference scaling laws paper: + Includes results for 175B OPT/BLOOM + Short analysis of scaling behavior of GPTQ + Better related work + Main takeaway: input-dependent quantization like GPTQ might unlock less than 4-bits.
3
40
162
@Tim_Dettmers
Tim Dettmers
3 years
Here is a fun fact: You can interpret Adam as the exponentially smoothed signal-to-noise ratio with a separate sign matrix (the direction of momentum). In other words: the largest Adam updates go to the parameters whose gradients have the largest signal and lowest noise over time. How? 1/n
1
13
160
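A compact way to state the claim, as a sketch that ignores bias correction and the epsilon term:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \approx \mathbb{E}[g], \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \approx \mathbb{E}[g^2]
```
```latex
\Delta\theta_t = -\eta\,\frac{m_t}{\sqrt{v_t}}
              \approx -\eta\,\operatorname{sign}\!\big(\mathbb{E}[g]\big)\,
              \frac{|\mathbb{E}[g]|}{\sqrt{\mathbb{E}[g^2]}}
```

The step magnitude is the exponentially smoothed ratio of the gradient's mean to its second moment, which is large when the gradient signal is consistent and small when it is noise-dominated, while the sign of the momentum carries the direction.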
@Tim_Dettmers
Tim Dettmers
3 months
The numbers on these inference GPU benchmarks seem low. Here are the theoretical values from my model of 8xB200 inference with NVLink, 8-bit weights, and a 70B Llama model, which come out closer to 300k tokens/s. This assumes perfect implementations (close to what OpenAI/Anthropic has).
Tweet media one
@Etched
Etched
3 months
Meet Sohu, the fastest AI chip of all time. With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s. Sohu is the first specialized chip (ASIC) for transformer models. By specializing,
Tweet media one
343
1K
6K
3
13
161
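A hedged back-of-envelope for the compute-bound, large-batch regime. The numbers are rough assumptions rather than quoted specs: roughly 4.5 PFLOPS of dense FP8 per B200, 2 FLOPs per parameter per generated token, and perfect utilization and overlap.

```python
flops_per_gpu = 4.5e15         # assumed dense FP8 throughput per B200 (rough figure)
num_gpus = 8
params = 70e9                  # Llama-70B
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token in the forward pass

tokens_per_second = num_gpus * flops_per_gpu / flops_per_token
print(f"{tokens_per_second:,.0f} tokens/s")  # ≈ 257,000, the same order as the ~300k figure above
```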
@Tim_Dettmers
Tim Dettmers
10 months
Looking forward to #NeurIPS2023 ! Would love to catch up with friends and colleagues. Please like this/DM me if you are coming too. Will give our QLoRA oral Tue, Dec 12 3:55pm. I am also on the job market this year — please promote my app at your university! More coming soon :)
5
8
161
@Tim_Dettmers
Tim Dettmers
6 years
Estimating performance and performance/dollar for GPUs has never been more complicated: Tensor Cores, 16-bit/32-bit, 3 generations of GPUs. However, I think these new charts will be quite accurate. Will update my blog post in the next days with more info on TPUs etc.
Tweet media one
Tweet media two
14
38
158
@Tim_Dettmers
Tim Dettmers
7 months
This is very exciting! Neural Magic has the best CPU kernels for LLM inference and now they are expanding to GPUs. Curious about their performance!
@neuralmagic
Neural Magic
7 months
Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. Details 👇
4
37
239
4
16
157
@Tim_Dettmers
Tim Dettmers
1 year
@abacaj When I designed the 8-bit routines, I designed them for training. They were never meant for inference; that is why it's slow. I have something designed for inference coming in the next two weeks which will be 5x faster than bf16.
6
14
155
@Tim_Dettmers
Tim Dettmers
1 year
The updated Petals is very exciting. Run a 65B model at 5 tokens/s using Colab. You can also fine-tune 100B+ models using Colab. The more people adopt Petals, the easier and faster it will be to work with large models with minimal resources.
@m_ryabinin
Max Ryabinin
1 year
I've also written a slightly longer blog post about Petals, covering both alternative methods for LLM inference and details of using this system from the client/server perspective. If you wanted to have a high-level overview of Petals in a non-thread form, this post is for you!
1
8
58
5
21
152
@Tim_Dettmers
Tim Dettmers
1 year
Soon ...
Tweet media one
2
8
143
@Tim_Dettmers
Tim Dettmers
1 year
At ICML I will be presenting: - k-bit Inference Scaling Laws - SWARM Parallelism I will also present a keynote at the ES-FoMo workshop. Please get in touch if you are around and want to catch up or meet. (particularly, would love to meet those I follow but have not met before)
1
6
141
@Tim_Dettmers
Tim Dettmers
1 year
This is a big achievement in unifying vision and language to produce high-quality models. In CM3Leon, everything is tokens, and you just train a decoder model on both image and language tokens. In the future, the only training procedure you need will be LLM-style pretraining.
@ArmenAgha
Armen Aghajanyan
1 year
I’m excited to release our most recent work setting a new SOTA FID of 4.88 on text-to-image generation we call CM3Leon (pronounced chameleon)!
Tweet media one
25
128
482
3
23
139
@Tim_Dettmers
Tim Dettmers
1 year
My slurm jobs right now:
Tweet media one
2
7
141
@Tim_Dettmers
Tim Dettmers
1 year
This is a very interesting analysis by @davis_yoshida about the NF4 data type from QLoRA. I discussed this with Davis, and he pointed out some issues which led to very curious findings: NF4 is not theoretically optimal but close to empirically optimal.🧵
@davis_yoshida
Davis Yoshida
1 year
I spent my memorial day weekend doing some thinking about quantization based on @TimDettmers et al's work: tl;dr: 1. 4-bit dtypes should vary with quantization block size 2. NF4 doesn't use code values uniformly 3. #2 is actually a good thing
1
24
91
2
14
140
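For intuition about what a quantile-based 4-bit code is, here is a rough sketch. This is not the exact NF4 construction from the paper (which handles zero and the tails asymmetrically); it just places 16 code values at evenly spaced quantiles of a standard normal and rescales them to [-1, 1].

```python
import numpy as np
from scipy.stats import norm

# 16 code values at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
# Weights (assumed roughly normal per block) are then mapped to the nearest code value.
probs = np.linspace(0, 1, 18)[1:-1]   # drop 0 and 1 to keep the quantiles finite
code = norm.ppf(probs)
code = code / np.abs(code).max()
print(np.round(code, 3))
```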
@Tim_Dettmers
Tim Dettmers
1 year
@Teknium1 @yacineMTB @abacaj If we run the big models that come soon in 3-bit with QLoRA, you can quickly tune to anything you want. It will be so fast, you can adapt a model while you chat with it about a task/sub-domain. In the next session, you can either stack or learn a new adapter. It will be awesome!
3
17
137
@Tim_Dettmers
Tim Dettmers
5 months
@karpathy @natfriedman I wrote an entire blog post about this: The main question is whether dendritic spikes contain information and, if so, how much. By today's evidence, it is pretty clear that dendritic spikes are important and a biological neuron is like a multi-layer MLP. With
6
11
137
@Tim_Dettmers
Tim Dettmers
4 years
Last year I wrote a comprehensive blog post about choosing a grad school for a PhD. People found it very useful -- even for thinking about postdoc and faculty positions. I hope it will also be helpful for this year's students to make this tough decision:
3
31
138