Tim Dettmers

@Tim_Dettmers

31,803
Followers
902
Following
126
Media
3,213
Statuses

Research Scientist @allen_ai and incoming professor @CarnegieMellon. I blog about deep learning and PhD life at .

Seattle, WA
Joined October 2012
Pinned Tweet
@Tim_Dettmers
Tim Dettmers
2 months
After 7 months on the job market, I am happy to announce:
- I joined @allen_ai
- Professor at @CarnegieMellon from Fall 2025
- New bitsandbytes maintainer @Titus_vK
My main focus will be to strengthen open-source for real-world problems and bring the best AI to laptops 🧵
152
85
2K
@Tim_Dettmers
Tim Dettmers
1 year
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark: Paper: Code+Demo: Samples: Colab:
Tweet media one
90
948
4K
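For context, a minimal sketch of what 4-bit QLoRA finetuning looks like with the Hugging Face stack (transformers, peft, bitsandbytes). The model name and LoRA hyperparameters below are placeholders, not necessarily the paper's exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder; any causal LM works

# Load the frozen base model in 4-bit NF4 with bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```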
@Tim_Dettmers
Tim Dettmers
3 years
I am excited to share my latest work: 8-bit optimizers – a replacement for regular optimizers. Faster 🚀, 75% less memory 🪶, same performance📈, no hyperparam tuning needed 🔢. 🧵/n Paper: Library: Video:
Tweet media one
18
283
1K
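A minimal usage sketch of the drop-in swap described above, assuming a CUDA machine with bitsandbytes installed; the toy model and hyperparameters are placeholders.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model

# Swap torch.optim.Adam for the 8-bit version; the call signature is the same
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)

loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()       # optimizer states are stored in 8-bit, saving most of the memory
optimizer.zero_grad()
```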
@Tim_Dettmers
Tim Dettmers
1 year
@karpathy Super excited to push this even further: - Next week: bitsandbytes 4-bit closed beta that allows you to finetune 30B/65B LLaMA models on a single 24/48 GB GPU (no degradation vs full fine-tuning in 16-bit) - Two weeks: Full release of code, paper, and a collection of 65B models
39
193
1K
@Tim_Dettmers
Tim Dettmers
1 year
We present SpQR, which allows lossless LLM inference at 4.75 bits with a 15% speedup. You can run a 33B LLM on a single 24GB GPU fully lossless. SpQR works by isolating sensitive weights with higher precision and roughly doubles improvements from GPTQ: 🧵
Tweet media one
36
302
1K
@Tim_Dettmers
Tim Dettmers
2 years
We release LLM.int8(), the first 8-bit inference method that saves 2x memory and does not degrade performance for 175B models by exploiting emergent properties. Read More: Paper: Software: Emergence:
Tweet media one
17
250
1K
@Tim_Dettmers
Tim Dettmers
5 years
How can you successfully train transformers on small datasets like PTB and WikiText-2? Are LSTMs better on small datasets? I ran 339 experiments worth 568 GPU hours and came up with some answers. I do not have time to write a blog post, so here is a Twitter thread instead. 1/n
16
311
1K
@Tim_Dettmers
Tim Dettmers
2 years
We release the public beta for bnb-int8🟪 for all @huggingface 🤗models, which allows for Int8 inference without performance degradation up to scales of 176B params 📈. You can run OPT-175B/BLOOM-176B easily on a single machine 🖥️. You can try it here: 1/n
Tweet media one
27
228
923
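A minimal sketch of what the Int8 integration looks like from the Hugging Face side; the model name is a placeholder, and depending on your transformers version the 8-bit flag may need to go through a `BitsAndBytesConfig` instead of the `load_in_8bit` shortcut.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-6.7b"  # placeholder; the release demos OPT-175B/BLOOM-176B

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Weights are quantized to int8 at load time; outlier features are handled in 16-bit internally
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

inputs = tokenizer("The new bitsandbytes release", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```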
@Tim_Dettmers
Tim Dettmers
4 years
Updated GPU recommendations for the new Ampere RTX 30 series are live! Performance benchmarks, architecture details, Q&A of frequently asked questions, and detailed explanations of how GPUs and Tensor Cores work for those that want to learn more:
Tweet media one
31
252
890
@Tim_Dettmers
Tim Dettmers
2 years
In the RTX 40 post, I introduce a GPU recommendation chart and discuss the new Tensor Memory Accelerator (TMA) and FP8 computation. Overall, RTX 40s are faster for inference and shine through their FP8 performance but are inefficient for 16-bit training.
Tweet media one
38
171
886
@Tim_Dettmers
Tim Dettmers
1 year
The 4-bit bitsandbytes private beta is here! Our method, QLoRA, is integrated with the HF stack and supports all models. You can finetune a 65B model on a single 48 GB GPU. This beta will help us catch bugs and issues before our full release. Sign up:
25
158
868
@Tim_Dettmers
Tim Dettmers
1 year
The result of long days of CUDA optimizations: the new bitsandbytes release includes 4-bit inference, which is up to 4.2x faster than 16-bit inference (bsz=1). Full HF integration for all models. No code change needed. Bnb is growing rapidly, just shy of 1M installs/month🧵
Tweet media one
24
144
865
@Tim_Dettmers
Tim Dettmers
2 years
Finished RTX 4090 modeling ...not good 😐. If you have an RTX 3090, probably best to wait 4 years for chiplets and consumer HBM. This is what dead Moore's law looks like. You can only scale cost/perf with features, but you can only add Tensor Cores once. We are stuck. More soon!
26
56
709
@Tim_Dettmers
Tim Dettmers
1 year
Just a reminder that the default hyperparameters of LoRA perform poorly. You need to attach LoRA modules to all layers for it to perform as well as full fine-tuning. Once you do that, we find there is no difference between LoRA and full fine-tuning.
@Shahules786
Shahul Es
1 year
LoRA is not a drop-in replacement for Full Finetuning. Even though it reduces the compute requirements by 3x it comes with certain limitations. The data preparation needed for both is also different. 🔑 - LoRA requires much more data to converge compared to full FT. This can be
Tweet media one
21
61
419
17
113
673
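Roughly what "attach LoRA modules to all layers" means in PEFT terms; the `target_modules` names below are for LLaMA-style architectures and are an assumption, so check the layer names of your own model.

```python
from peft import LoraConfig

# Default configs often target only the attention query/value projections.
# Targeting every linear layer (attention + MLP) is what closes the gap to full fine-tuning.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)
```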
@Tim_Dettmers
Tim Dettmers
2 years
I got excited about a paper, implemented stuff, and then saw they cheated: (1) copied baseline results from another paper, (2) did much more hyperparameter tuning on their own method, (3) got accepted to EMNLP. The results look good, but their method is crap! Why waste people's time like this?
39
38
629
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my full deep learning hardware for the latest recommendations and advice. I reframed the blog post to help you avoid the most costly mistakes when you are building a deep learning machine.
12
158
515
@Tim_Dettmers
Tim Dettmers
12 days
Open-source models beating closed models will become more and more common. Scaling has diminishing returns. The best solution will not have the largest scale but the best approach or data. Especially with test-time compute, you do not need the best model to have the best solution.
@allen_ai
Ai2
12 days
Meet Molmo: a family of open, state-of-the-art multimodal AI models. Our best model outperforms proprietary systems, using 1000x less data. Molmo doesn't just understand multimodal data—it acts on it, enabling rich interactions in both the physical and virtual worlds. Try it
50
272
1K
12
81
514
@Tim_Dettmers
Tim Dettmers
4 years
I am curious why people are not talking more about the OpenAI scaling law papers. For me, they seem very significant. What I heard so far: "Too complicated. I don't understand and I don't care", "NLP is not physics". Other criticism? Any insights why people ignore it?
23
68
454
@Tim_Dettmers
Tim Dettmers
4 years
New GPUs have arrived, and they come with GDDR6X! You can expect a ~45% speed increase with the RTX 3090 vs the RTX 2080 Ti. The 3-slot width is a problem though, as is the fan design. 4x RTX 2080 Ti >> 2x RTX 3090. 24GB mem is great, but the RTX 3080 with 10GB is not very useful.
20
40
428
@Tim_Dettmers
Tim Dettmers
1 year
Our work on loss spikes and stable 8-bit CLIP training is the largest Int8 training to date (1B). We introduce the SwitchBack layers and StableAdamW to ensure stability at these scales. Work with the awesome @Mitchnw Paper: Colab:
Tweet media one
4
99
432
@Tim_Dettmers
Tim Dettmers
6 years
I just updated my GPU recommendation blog post! I included the RTX Titan and GTX 1660 Ti in my analysis. The analysis now separates word RNNs from char RNNs/Transformers. I also recommend TPUs for larger transformers/CNNs. This and more in the update:
6
128
428
@Tim_Dettmers
Tim Dettmers
7 years
Out of nowhere: far better translator than Google. Begs the question: Can Google be overtaken in search too? #dlearn
Tweet media one
20
231
387
@Tim_Dettmers
Tim Dettmers
1 year
Looking at the comments, some people missed the Guanaco-33B demo because it was added later: Big thanks to @huggingface for sponsoring this demo! The second thing I noticed was that people were a bit lost on how to use the adapters. So here is a tutorial 🧵
@Tim_Dettmers
Tim Dettmers
1 year
QLoRA: 4-bit finetuning of LLMs is here! With it comes Guanaco, a chatbot on a single GPU, achieving 99% ChatGPT performance on the Vicuna benchmark: Paper: Code+Demo: Samples: Colab:
Tweet media one
90
948
4K
11
73
392
@Tim_Dettmers
Tim Dettmers
25 days
Surprisingly many details here (for OpenAI-level secrecy) on how they built the model.
@OpenAI
OpenAI
25 days
Some of our researchers behind OpenAI o1 🍓
232
841
7K
12
27
397
@Tim_Dettmers
Tim Dettmers
2 years
This is the main driving assumption of my research and it is still holding up after 10 years: Humans are not special, scale is. The other main fact (sparsity): Humans are not special, but primates are. Only primates and birds have neuron counts that are not proportional to their body size.
@SilverVVulpes
Siberian fox🔸
2 years
People claimed the human brain was special relative to other primates in the size of the temporal lobes, involved in functions such as language. Newer data once again shows that no, the human brain is just a scaled up primate brain
Tweet media one
Tweet media two
64
211
1K
15
43
384
@Tim_Dettmers
Tim Dettmers
4 years
Turns out a lot of open-domain QA datasets have test set leakage. If you control for it, model performance drops by a mean absolute of 63%. Yikes! If we missed this for such a long time, I wonder if there are problems with other NLP datasets too.
4
97
374
@Tim_Dettmers
Tim Dettmers
8 months
@karpathy I have also seen this before. I think it's the psychology of material coming all at once that can be overwhelming for newcomers. If one builds things up bit by bit, there is not this overwhelming feeling of "this is too much; I am not good enough to learn this".
6
6
361
@Tim_Dettmers
Tim Dettmers
2 years
We ran +35,000 zero-shot experiments for our work on k-bit Inference Scaling Laws📈. A 30B 8-bit and 60B 4-bit LLM have the same model bits/inference latency, but different zero-shot accuracy. What is the best trade-off? The answer is clear: 4-bit is best
Tweet media one
6
66
348
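A quick back-of-envelope check of the "same model bits" comparison in the tweet (weights only, ignoring quantization constants):

```python
# Total bits stored for each configuration
bits_30b_8bit = 30e9 * 8   # = 2.4e11 bits
bits_60b_4bit = 60e9 * 4   # = 2.4e11 bits
print(bits_30b_8bit == bits_60b_4bit)  # True: same memory/latency budget, different zero-shot accuracy
```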
@Tim_Dettmers
Tim Dettmers
4 years
A friend asked me for a reference for why we did not increase the frequency further in CPUs and why parallelism was necessary to increase performance. This puts it quite bluntly (from ).
Tweet media one
4
65
343
@Tim_Dettmers
Tim Dettmers
2 years
Just finished the final update for the RTX 40 GPU blog post: - Performance/$ now includes Total Cost of Ownership in cost estimate (computer + 5y electricity) - Discussion Async copy vs. TMA - Small update on FP8 training - Font and figure improvements
Tweet media one
12
58
342
@Tim_Dettmers
Tim Dettmers
6 months
This is excellent work — a big step forward in quantization! It enables full 4-bit matmuls, which can speed up large batch inference by a lot. Anyone deploying LLMs at scale will soon use this or similar techniques.
@AshkboosSaleh
Saleh Ashkboos
6 months
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: Code:
Tweet media one
7
67
313
5
58
342
@Tim_Dettmers
Tim Dettmers
7 months
This model can automatically debug CUDA version errors. AGI achieved ✅😂
@cognition_labs
Cognition
7 months
3/4 Devin can train and fine tune its own AI models.
19
142
2K
5
27
336
@Tim_Dettmers
Tim Dettmers
6 years
This work presents strong and rigorous evidence that we should abandon RNNs and move on to using convolutions for sequence modeling. I have had similar experiences in other domains such as graph embeddings and knowledge compression. Definitely an important read!
7
77
328
@Tim_Dettmers
Tim Dettmers
1 year
The latest release of bitsandbytes has an improved CUDA setup and A100 4-bit inference. I thought that A40 and A100 GPUs were close enough, and optimized for A40s, but they are very different. A100 performance is now 40% faster with a small hit for other GPUs.
Tweet media one
9
54
322
@Tim_Dettmers
Tim Dettmers
6 years
I updated my guide with new GPU recommendations: RTX 2080 most cost-efficient choice. GTX 1080/1070 (+Ti) cards remain very good choices, especially as prices drop. Some discussion on TPUs/AWS — can be good in some cases.
Tweet media one
9
125
324
@Tim_Dettmers
Tim Dettmers
1 year
Did some optimizations for Ada/Ampere/Turing for 4-bit inference (bsz=1, arbitrary datatype e.g. NF4). It is now 3.71x, 3.13x, and 1.72x speedup vs 16-bit. The expected max would be 3.55x if NVIDIA kernels were 100% efficient. Will be released on Monday (no code change needed).
9
34
313
@Tim_Dettmers
Tim Dettmers
1 year
Continued pretraining with QLoRA is just around the corner! A second pretraining of models like Falcon-40B in 4-bit would be super-efficient.
@guitaricet
Vlad Lialin
1 year
Parameter-efficient fine-tuning revolutionized the accessibility of LLM fine-tuning, but can they also revolutionize pre-training? We present ReLoRA — the first PEFT method that can be used for training from scratch! 🔥🔥
Tweet media one
14
217
884
9
42
305
@Tim_Dettmers
Tim Dettmers
1 year
I never had time to do the proper bitsandbytes 4-bit release. The 0.39 release includes the 4-bit quantization variants and CUDA kernels, paged optimizers, Lion, as well as an important bugfix for a memory leak in 8-bit training/inference.
Tweet media one
7
34
298
@Tim_Dettmers
Tim Dettmers
9 months
@typedfemale Yes, it is a big problem. I really want to create a class for machine learning systems that also has an emphasis on CUDA programming for deep learning. So many people were interested in this. I will probably get on this once I finish the faculty application process.
7
7
303
@Tim_Dettmers
Tim Dettmers
10 months
Today, I will give a talk about "The making of QLoRA" at the LLM Efficiency Challenge at 2:30pm, Room 356. I will also talk a bit about how I go about doing research, running experiments and figuring out "what works".
13
26
302
@Tim_Dettmers
Tim Dettmers
1 year
A major bug in 8-bit optimizers that could cause some instabilities later in training has been fixed. Please update bitsandbytes to 0.41.1 via `pip install -U bitsandbytes`. Now 8-bit optimizer should again reproduce 32-bit optimizer performance.
12
46
291
@Tim_Dettmers
Tim Dettmers
3 years
I have been working on 8-bit optimizers, and I am looking for testers for the initial release to test installation and ease of use. Uses up to 63% less GPU memory, faster/stabler training while maintaining performance. Currently, 8-bit Adam and 8-bit Momentum are supported. 1/5
Tweet media one
4
52
282
@Tim_Dettmers
Tim Dettmers
5 years
My new work with @LukeZettlemoyer on accelerated training of sparse networks from random weights to dense performance levels — no retraining required! Paper: Blog post: Code:
Tweet media one
2
81
277
@Tim_Dettmers
Tim Dettmers
11 months
Bitsandbytes now supports 4-bit store/load of any model. Load in 4-bit via `from_pretrained(name, ..., load_in_4bit=True, device_map='auto')`, then save/push the model to the hub. Get the newest bnb: `pip install -U bitsandbytes`. Implemented by Ruslan Svirschevski (gh: poedator).
4
38
274
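A minimal sketch of that save/load round trip, assuming recent transformers/bitsandbytes versions; the model name and output paths are placeholders.

```python
from transformers import AutoModelForCausalLM

model_id = "huggyllama/llama-7b"  # placeholder

# Load with 4-bit quantization, then save/push the already-quantized weights
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
model.save_pretrained("llama-7b-4bit")              # local 4-bit checkpoint
# model.push_to_hub("your-username/llama-7b-4bit")  # or push directly to the Hub
```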
@Tim_Dettmers
Tim Dettmers
2 years
We now have Int8 backprop support for all GPUs for bitsandbytes! Now available via 'pip install bitsandbytes'. This was a contribution from @sasha_borzunov . We will release Int8 fine-tuning for all @huggingface models soon — stay tuned!
6
40
270
@Tim_Dettmers
Tim Dettmers
2 years
The GPU blog post update is 90% done now. I think tomorrow morning, we will have an update! 🚀
7
5
260
@Tim_Dettmers
Tim Dettmers
9 months
We made a QLoRA promo video for @UWITNews . It is a very nice summary of the motivation behind QLoRA and what the environment was like to develop this research. @uwcse is a perfect place for doing such research! Article: Youtube:
3
44
256
@Tim_Dettmers
Tim Dettmers
1 year
@HamelHusain If you wait for another two weeks, we have something nice for you ;) With the right methods you can fine-tune a 30B model on that GPU. A 30B policy with 30B value function also works for RLHF.
16
15
251
@Tim_Dettmers
Tim Dettmers
9 months
The 0.42.0 bitsandbytes release adds 4-bit serialization, so you can save/load 4-bit weights directly. Otherwise, there are lots of bug fixes. Thank you, contributors! The next goal is Apple/AMD/Intel and Windows integration. We now have 1.5M installs per month.
Tweet media one
10
35
251
@Tim_Dettmers
Tim Dettmers
8 months
@willie_agnew Literally curing cancer. I talked to a biologist who used my methods in conjunction with open models to develop new methods for drug discovery. They developed drugs for previously incurable pediatric cancers. These are real wet lab in vitro results — it just works.
16
12
251
@Tim_Dettmers
Tim Dettmers
9 months
An excellent end-to-end guide for finetuning. It has all the details from data prep to deployment. If you want to finetune, this is a great resource to get started.
@_philschmid
Philipp Schmid
9 months
What's the best way to fine-tune open LLMs in 2024? Look no further! 👀 I am excited to share “How to Fine-Tune LLMs in 2024 with Hugging Face” using the latest research techniques, including Flash Attention, Q-LoRA, @OpenAI dataset formats (messages), ChatML, Packing, all built
20
171
852
3
45
243
@Tim_Dettmers
Tim Dettmers
5 years
This is really great work! For layer 5 pyramidal neurons: A dendritic branch = MLP with 1 layer, 4 units; the entire neuron = MLP with 7 layers, 128 units each. One bio neuron > most MNIST models. We have about 85bn neurons in total and >1tn dendrites — that is a lot of compute!
@DavidBeniaguev
David Beniaguev
5 years
A story of a Cortical Neuron as a Deep Artificial Neural Net: 1) Neurons in the brain are bombarded with massive synaptic input distributed across a large tree like structure - its dendritic tree. During this bombardment, the tree goes wild preprint:
23
659
2K
7
66
241
@Tim_Dettmers
Tim Dettmers
8 years
I updated my GPU advice blog post with the GTX 1080 Ti; also cleaned it so it is easier to find relevant information
2
98
240
@Tim_Dettmers
Tim Dettmers
2 years
FP8 training works well and has large benefits. It has steep networking requirements to achieve good utilization but there are solutions to that too (we will release one in the next days). It's a big shift and everyone with RTX 40s / H100 GPUs should look into FP8 training.
@NamanGoyal21
Naman Goyal
2 years
It's crazy that, at 60% model FLOPS (FP8) utilization on H100s, the original GPT-3 configuration can be trained in 3 days on 1024 H100s and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM released 9 months back.
13
35
390
7
20
226
@Tim_Dettmers
Tim Dettmers
3 years
I am a huge fan of einsum notation. Here is a multi-layer transformer in a couple lines of code (without norms though). I think it's simple to read, but whenever I show this to somebody in excitement they do not like it. I am curious: How is that for you? Easy to read or not?
Tweet media one
30
22
236
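The code image from this tweet is not preserved here; below is a small sketch in the same spirit (not the original code): multi-head self-attention written entirely with einsum, without norms, masking, or dropout.

```python
import torch

def einsum_attention(x, Wq, Wk, Wv, Wo):
    """x: [batch, seq, d_model]; Wq/Wk/Wv: [heads, d_model, d_head]; Wo: [heads, d_head, d_model]."""
    q = torch.einsum("bsd,hde->bhse", x, Wq)
    k = torch.einsum("bsd,hde->bhse", x, Wk)
    v = torch.einsum("bsd,hde->bhse", x, Wv)
    scores = torch.einsum("bhse,bhte->bhst", q, k) / q.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bhst,bhte->bhse", attn, v)
    return x + torch.einsum("bhse,hed->bsd", out, Wo)  # residual connection

batch, seq, d_model, heads, d_head = 2, 16, 64, 4, 16
x = torch.randn(batch, seq, d_model)
Wq, Wk, Wv = (torch.randn(heads, d_model, d_head) for _ in range(3))
Wo = torch.randn(heads, d_head, d_model)
print(einsum_attention(x, Wq, Wk, Wv, Wo).shape)  # torch.Size([2, 16, 64])
```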
@Tim_Dettmers
Tim Dettmers
1 year
I think it will take another day or two for the full integration, but the kernels (batch size 1) are ready.
Tweet media one
5
15
228
@Tim_Dettmers
Tim Dettmers
1 year
If you are merging adapters with QLoRA 4-bit weights, please use the gist below for merging. This will increase the performance of the QLoRA model. I think I have seen a PR on PEFT, so this will soon come to PEFT by default, but for now, it's better to merge it in this way.
@chris_hayduk1
Chris Hayduk
1 year
Just put together a gist for merging QLoRA with the quantized model weights as mentioned by @Tim_Dettmers @Teknium1 @erhartford Since I know you guys were looking into it. Should be able to quantize the whole thing after this without issue
5
13
92
2
26
221
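The gist itself is not reproduced here; a rough sketch of the underlying idea is to merge the LoRA delta into the dequantized 4-bit weights (the weights the adapter was actually trained against) rather than into the original 16-bit checkpoint. `dequantize_4bit` and the attribute names below may differ across bitsandbytes/peft versions, so treat this as an assumption-laden illustration rather than the gist's code.

```python
import torch
import bitsandbytes.functional as F

def merge_lora_into_4bit_linear(linear4bit, lora_A, lora_B, scaling):
    """Merge a LoRA update into the weights the 4-bit model actually used.

    linear4bit: a bnb Linear4bit layer; lora_A: [r, in], lora_B: [out, r].
    Returns a dense fp16 weight: dequant(W_4bit) + scaling * (B @ A).
    """
    # Dequantize the stored 4-bit weight back to fp16; merging here is more faithful
    # than merging into the original fp16 checkpoint the quantization started from.
    w = F.dequantize_4bit(linear4bit.weight.data, linear4bit.weight.quant_state).to(torch.float16)
    return w + scaling * (lora_B.to(torch.float16) @ lora_A.to(torch.float16))
```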
@Tim_Dettmers
Tim Dettmers
1 year
I forgot how much better Guanaco-65B is compared to 33B. You can try here via Petals (globally distributed inference): With Petals, you can also run a 65B model in a colab or locally on a small GPU at ~5 tokens/sec (see below).
@m_ryabinin
Max Ryabinin
1 year
@fernavid @Tim_Dettmers We've updated one just today: this notebook shows you: - how to run a 65B model from Colab, - how to plug in adapters between its layers, - and how to write custom generation methods (you can't do this with an API)
3
42
246
2
36
219
@Tim_Dettmers
Tim Dettmers
2 years
bitsandbytes is on track to surpass half a million pip installs this month! Upcoming features: - LLM.int8() support for all GPUs - Int8 backward for fine-tuning - Fast 4-bit float (FP4) kernels for inference Always looking for more people to get involved. There is lots to do!
6
15
213
@Tim_Dettmers
Tim Dettmers
1 year
The 0.38.0 release of bitsandbytes introduces: - 8-bit Lion which is 8x more memory efficient than standard Adam - Serialization of 8-bit layers now allows storing/loading 8-bit models to/from the HF Hub We are now at half a million installs per month!
Tweet media one
6
39
205
@Tim_Dettmers
Tim Dettmers
4 years
I have my first draft right now: 7,000 words. It will be quite a comprehensive post. If you have any more GPU-related questions for me, right now is the last chance for them to be added. I will freeze the draft tonight, rewrite tomorrow, and then publish it on Monday.
15
13
203
@Tim_Dettmers
Tim Dettmers
6 years
I am very excited and proud to announce that I will join UW as a PhD student this fall. I will work with Yejin Choi on common sense knowledge and reasoning. I believe that with common sense, intelligent machines will be able to benefit everyone equally.
13
3
203
@Tim_Dettmers
Tim Dettmers
4 years
Just working on an update to my GPU recommendation blog post. Will cover the new GPUs and focus on the cost-effectiveness of 2/4/8 GPU systems. Any other things that you would like to see discussed?
19
11
197
@Tim_Dettmers
Tim Dettmers
11 months
This model flew under the radar. It has the highest MMLU score of any open-source model. I have not tried it myself but I am curious how it compares to other models when evaluated across a broad range of tasks. Can somebody give it a try?
@01AI_Yi
Yi-01.AI
11 months
Our team at @01AI_Yi is very proud to introduce the release of the Yi-34B model, now on top of the @huggingface pretrained LLM leaderboard! A Yi-6B is also available. You are welcome to give it a try and build fantastic projects!
6
25
163
12
12
198
@Tim_Dettmers
Tim Dettmers
3 years
An important but elusive quality to learn in a PhD is research style. It is valuable to be aware of this before you start a PhD. Among other updates, I added an extensive discussion on research style to my "choosing a grad school" blog post. Enjoy!
4
36
187
@Tim_Dettmers
Tim Dettmers
4 years
Making good progress on the updated GPU recommendation blog post. Have almost all data crunched and it seems that I will have pretty accurate estimates of performance. Will probably publish it Friday morning. If you have any more questions that I should include, let me know!
12
8
186
@Tim_Dettmers
Tim Dettmers
1 year
Guanaco-33B holds up well. Controlled for the memory footprint, it's the best model. Since it was trained in 4-bit, it uses as much memory as a regular 7B model. The memory needed during fine-tuning is 17x less, so a 7B model is much more expensive to fine-tune than Guanaco 33B.
@lmarena_ai
lmarena.ai (formerly lmsys.org)
1 year
We are excited to announce the first major release of the Chatbot Arena conversation dataset! - 33K conversations with pairwise human preferences - 20 SOTA models such as GPT-4, Claude, and LLaMA-based Vicuna - From 13K unique IPs in the wild - An additional 3K expert-level
Tweet media one
Tweet media two
14
176
724
4
24
187
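A rough back-of-envelope for the memory-footprint claim (weights only, ignoring quantization constants and the KV cache):

```python
gb = lambda params, bits: params * bits / 8 / 1e9

print(gb(33e9, 4))   # ~16.5 GB: Guanaco-33B weights stored in 4-bit
print(gb(7e9, 16))   # ~14.0 GB: a regular 7B model in 16-bit, roughly the same ballpark
```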
@Tim_Dettmers
Tim Dettmers
1 year
One thing that I care about in bitsandbytes is to provide _broad_ accessibility to LLMs. GPUs up to 9 years old are supported by 4-bit inference in bitsandbytes and you will see good speedups.
@Tim_Dettmers Wow! This just gave Volta cards a new lease on life: Testing with 4xV100S and a 30B~ model. Got a 3.2x speedup! 7-8 tokens per second is very usable for an interactive chat experience.
Tweet media one
0
1
19
4
16
186
@Tim_Dettmers
Tim Dettmers
7 years
Deep learning hardware limbo is the battle between @Nvidia vs @AMD vs @IntelNervana for the throne of deep learning hardware. Learn who might win and why #dlearn #nlproc #ai
6
82
180
@Tim_Dettmers
Tim Dettmers
3 years
Going to write another GPU blog post update in the coming days. Are there any GPU questions that you would like to have answered? Will include popular Q&A in the blog post.
30
10
184
@Tim_Dettmers
Tim Dettmers
1 year
Below, some problems with QLoRA are highlighted (I should not have been so smirky 😅). I want to acknowledge some of these issues but also resolve others. We integrated our QLoRA codebase with 5 other open-source codebases before release, and it seems we created some issues along the way 🧵
@Tim_Dettmers
Tim Dettmers
1 year
No cons :)
8
7
86
1
28
182
@Tim_Dettmers
Tim Dettmers
6 years
A really nice blog post by @agrinh about recent progress in GANs and variational autoencoders. Gives a short overview about GANs and their problems and then dives deep into the newest methods from ICML2018.
0
47
182
@Tim_Dettmers
Tim Dettmers
5 years
After talking to many students about their grad school experience I compiled this blog post on "How to pick your grad school". I discuss all the important factors and details from contrasting but complementary perspectives. I hope it will be helpful!
6
46
178
@Tim_Dettmers
Tim Dettmers
4 years
We have confirmation that Tensor Cores in RTX 30 GPUs will be limited to make Quadro / Tesla cards more attractive for deep learning. This is the same as in the RTX 20 series. I will update my performance figures later today and will post an update.
7
29
171
@Tim_Dettmers
Tim Dettmers
2 months
@CarnegieMellon won me over. It is an amazing place. Highly collaborative, very collegial, close-knit, with excellent students and great support. Looking forward to my time there! I will take 2-3 PhD students for Fall 2025. Please apply to the CMU PhD program to work with me.
7
9
178
@Tim_Dettmers
Tim Dettmers
3 years
This is pretty significant for custom CUDA code. Even with years of CUDA experience, it is very difficult to write peak performance matrix multiplication code. CUTLASS is great, but it seems Triton has better performance, is more customizable, and you can write code in Python.
@OpenAI
OpenAI
3 years
We’re releasing Triton 1.0, an open-source Python-like programming language for writing efficient GPU code. OpenAI researchers with no GPU programming experience have used Triton to produce kernels that are 2x faster than their PyTorch equivalents.
117
1K
6K
4
19
175
@Tim_Dettmers
Tim Dettmers
2 years
I am currently preparing a new GPU blog post updated for the RTX 4090 etc. I am collecting some Q&A questions. If you have any questions that you would like me to answer in the blog post, please leave them here as a comment.
26
9
170
@Tim_Dettmers
Tim Dettmers
10 months
@srush_nlp @4evaBehindSOTA Regular transformers are notoriously difficult to sparsify; this is even true for the FFN layers in MoE transformers. But MoE layers are very different. You can also quantize them to 1 bit without any problem, but sparsification gives you better memory benefits than 1-bit quantization.
5
22
169
@Tim_Dettmers
Tim Dettmers
1 year
Just pushed a major CUDA-related update to pip for bnb. I need feedback because it's so difficult to test CUDA envs. It will either fix 90% of all CUDA issues, or fix 90% of issues and create many new ones 🫠. Please let me know if it works for you. I am ready to hotfix things.
7
15
168
@Tim_Dettmers
Tim Dettmers
2 months
The six months on the academic job market were brutal but also very successful. More than 125 individual interviews across 17 universities leading to 15 job offers. It was a unique experience for which I am very grateful. I will write up my learnings and insights soon.
4
0
168
@Tim_Dettmers
Tim Dettmers
6 years
It seems that the first data suggest that RTX 2080 Ti deep learning performance is very close to Titan V performance. Also key facts: Tensor Cores are programmable and NVLink can be used for data (+50 GB/s). NVLink makes PCIe lanes obsolete for parallelism.
8
42
167
@Tim_Dettmers
Tim Dettmers
1 year
Catch my posters today: SWARM parallelism (fault tolerant globally distributed): 11am, slot 217 k-bit Inference Scaling Laws (foundation of QLoRA and SpQR): 2pm, slot 824
0
34
166
@Tim_Dettmers
Tim Dettmers
2 years
Bitsandbytes now supports CUDA 12.1. Upcoming features are 8-bit Lion (contributed by lucidrains) and serialization of 8-bit layers (contributed by @m_ryabinin ) with which you can upload 8-bit models to @huggingface . Bitsandbytes is growing rapidly at 150k pip installs a week!
4
33
163
@Tim_Dettmers
Tim Dettmers
2 years
An update to our k-bit inference scaling laws paper: + Includes results for 175B OPT/BLOOM + Short analysis of scaling behavior of GPTQ + Better related work + Main takeaway: input-dependent quantization like GPTQ might unlock less than 4-bits.
3
40
162
@Tim_Dettmers
Tim Dettmers
3 years
Here is a fun fact: You can interpret Adam as the exponentially smoothed signal-to-noise ratio with a separate sign matrix (the direction of momentum). In other words: the largest Adam updates go to the parameters whose gradients have the largest signal and lowest noise over time. How? 1/n
1
13
160
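A compact way to state the claim, as a sketch that ignores bias correction and the epsilon term:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \approx \mathbb{E}[g], \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \approx \mathbb{E}[g^2]
```
```latex
\Delta\theta_t = -\eta\,\frac{m_t}{\sqrt{v_t}}
              \approx -\eta\,\operatorname{sign}\!\big(\mathbb{E}[g]\big)\,
              \frac{|\mathbb{E}[g]|}{\sqrt{\mathbb{E}[g^2]}}
```

The step magnitude is the exponentially smoothed ratio of the gradient's mean to its second moment, which is large when the gradient signal is consistent and small when it is noise-dominated, while the sign of the momentum carries the direction.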
@Tim_Dettmers
Tim Dettmers
3 months
The numbers on these inference GPU benchmarks seem low. Here are the theoretical values from my model of 8xB200 inference with NVLink, 8-bit weights, and a 70B Llama model, which come out closer to 300k tokens/s. This assumes perfect implementations (close to what OpenAI/Anthropic has).
Tweet media one
@Etched
Etched
3 months
Meet Sohu, the fastest AI chip of all time. With over 500,000 tokens per second running Llama 70B, Sohu lets you build products that are impossible on GPUs. One 8xSohu server replaces 160 H100s. Sohu is the first specialized chip (ASIC) for transformer models. By specializing,
Tweet media one
343
1K
6K
3
13
161
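A hedged back-of-envelope for the compute-bound, large-batch regime. The numbers are rough assumptions rather than quoted specs: roughly 4.5 PFLOPS of dense FP8 per B200, 2 FLOPs per parameter per generated token, and perfect utilization and overlap.

```python
flops_per_gpu = 4.5e15         # assumed dense FP8 throughput per B200 (rough figure)
num_gpus = 8
params = 70e9                  # Llama-70B
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token in the forward pass

tokens_per_second = num_gpus * flops_per_gpu / flops_per_token
print(f"{tokens_per_second:,.0f} tokens/s")  # ≈ 257,000, the same order as the ~300k figure above
```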
@Tim_Dettmers
Tim Dettmers
10 months
Looking forward to #NeurIPS2023 ! Would love to catch up with friends and colleagues. Please like this/DM me if you are coming too. Will give our QLoRA oral Tue, Dec 12 3:55pm. I am also on the job market this year — please promote my app at your university! More coming soon :)
5
8
161
@Tim_Dettmers
Tim Dettmers
6 years
Estimating performance and performance/dollar for GPUs has never been more complicated: Tensor Cores, 16-bit/32-bit, 3 generations of GPUs. However, I think these new charts will be quite accurate. Will update my blog post in the next days with more info on TPUs etc.
Tweet media one
Tweet media two
14
38
158
@Tim_Dettmers
Tim Dettmers
7 months
This is very exciting! Neural Magic has the best CPU kernels for LLM inference and now they are expanding to GPUs. Curious about their performance!
@neuralmagic
Neural Magic
7 months
Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. Details 👇
4
37
239
4
16
157
@Tim_Dettmers
Tim Dettmers
1 year
@abacaj When I designed the 8-bit routines, I designed them for training. They were never meant for inference; that is why it's slow. I have something designed for inference coming in the next two weeks which will be 5x faster than bf16.
6
14
155
@Tim_Dettmers
Tim Dettmers
1 year
The updated Petals is very exciting. Run a 65B model at 5 tokens/s using Colab. You can also fine-tune 100B+ models using Colab. The more people adopt Petals, the easier and faster it will be to work with large models with minimal resources.
@m_ryabinin
Max Ryabinin
1 year
I've also written a slightly longer blog post about Petals, covering both alternative methods for LLM inference and details of using this system from the client/server perspective. If you wanted to have a high-level overview of Petals in a non-thread form, this post is for you!
1
8
58
5
21
152
@Tim_Dettmers
Tim Dettmers
1 year
Soon ...
Tweet media one
2
8
143
@Tim_Dettmers
Tim Dettmers
1 year
At ICML I will be presenting: - k-bit Inference Scaling Laws - SWARM Parallelism I will also present a keynote at the ES-FoMo workshop. Please get in touch if you are around and want to catch up or meet. (particularly, would love to meet those I follow but have not met before)
1
6
141
@Tim_Dettmers
Tim Dettmers
1 year
This is a big achievement in unifying vision and language to produce high-quality models. In CM3Leon, everything is tokens, and you just train a decoder model on both image and language tokens. In the future, the only training procedure you need will be LLM-style pretraining.
@ArmenAgha
Armen Aghajanyan
1 year
I’m excited to release our most recent work setting a new SOTA FID of 4.88 on text-to-image generation we call CM3Leon (pronounced chameleon)!
Tweet media one
25
128
482
3
23
139
@Tim_Dettmers
Tim Dettmers
1 year
My slurm jobs right now:
Tweet media one
2
7
141
@Tim_Dettmers
Tim Dettmers
1 year
This is a very interesting analysis by @davis_yoshida about the NF4 data type from QLoRA. I discussed this with Davis, and he pointed out some issues which led to very curious findings: NF4 is not theoretically optimal but close to empirically optimal.🧵
@davis_yoshida
Davis Yoshida
1 year
I spent my memorial day weekend doing some thinking about quantization based on @TimDettmers et al's work: tl;dr: 1. 4-bit dtypes should vary with quantization block size 2. NF4 doesn't use code values uniformly 3. #2 is actually a good thing
1
24
91
2
14
140
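For intuition about what a quantile-based 4-bit code is, here is a rough sketch. This is not the exact NF4 construction from the paper (which handles zero and the tails asymmetrically); it just places 16 code values at evenly spaced quantiles of a standard normal and rescales them to [-1, 1].

```python
import numpy as np
from scipy.stats import norm

# 16 code values at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
# Weights (assumed roughly normal per block) are then mapped to the nearest code value.
probs = np.linspace(0, 1, 18)[1:-1]   # drop 0 and 1 to keep the quantiles finite
code = norm.ppf(probs)
code = code / np.abs(code).max()
print(np.round(code, 3))
```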
@Tim_Dettmers
Tim Dettmers
1 year
@Teknium1 @yacineMTB @abacaj If we run the big models that come soon in 3-bit with QLoRA, you can quickly tune to anything you want. It will be so fast, you can adapt a model while you chat with it about a task/sub-domain. In the next session, you can either stack or learn a new adapter. It will be awesome!
3
17
137
@Tim_Dettmers
Tim Dettmers
5 months
@karpathy @natfriedman I wrote an entire blog post about this: The main question is whether dendritic spikes contain information and, if so, how much. By today's evidence, it is pretty clear that dendritic spikes are important and a biological neuron is like a multi-layer MLP. With
6
11
137
@Tim_Dettmers
Tim Dettmers
4 years
Last year I wrote a comprehensive blog post about choosing a grad school for a PhD. People found it very useful -- even for thinking about postdoc and faculty positions. I hope it will also be helpful for this year's students to make this tough decision:
3
31
138