Casper Hansen Profile
Casper Hansen

@casper_hansen_

2,688 Followers
210 Following
114 Media
1,401 Statuses

NLP Scientist | AutoAWQ Creator | Open-Source Contributor

Joined August 2019
Pinned Tweet
@casper_hansen_
Casper Hansen
4 months
Hassabis: "Much of science can be summed up like this: Imagine all the knowledge in the world as a tree. Today's civilization knows only a tiny part of that tree. I think of AI as a tool that will enable scientists to one day explore the entire tree."
2
0
12
@casper_hansen_
Casper Hansen
2 months
Apple released a 7B model that beats Mistral 7B - but the kicker is that they fully open-sourced everything, including the pretraining dataset 🤯
29
502
3K
@casper_hansen_
Casper Hansen
3 months
People on X: US is way ahead of China in AI. Me: China is way ahead of the US, actually it's not even close. China: Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
34
68
627
@casper_hansen_
Casper Hansen
7 months
Mistral secretly released a new model called "next". Only available on the @lmsysorg chatbot arena.
Tweet media one
9
35
365
@casper_hansen_
Casper Hansen
6 months
Finetuning: 10k multi-turn conversations is all you need. The Yi paper explains that high-quality data beats quantity in fine-tuning and that you can beat other datasets that include 100-900k conversations just by focusing on quality in 10k conversations.
Tweet media one
13
49
359
@casper_hansen_
Casper Hansen
7 months
What if vLLM could be 3x faster? Soon to be a reality with FlashInfer! From the maintainer: "We are talking to the FlashInfer team and working on merging it with vLLM!"
Tweet media one
9
20
208
@casper_hansen_
Casper Hansen
5 months
The velocity of development on OpenDevin is incredible. In under a week, it has taken the throne from Devika in GitHub stars, and it also seems to be the more stable/usable of the two. I recommend you try it:
3
37
193
@casper_hansen_
Casper Hansen
7 months
AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗
Tweet media one
9
25
193
@casper_hansen_
Casper Hansen
2 months
vLLM 0.5.3 is out which means 1.56-2.59x speedup for AWQ models. What do you need to do? Nothing - it's automatically applied. Just pip install vllm==0.5.3.
Tweet media one
4
25
154
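For readers wondering what "nothing to do" looks like in practice, here is a minimal sketch of loading an AWQ checkpoint with vLLM's offline API. The repo ID is illustrative, and vLLM generally detects AWQ from the model config, so the explicit `quantization` flag is optional.

```python
# Minimal sketch: offline generation with an AWQ-quantized checkpoint in vLLM.
# The model ID below is illustrative; any AWQ checkpoint on the Hub should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",  # illustrative repo ID
    quantization="awq",  # optional; vLLM usually detects AWQ from the config
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```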
@casper_hansen_
Casper Hansen
1 month
It's not widely understood how @MistralAI make their MoE models. Just putting this out there:
- train dense model, e.g. 7B, 22B, 123B
- use sparse upcycling to create MoE from dense model
- train MoE
Benefits include stable training of the MoE.
Tweet media one
@casper_hansen_
Casper Hansen
1 month
@ziv_ravid It's actually very simple. First you train the dense model, then you do sparse upcycling to get an MoE. So the next step for Mistral is to train 8x123B.
3
1
30
7
17
142
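For anyone unfamiliar with the sparse-upcycling step described above, here is a rough PyTorch sketch of the idea (my illustration, not Mistral's code): every expert in the new MoE layer starts as a copy of the dense model's FFN, and only the router is freshly initialized before continued training.

```python
# Sketch of sparse upcycling: experts are copies of the dense FFN, router is new.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Every expert starts as an exact copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # The router is the only newly initialized component.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)       # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In practice you would swap every dense FFN block for a module like this, initialize from the dense checkpoint, and then continue pretraining, which is what makes the MoE training comparatively stable.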
@casper_hansen_
Casper Hansen
5 months
This year is going to be amazing: 256K total context, 140K context on ONE GPU???? Close performance to Mixtral while having 3x higher throughput. Apache 2.0. What a release!
@AI21Labs
AI21 Labs
5 months
Introducing Jamba, our groundbreaking SSM-Transformer open model! As the first production-grade model based on Mamba architecture, Jamba achieves an unprecedented 3X throughput and fits 140K context on a single GPU. 🥂Meet Jamba 🔨Build on @huggingface
Tweet media one
37
252
1K
1
10
140
@casper_hansen_
Casper Hansen
5 months
I did some research on LLMs as agents today. Here is a guide to the state of the art of LLMs as agents! It's all about environments where LLMs can observe, plan, act, and iterate on solutions. 🧵1/8
Tweet media one
4
14
123
@casper_hansen_
Casper Hansen
2 months
I was coding yesterday with GPT-4o, Claude 3.5, and DeepSeek V2. Unsurprisingly, Claude won by a wide margin in my attempts to optimize code. GPT-4o and DeepSeek V2 are roughly on the same level for coding, though - which one wins is very problem-dependent.
2
6
110
@casper_hansen_
Casper Hansen
30 days
First training run with Quiet-STaR 🍓. Initially, I am going with 12 thought tokens and 4 talk tokens. I was able to squeeze in a batch size of 64 on 4x A40 48GB GPUs.
Tweet media one
3
9
108
@casper_hansen_
Casper Hansen
2 months
AWQ will soon run up to 2.59x faster in vLLM on large models thanks to @neuralmagic
Tweet media one
5
16
107
@casper_hansen_
Casper Hansen
1 month
I can't help but wonder whether OpenAI's Q* thinks for every token or the model learns when to think. Once you go into thinking mode with Quiet-STaR, it needs to generate up to 16 tokens before continuing.
Tweet media one
@apples_jimmy
Jimmy Apples 🍎/acc
1 month
Little birdies told me 🍓 / Q* hasn’t released yet as they aren’t happy with the latency and other little things they want to further optimise. Not sure if big patience or small patience. Either way, @ChatGPTapp Shhh patience 🤫 🤫 🤫
0
74
902
2
5
105
@casper_hansen_
Casper Hansen
5 months
The next generation of models seems to mostly target infinite context and adaptive compute per token. Basically, these two papers:
- Google: Mixture-of-Depths
- Google: Infini-Attention
5
11
93
@casper_hansen_
Casper Hansen
5 months
I highly recommend watching this 20-minute video on SWE-Agent, where they go through a full problem and how it's solved using special tools.
2
24
91
@casper_hansen_
Casper Hansen
8 months
Since the summer of 2023, I have been working hard on AutoAWQ. You can now quantize Mixtral 8x7B, LLaVa, and other models and run inference through the transformers integration. 🧵 1/5
4
14
89
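For context, a typical AutoAWQ quantization flow looks roughly like this. This is a sketch following the library's usual usage; exact argument names can differ between versions, and the paths are illustrative.

```python
# Rough sketch of the usual AutoAWQ quantization flow; arguments follow the
# library's common usage and may differ slightly between versions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # illustrative source model
quant_path = "mixtral-8x7b-instruct-awq"              # illustrative output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + weight packing
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```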
@casper_hansen_
Casper Hansen
1 month
Self-Taught Reasoner. Q*. Project Strawberry. Lots to talk about here. I have been thinking about hacking together a real solution for this. Both training with Huggingface libraries and inference in vLLM. What should one name such a project? OpenSTaR?
6
6
87
@casper_hansen_
Casper Hansen
3 months
Claude 3.5 Sonnet is supposedly much better for agentic coding than previous Claude models. Who's going to re-evaluate aider, opendevin, swe-agent?
Tweet media one
5
9
84
@casper_hansen_
Casper Hansen
1 month
SearchGPT is a nice-to-have feature at this point. Nobody can take the crown from Perplexity after they introduced agentic search - it's simply too good, and OpenAI can't just steamroll them.
23
3
82
@casper_hansen_
Casper Hansen
28 days
The 405B was modified from 16 to 8 KV heads on HF. I wonder how much that mistake cost the open-source community 🤯 Not sure if AWQ needs requant or if it’s good as it is
4
5
75
@casper_hansen_
Casper Hansen
5 months
It has been amazing to see the community help provide quantized models ever since TheBloke disappeared. To mention a few: solidrust, mistral-community, MaziyarPanahi, Qwen, Huggingface MultiModality team
Tweet media one
7
7
72
@casper_hansen_
Casper Hansen
7 months
AutoAWQ now fuses MoE models. Mixtral now has the same inference speed as a 13B model.
- 2x 4090: 93 tokens/s
- 1x L40: 62 tokens/s (8x speedup🤯)
Thanks to various contributors at DeepSeek, vLLM, and Alibaba for providing optimized open-source kernels.
0
5
67
@casper_hansen_
Casper Hansen
4 months
The Sky voice was available from October 2023, so what's the problem? Even if the Sky voice emotes like SJ, i.e. is a bit sultry/flirty, it does not sound like her. The tonality is measurably different.
6
1
64
@casper_hansen_
Casper Hansen
4 months
The makers of AWQ and SmoothQuant have released QServe, a state-of-the-art quantization serving library that surpasses vLLM and TensorRT-LLM by 1.4x-3.5x. This release includes in-flight batching, paged attention, and models quantized to W4-A8-KV4.
Tweet media one
3
8
62
@casper_hansen_
Casper Hansen
2 months
AWQ models of Llama 3.1 are done and uploaded ✅ Should run in vLLM out of the box! Links below👇👇👇
3
4
63
@casper_hansen_
Casper Hansen
6 months
The AWQ paper is accepted to MLSys 2024 🥳. The AWQ authors have introduced new kernels:
- mix GEMV and GEMM for optimal performance (64% faster than ExLlamaV2 at decoding at batch size 32 👀)
- soon to be integrated into vLLM (PR 3289, pending exact throughput results)
@Shang_mit
Shang Yang
6 months
🚀 Unveil the newest advancement of TinyChat & AWQ: Visual Language Models (VLM) on the edge! 🌟Elevate image reasoning with TinyChat's integration of VILA (CVPR’24), which is effortlessly quantized with AWQ (recently accepted by MLSys’24! 🎉). #MLSys ’24 #CVPR ’24 #AWQ 📸✨(1/7)
7
17
59
0
8
62
@casper_hansen_
Casper Hansen
6 months
Am I reading this correctly, or is this intentionally misleading? They show you can train a Llama 2 7B model 29.1x cheaper in end-to-end energy cost. That would be a reduction from 184,320 to 6,334 A100 hours.
Tweet media one
6
2
57
@casper_hansen_
Casper Hansen
2 months
MI300x matching H100 performance is incredible. The big difference is 192GB vs 80GB VRAM. Over double the VRAM makes a huge difference for which models you can serve.
@runpod_io
RunPod
2 months
There's been a lot of hype around AMD's MI300X recently, but very few benchmarks have been made open-source. So, we ran a series of benchmarks on Mixtral 7x8B and compared AMD's performance against Nvidia's H100 SXM. TL;DR: - MI300X outperforms H100 SXM at both small and large
Tweet media one
5
25
120
6
2
56
@casper_hansen_
Casper Hansen
2 months
Meta released their H2 plan for torchtune, their PyTorch finetuning library. In that plan, Meta describes supporting multiple Llama releases. This probably means we will get more than just a Llama 3 405B. We will likely get multimodal models and versions specific to agents.
Tweet media one
3
4
51
@casper_hansen_
Casper Hansen
2 months
@Teknium1 they cracked this benchmark. not sure how. but there is no way it's actually that good. especially when looking at coding, it's 100% gamed
2
0
48
@casper_hansen_
Casper Hansen
5 months
It seems Mistral will finally be sharing how to finetune their MoE models 👀 As many finetuners know, there have been issues getting the Mixtral model to finetune well ever since it was released, so I am excited to see Mistral enable us
Tweet media one
2
7
46
@casper_hansen_
Casper Hansen
4 months
Why are companies still using needle-in-a-haystack to showcase their long context length? This particular benchmark is like 1/10 useful.
@Gradient_AI_
Gradient
4 months
We’re going back 2 back! 🔥 Introducing the first 1M context window @AIatMeta Llama-3 70B to pair with our Llama-3 8B model that we launched last week on @huggingface . Our 1M context window 70B model landed a perfect score on NIAH and we’re excited about the results that
Tweet media one
40
112
659
15
0
44
@casper_hansen_
Casper Hansen
1 month
Why is @MistralAI 's Large 2 not listed on @lmsysorg ? I just used it today and it was the only model that could refactor part of complex code into a separate function
4
0
44
@casper_hansen_
Casper Hansen
8 months
These results (2.66x higher throughput) were unexpected at first. But it kind of makes sense given that vLLM uses blocked KV-cache, meaning it has to process context before it can continue decoding.
3
9
43
@casper_hansen_
Casper Hansen
2 months
@Teknium1 That coincides with the ICML talk that @soumithchintala will be giving at 9am
Tweet media one
3
1
36
@casper_hansen_
Casper Hansen
2 months
Llama 3 405B AWQ: ASCII chart of vLLM performance with the new kernels on 4x H100. We can gracefully process so many tokens/second thanks to hyper-optimized kernels from @neuralmagic .
Tweet media one
4
6
36
@casper_hansen_
Casper Hansen
5 months
Llama 3 MMLU:
- 70B: 82% (beating Gemini Pro 1.5, Claude 3 Sonnet)
- 8B: 68.4% (beating Gemma 7B, Mistral 7B)
Tweet media one
5
3
33
@casper_hansen_
Casper Hansen
1 month
@ziv_ravid It's actually very simple. First you train the dense model, then you do sparse upcycling to get an MoE. So the next step for Mistral is to train 8x123B.
3
1
30
@casper_hansen_
Casper Hansen
5 months
@JeremyNguyenPhD Part of me feels this should be normalized to the count of papers published
2
0
31
@casper_hansen_
Casper Hansen
3 months
AutoAWQ now supports CPU inference (x86). This was contributed directly by Intel. Speed may vary - Intel pushed tokens/s to 18 on the Mistral 7B model on a high-end CPU. High clock speed and high memory bandwidth are key to good performance on CPUs.
Tweet media one
3
4
29
@casper_hansen_
Casper Hansen
6 months
Mistral Large was just released along with Le Chat. MMLU: 81.2% (GPT-4: 86.4%). I think this is incredibly close and I bet I will find their models more useful for coding.
Tweet media one
5
3
26
@casper_hansen_
Casper Hansen
6 months
It seems likely to me that Mistral was trained on 4T tokens, based on:
- Gemma was trained on 6T tokens
- Yi was trained on 3T tokens
Tweet media one
2
0
28
@casper_hansen_
Casper Hansen
7 months
The new AWQ kernel from SqueezeBits, "QUICK", demonstrates up to 1.94x higher throughput in vLLM.
Tweet media one
1
0
28
@casper_hansen_
Casper Hansen
4 months
This is the first good model from China that I have seen released under the Apache 2.0 license! Is this a mistake or the new way going forward? The Yi team also smashed the evaluations; their models are very performant.
@_philschmid
Philipp Schmid
4 months
First Apache 2.0 release of YI! 😍 @01AI_Yi 1.5 is a continuously pre-trained version of YI 1.0 on 500B tokens to improve coding, reasoning, and instruction-following capability. It comes in 3 sizes, with the biggest Yi 1.5 34B almost matching Meta Llama 3 70B on benchmarks! 🤯
Tweet media one
12
65
253
3
0
27
@casper_hansen_
Casper Hansen
5 months
Databricks released a model better than Mixtral. Highlights:
- 73.7% MMLU, beats Mixtral, loses to Claude Haiku
- MoE with 132B params, 36B active
- Trained on 12T tokens
Tweet media one
1
3
26
@casper_hansen_
Casper Hansen
5 months
This is absolutely huge. The open-source world has needed this dataset for a long time and now it’s here!
@gui_penedo
Guilherme Penedo
5 months
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
Tweet media one
40
345
2K
0
2
26
@casper_hansen_
Casper Hansen
5 months
Llama 3 8B Instruct - quantized with AWQ.
2
2
27
@casper_hansen_
Casper Hansen
2 months
Quantization makes incredible sense as models scale in parameters. DeepSeek V2 (236B) requires 2x80GB in 4 bits. Or a single AMD MI300X GPU with 192GB VRAM. @TensorWaveCloud and @runpod_io seem to be great options for AMD hardware as you can avoid running a full 8x A100.
1
3
26
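The back-of-the-envelope arithmetic behind that sizing, counting weights only (KV cache, activations, and quantization metadata add more on top):

```python
# Weight-only memory estimate for DeepSeek V2 (236B parameters).
params = 236e9
for bits in (16, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{gb:.0f} GB")
# 16-bit: ~472 GB (many GPUs); 4-bit: ~118 GB, which fits in 2x80 GB
# with room for KV cache, or on a single 192 GB MI300X.
```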
@casper_hansen_
Casper Hansen
4 months
@JustinLin610 The general problem is custom licenses because they need to be evaluated by legal departments. It would be easier if models adopted a fairly standard license like Apache 2.0 or MIT because they are generally recognized as safe/usable.
1
0
26
@casper_hansen_
Casper Hansen
9 days
@huybery I think you need... a vision-language model... to decode this blurred image 😇
1
0
24
@casper_hansen_
Casper Hansen
6 months
@francoisfleuret This one is recent: MIT licensed and beats GPT-4V
0
0
23
@casper_hansen_
Casper Hansen
6 months
I wonder how long it will be before someone cashes in the $1200 bounty on Mixtral by importing vLLM's MoE Triton kernels into axolotl (and implementing the backward pass). @jeremyphoward @winglian Any thoughts on what is needed here to run a backward pass?
2
2
21
@casper_hansen_
Casper Hansen
2 months
@natolambert I agree with you here. Just exciting to see Apple out of everyone contributing :)
2
0
23
@casper_hansen_
Casper Hansen
5 months
The new AWQ kernels are so much faster at decoding. This is a simple benchmark on an RTX 4090 using the AutoAWQ/transformers backend. It scales incredibly well - 64% faster at batch size 32 according to my numbers.
Tweet media one
3
1
22
@casper_hansen_
Casper Hansen
1 month
Sharing for awareness: AWQ models may have produced lower-accuracy responses when using the Marlin kernels due to a precision issue (which is now being fixed). For example, GSM8K went from 58% to 73% accuracy after the fix by neuralmagic.
Tweet media one
3
3
23
@casper_hansen_
Casper Hansen
7 months
Realization: With the AMD MI300's 5.2 TB/s of memory bandwidth, you can get Groq-level speed with a fully optimized engine. Now begins the period where AMD slowly eats into Nvidia's inference market share. Inference is already compatible with vLLM.
4
2
22
@casper_hansen_
Casper Hansen
3 months
Releasing Llama 3 400B will be such a power move from Meta business-wise. If the model is as good as advertised, which we have good reason to believe it is, then Meta will weaken OpenAI, Anthropic, Google, and Microsoft. I wonder if they will release specialized quantization with it.
5
4
22
@casper_hansen_
Casper Hansen
6 months
Llama 3 in July, they had a year to train it. My expectations are that they beat GPT-4 but that we will have GPT-5 by that time.
3
0
22
@casper_hansen_
Casper Hansen
2 months
@intrstllrninja You can also find this linked on the model page in the collection for DCLM
0
2
22
@casper_hansen_
Casper Hansen
7 months
A more detailed view of the speedup (2.84x if you have models with very long inputs 👀).
Tweet media one
2
3
21
@casper_hansen_
Casper Hansen
7 months
@erhartford @OpenAI @Google @MistralAI @AIatMeta @xai @elonmusk Wouldn't it be embarrassing to be unable to drop a better 7B model than Mistral though? They would get absolutely slaughtered if they couldn't. One scenario is that GPT-5 drops and a small version of GPT-3.5 is open-sourced (it's not making them money anyway).
3
0
21
@casper_hansen_
Casper Hansen
5 months
8B and 70B AWQ are out
4
4
20
@casper_hansen_
Casper Hansen
2 months
After almost a year of working on quantizing large language models, I look back both proud and amazed that so much could happen during my time in the open-source world:
- Google: offers models quantized using AutoAWQ on Google Cloud. (1/2)
Tweet media one
2
1
20
@casper_hansen_
Casper Hansen
6 months
Deduplication is sort of well-known by now. But where do we find implementations of this data pipeline, specifically before and after deduplication?
Tweet media one
2
3
19
@casper_hansen_
Casper Hansen
1 month
Honestly amazing numbers for AWQ since the Llama 3.1 release. Seems most people are trying out the 70B, but a good number of people are also going for the 405B, which is impressive!
- 8B AWQ: 4676 downloads (28.7%)
- 70B AWQ: 9735 downloads (59.8%)
- 405B AWQ: 1872 downloads (11.5%)
1
2
19
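The percentages are simply each size's share of the total AWQ downloads; a quick check:

```python
# Each size's share of the total AWQ downloads from the tweet above.
downloads = {"8B": 4676, "70B": 9735, "405B": 1872}
total = sum(downloads.values())
for size, n in downloads.items():
    print(f"{size} AWQ: {n} downloads ({n / total:.1%})")
# -> 28.7%, 59.8%, 11.5% of 16,283 total AWQ downloads
```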
@casper_hansen_
Casper Hansen
4 months
Full evaluation of GPT-4o: 88.7% MMLU, 90.2% HumanEval, 53.6% GPQA Can you feel the steamroll?
Tweet media one
5
6
18
@casper_hansen_
Casper Hansen
2 months
It's merged! Now we're waiting on a release, and I will put in a PR to update the vLLM docs on how to properly use it.
@casper_hansen_
Casper Hansen
2 months
AWQ will soon run up to 2.59x faster in vLLM on large models thanks to @neuralmagic
Tweet media one
5
16
107
1
0
19
@casper_hansen_
Casper Hansen
1 month
Most people don't know this about me... I used to blog about machine learning basics with millions of readers. Just like the nugget below, but vastly expanded with code and walkthroughs of exactly how to do it from scratch. Now that blogging is dead, X feels like a good medium.
@casper_hansen_
Casper Hansen
1 month
It's not widely understood how @MistralAI make their MoE models. Just putting this out there:
- train dense model, e.g. 7B, 22B, 123B
- use sparse upcycling to create MoE from dense model
- train MoE
Benefits include stable training of the MoE.
Tweet media one
7
17
142
3
1
19
@casper_hansen_
Casper Hansen
1 month
Insane statistics on Llama 3.1 downloads. Growing adoption of quantization is showing in the numbers!
- 70B: 119k (24% were AWQ)
- 405B: 29k (11.5% were AWQ)
This is without subtracting anything related to training or testing, just raw numbers on the instruct models.
Tweet media one
2
2
18
@casper_hansen_
Casper Hansen
2 months
@lmsysorg i am highly skeptical that gpt-4o mini is actually this good for coding.
Tweet media one
0
0
18
@casper_hansen_
Casper Hansen
4 months
@BasedBeffJezos The US is not ahead of China in AI. That’s an illusion that we like to believe. The reality is that China is beating the US in AI modeling every day of the week, but they focus on Chinese and not English.
7
3
17
@casper_hansen_
Casper Hansen
2 months
AMD just acquired the largest AI lab in Europe. I worked with them before; it's essentially a consultancy. They trained an LLM, Viking 13B, a Scandinavian model trained on 4096 AMD MI250X GPUs.
0
4
18
@casper_hansen_
Casper Hansen
6 months
@Carnage4Life I mean Apple is kind of right. They are giving Spotify a full ecosystem and a lot of support, but see no monetary gain from it.
14
0
18
@casper_hansen_
Casper Hansen
5 months
@erhartford Having tested both models, I feel like DBRX is sort of an unenthusiastic instruct model. It should have huge potential for finetuning to create a model that's actually interesting though - let's not forget it saw 12T tokens
1
1
18
@casper_hansen_
Casper Hansen
5 months
Another Phi model that looks great on benchmarks but feels “meh”. I think a lot of people will be disappointed, especially after the claims of beating Llama 3, Mixtral, and others.
5
2
17
@casper_hansen_
Casper Hansen
7 months
Over the last couple of months, many new features have been introduced to AutoAWQ 🚀 I am gearing up for a v0.2.0 release with the following features:
- PEFT training
- 60% faster prefilling for GEMM
- ExLlamaV2 + AMD support
- GGUF export
3
2
16
@casper_hansen_
Casper Hansen
4 months
Why are people still using RoPE or PoSE when we have infini-attention?
4
2
16
@casper_hansen_
Casper Hansen
5 years
Explaining Feedforward, Backpropagation and Optimization: The Math Explained Clearly with Visualizations.
1
5
13
@casper_hansen_
Casper Hansen
1 month
AWQ Mistral Large Instruct (123B) is in the pipeline with compute support from @runpod_io . It will be automatically uploaded to HuggingFace once the quant is finished. Should be under 24 hours! 🤞
Tweet media one
2
0
17
@casper_hansen_
Casper Hansen
6 months
Remember the Google "We Have No Moat, And Neither Does OpenAI" article? As it turns out, Google is the company with the real moat: compute. Google is increasing its compute with TPUs exponentially, whereas OpenAI's growth is sublinear.
6
2
16
@casper_hansen_
Casper Hansen
7 months
A good system prompt is all you need
0
6
16
@casper_hansen_
Casper Hansen
1 month
LMSys releases SGLang v2.0 and claims big speedups.
- Incredibly efficient batch scheduler
- Can continue to scale the throughput with larger batch sizes
- Other systems cannot scale their throughput or batch sizes due to OOM, missing extensive manual tuning, or other overheads
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
16
@casper_hansen_
Casper Hansen
3 months
Karpathy's new LLM101n syllabus includes a chapter on quantization. Which algorithm do you think he will cover? Will he simply go for FP8?
Tweet media one
1
1
16
@casper_hansen_
Casper Hansen
5 months
3/5 CodeAct is open-source: dataset, model, code, UI. Observe, Plan, Act, Iterate. CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions.
Tweet media one
1
2
15
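As a mental model of the Observe, Plan, Act, Iterate loop described above, here is a generic sketch (my illustration, not CodeAct's actual implementation); `llm` and `execute_code` are assumed callables for the model and a sandboxed interpreter.

```python
# Generic observe-plan-act-iterate loop in the spirit of CodeAct (a sketch only):
# the LLM emits code as its "action", the environment executes it, and the output
# is fed back as the next observation.
def run_agent(llm, execute_code, task: str, max_turns: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = llm(history)                      # plan + act: code block or final answer
        if not action.strip().startswith("```"):   # convention: non-code reply means done
            return action
        observation = execute_code(action)          # act: run the emitted code in a sandbox
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "Stopped after max_turns without a final answer."
```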
@casper_hansen_
Casper Hansen
6 months
Grok-1 is a significant release and should not be underestimated. Other companies will save millions of $ in pretraining since Grok is likely undertrained and needs continued pretraining.
1
1
14
@casper_hansen_
Casper Hansen
5 months
@_philschmid @OpenAI Feels a little strange to only report MT-Bench
1
0
14
@casper_hansen_
Casper Hansen
1 month
@kimmonismus Some details you missed on Quiet-STaR:
- it has a DQN reward in its loss function
- Quiet-STaR finds the best generation path; you can think of each step in the generation path as a state
- the loss function is negative log-likelihood + REINFORCE (borrowing from STaR)
Tweet media one
1
1
15
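Schematically, and as my own paraphrase rather than the paper's exact notation, the combined objective looks something like the following, where the reward r_j measures how much a sampled thought improves the likelihood of the next few ground-truth tokens relative to a baseline:

```latex
% Schematic only -- a paraphrase of the Quiet-STaR objective, not the paper's notation.
% L_NLL is the usual next-token loss; the REINFORCE term rewards sampled thoughts t_j
% by how much they improve the likelihood of the next n ground-truth tokens.
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{NLL}}(\theta)
  + \lambda\, \mathbb{E}_{t_j \sim p_\theta}\!\bigl[-\, r_j \log p_\theta(t_j \mid x_{\le j})\bigr],
\qquad
r_j = \log p_\theta\!\left(x_{j+1:j+n} \mid x_{\le j}, t_j\right)
    - \overline{\log p_\theta\!\left(x_{j+1:j+n} \mid x_{\le j}\right)}.
```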
@casper_hansen_
Casper Hansen
2 months
So how is that supposed to work? What is going to prevent someone from downloading the weights from Europe?
@AndrewCurran_
Andrew Curran
2 months
From this: - Llama 4 started training in June - Llama 4 will be fully multimodal, including audio - Llama 3 405b will still release in the EU - Llama 4, and beyond, will not be released in the EU unless they allow META to use EU Facebook and Instagram posts as training data.
Tweet media one
37
139
863
3
0
15
@casper_hansen_
Casper Hansen
8 months
Thanks to @erhartford , @ldjconfirmed and @jon_durbin for building true open source, each in a unique way. 📘 Finetuning is a crowded space. You guys make a huge difference with open sourced code, models, datasets, and weights.🟢
3
2
14
@casper_hansen_
Casper Hansen
4 months
Almost 9k downloads in 9 days. Thanks to everyone using AWQ to maximize speed & quality!
Tweet media one
2
0
14
@casper_hansen_
Casper Hansen
5 months
Each time Huggingface is down, it’s a reminder of how dependent we are on them. Huggingface is great but we need more decentralization that is plug and play like the HF ecosystem.
1
0
13
@casper_hansen_
Casper Hansen
6 months
Claude 3 Opus is getting incredibly close to a PhD level of knowledge
@idavidrein
david rein
6 months
Claude 3 gets ~60% accuracy on GPQA. It's hard for me to understate how hard these questions are—literal PhDs (in different domains from the questions) with access to the internet get 34%. PhDs *in the same domain* (also with internet access!) get 65% - 75% accuracy.
Tweet media one
23
215
1K
2
1
14
@casper_hansen_
Casper Hansen
5 years
@iamtrask Lesson learned, now I know how to write titles for my blogposts :)
0
0
13