Casper Hansen Profile
Casper Hansen

@casper_hansen_

2,688 Followers
210 Following
114 Media
1,401 Statuses

NLP Scientist | AutoAWQ Creator | Open-Source Contributor

Joined August 2019
Pinned Tweet
@casper_hansen_
Casper Hansen
4 months
Hassabis: "Much of science can be summed up like this: Imagine all the knowledge in the world as a tree. Today's civilization knows only a tiny part of that tree. I think of AI as a tool that will enable scientists to one day explore the entire tree."
2
0
12
@casper_hansen_
Casper Hansen
2 months
Apple released a 7B model that beats Mistral 7B - but the kicker is that they fully open-sourced everything, including the pretraining dataset 🤯
29
502
3K
@casper_hansen_
Casper Hansen
3 months
People on X: US is way ahead of China in AI. Me: China is way ahead of the US, actually it's not even close. China: Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
34
68
627
@casper_hansen_
Casper Hansen
7 months
Mistral secretly released a new model called "next". Only available on the @lmsysorg chatbot arena.
Tweet media one
9
35
365
@casper_hansen_
Casper Hansen
6 months
Finetuning: 10k multi-turn conversations is all you need. The Yi paper explains that high-quality data beats quantity in fine-tuning and that you can beat other datasets that include 100-900k conversations just by focusing on quality in 10k conversations.
Tweet media one
13
49
359
@casper_hansen_
Casper Hansen
7 months
What if vLLM could be 3x faster? Soon to be a reality with FlashInfer! From the maintainer: "We are talking to the FlashInfer team and working on merging it with vLLM!"
Tweet media one
9
20
208
@casper_hansen_
Casper Hansen
5 months
The velocity of development on OpenDevin is incredible. In under a week, it has taken the throne from Devika in GitHub stars, and it also seems to be the more stable/usable of the two. I recommend you try it:
3
37
193
@casper_hansen_
Casper Hansen
7 months
AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗
Tweet media one
9
25
193
@casper_hansen_
Casper Hansen
2 months
vLLM 0.5.3 is out which means 1.56-2.59x speedup for AWQ models. What do you need to do? Nothing - it's automatically applied. Just pip install vllm==0.5.3.
Tweet media one
4
25
154
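For readers wondering what "nothing to do" looks like in practice, here is a minimal sketch of loading an AWQ checkpoint with vLLM's offline API. The repo ID is illustrative, and vLLM generally detects AWQ from the model config, so the explicit `quantization` flag is optional.

```python
# Minimal sketch: offline generation with an AWQ-quantized checkpoint in vLLM.
# The model ID below is illustrative; any AWQ checkpoint on the Hub should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",  # illustrative repo ID
    quantization="awq",  # optional; vLLM usually detects AWQ from the config
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```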
@casper_hansen_
Casper Hansen
1 month
It's not widely understood how @MistralAI make their MoE models. Just putting this out there:
- train dense model, e.g. 7B, 22B, 123B
- use sparse upcycling to create MoE from dense model
- train MoE
Benefits include stable training of the MoE.
Tweet media one
@casper_hansen_
Casper Hansen
1 month
@ziv_ravid It's actually very simple. First you train the dense model, then you do sparse upcycling to get an MoE. So the next step for Mistral is to train 8x123B.
3
1
30
7
17
142
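For anyone unfamiliar with the sparse-upcycling step described above, here is a rough PyTorch sketch of the idea (my illustration, not Mistral's code): every expert in the new MoE layer starts as a copy of the dense model's FFN, and only the router is freshly initialized before continued training.

```python
# Sketch of sparse upcycling: experts are copies of the dense FFN, router is new.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_size: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Every expert starts as an exact copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
        # The router is the only newly initialized component.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)       # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In practice you would swap every dense FFN block for a module like this, initialize from the dense checkpoint, and then continue pretraining, which is what makes the MoE training comparatively stable.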
@casper_hansen_
Casper Hansen
5 months
This year is going to be amazing: 256K total context, 140K context on ONE GPU???? Close performance to Mixtral while having 3x higher throughput. Apache 2.0. What a release!
@AI21Labs
AI21 Labs
5 months
Introducing Jamba, our groundbreaking SSM-Transformer open model! As the first production-grade model based on Mamba architecture, Jamba achieves an unprecedented 3X throughput and fits 140K context on a single GPU. 🥂Meet Jamba 🔨Build on @huggingface
Tweet media one
37
252
1K
1
10
140
@casper_hansen_
Casper Hansen
5 months
I did some research on LLMs as agents today. Here is a guide to the state of the art of LLMs as agents! It's all about environments where LLMs can observe, plan, act, and iterate on solutions. 🧵1/8
Tweet media one
4
14
123
@casper_hansen_
Casper Hansen
2 months
I was coding yesterday with GPT-4o, Claude 3.5, and DeepSeek V2. Unsurprisingly, Claude won by a wide margin in my attempts to optimize code. GPT-4o and DeepSeek V2 are roughly on the same level for coding, though - which one wins is very problem-dependent.
2
6
110
@casper_hansen_
Casper Hansen
30 days
First training run with Quiet-STaR 🍓. Initially, I am going with 12 thought tokens and 4 talk tokens. I was able to squeeze in a batch size of 64 on 4x A40 48GB GPUs.
Tweet media one
3
9
108
@casper_hansen_
Casper Hansen
2 months
AWQ will soon run up to 2.59x faster in vLLM on large models thanks to @neuralmagic
Tweet media one
5
16
107
@casper_hansen_
Casper Hansen
1 month
I can't help but wonder whether OpenAI's Q* thinks for every token or the model learns when to think. Once you go into thinking mode with Quiet-STaR, it needs to generate up to 16 tokens before continuing.
Tweet media one
@apples_jimmy
Jimmy Apples 🍎/acc
1 month
Little birdies told me 🍓 / Q* hasn’t released yet as they aren’t happy with the latency and other little things they want to further optimise. Not sure if big patience or small patience. Either way, @ChatGPTapp Shhh patience 🤫 🤫 🤫
0
74
902
2
5
105
@casper_hansen_
Casper Hansen
5 months
The next generation of models seems to mostly target infinite context and adaptive compute per token. Basically, these two papers:
- Google: Mixture-of-Depths
- Google: Infini-Attention
5
11
93
@casper_hansen_
Casper Hansen
5 months
I highly recommend watching this 20-minute video on SWE-Agent, where they go through a full problem and how it's solved using special tools.
2
24
91
@casper_hansen_
Casper Hansen
8 months
Since the summer of 2023, I have been working hard on AutoAWQ. You can now quantize Mixtral 8x7B, LLaVa, and other models and run inference through the transformers integration. 🧵 1/5
4
14
89
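For context, a typical AutoAWQ quantization flow looks roughly like this. This is a sketch following the library's usual usage; exact argument names can differ between versions, and the paths are illustrative.

```python
# Rough sketch of the usual AutoAWQ quantization flow; arguments follow the
# library's common usage and may differ slightly between versions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # illustrative source model
quant_path = "mixtral-8x7b-instruct-awq"              # illustrative output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + weight packing
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```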
@casper_hansen_
Casper Hansen
1 month
Self-Taught Reasoner. Q*. Project Strawberry. Lots to talk about here. I have been thinking about hacking together a real solution for this. Both training with Huggingface libraries and inference in vLLM. What should one name such a project? OpenSTaR?
6
6
87
@casper_hansen_
Casper Hansen
3 months
Claude 3.5 Sonnet is supposedly much better for agentic coding than previous Claude models. Who's going to re-evaluate aider, opendevin, swe-agent?
Tweet media one
5
9
84
@casper_hansen_
Casper Hansen
1 month
SearchGPT is a nice-to-have feature at this point. Nobody can take the crown from Perplexity after they introduced agentic search - it's simply too good, and OpenAI can't just steamroll them.
23
3
82
@casper_hansen_
Casper Hansen
28 days
The 405B was modified from 16 to 8 KV heads on HF. I wonder how much that mistake cost the open-source community 🤯 Not sure if AWQ needs requant or if it’s good as it is
4
5
75
@casper_hansen_
Casper Hansen
5 months
It has been amazing to see the community help provide quantized models ever since TheBloke disappeared. To mention a few: solidrust, mistral-community, MaziyarPanahi, Qwen, Huggingface MultiModality team
Tweet media one
7
7
72
@casper_hansen_
Casper Hansen
7 months
AutoAWQ now fuses MoE models. Mixtral now has the same inference speed as a 13B model.
- 2x 4090: 93 tokens/s
- 1x L40: 62 tokens/s (8x speedup🤯)
Thanks to various contributors at DeepSeek, vLLM, and Alibaba for providing optimized open-source kernels.
0
5
67
@casper_hansen_
Casper Hansen
4 months
The Sky voice was available from October 2023, so what's the problem? Even if the Sky voice emotes like SJ, i.e. is a bit sultry/flirty, it does not sound like her. The tonality is measurably different.
6
1
64
@casper_hansen_
Casper Hansen
4 months
The makers of AWQ and SmoothQuant have released QServe, a state-of-the-art quantization serving library that surpasses vLLM and TensorRT-LLM by 1.4x-3.5x. This release includes in-flight batching, paged attention, and models quantized to W4-A8-KV4.
Tweet media one
3
8
62
@casper_hansen_
Casper Hansen
2 months
AWQ models of Llama 3.1 are done and uploaded ✅ Should run in vLLM out of the box! Links below👇👇👇
3
4
63
@casper_hansen_
Casper Hansen
6 months
The AWQ paper is accepted to MLSys 2024 🥳. The AWQ authors have introduced new kernels:
- mix GEMV and GEMM for optimal performance (64% faster than ExLlamaV2 at decoding at batch size 32 👀)
- soon to be integrated into vLLM (PR 3289, pending exact throughput results)
@Shang_mit
Shang Yang
6 months
🚀 Unveil the newest advancement of TinyChat & AWQ: Visual Language Models (VLM) on the edge! 🌟Elevate image reasoning with TinyChat's integration of VILA (CVPR’24), which is effortlessly quantized with AWQ (recently accepted by MLSys’24! 🎉). #MLSys ’24 #CVPR ’24 #AWQ 📸✨(1/7)
7
17
59
0
8
62
@casper_hansen_
Casper Hansen
6 months
Am I reading this correctly, or is this intentionally misleading? They show you can train a Llama 2 7B model 29.1x cheaper in end-to-end energy cost. That would be a reduction from 184,320 to 6,334 A100 hours.
Tweet media one
6
2
57
@casper_hansen_
Casper Hansen
2 months
MI300x matching H100 performance is incredible. The big difference is 192GB vs 80GB VRAM. Over double the VRAM makes a huge difference for which models you can serve.
@runpod_io
RunPod
2 months
There's been a lot of hype around AMD's MI300X recently, but very few benchmarks have been made open-source. So, we ran a series of benchmarks on Mixtral 7x8B and compared AMD's performance against Nvidia's H100 SXM. TL;DR: - MI300X outperforms H100 SXM at both small and large
Tweet media one
5
25
120
6
2
56
@casper_hansen_
Casper Hansen
2 months
Meta released their H2 plan for torchtune, their PyTorch finetuning library. In that plan, Meta describes supporting multiple Llama releases. This probably means we will get more than just a Llama 3 405B. We will likely get multimodal models and versions specific to agents.
Tweet media one
3
4
51
@casper_hansen_
Casper Hansen
2 months
@Teknium1 they cracked this benchmark. not sure how. but there is no way it's actually that good. especially when looking at coding, it's 100% gamed
2
0
48
@casper_hansen_
Casper Hansen
5 months
It seems Mistral will finally be sharing how to finetune their MoE models 👀 As many finetuners know, there have been issues getting the Mixtral model to finetune well ever since it was released, so I am excited to see Mistral enable us
Tweet media one
2
7
46
@casper_hansen_
Casper Hansen
4 months
Why are companies still using needle-in-a-haystack to showcase their long context length? This particular benchmark is like 1/10 useful.
@Gradient_AI_
Gradient
4 months
We’re going back 2 back! 🔥 Introducing the first 1M context window @AIatMeta Llama-3 70B to pair with our Llama-3 8B model that we launched last week on @huggingface . Our 1M context window 70B model landed a perfect score on NIAH and we’re excited about the results that
Tweet media one
40
112
659
15
0
44
@casper_hansen_
Casper Hansen
1 month
Why is @MistralAI 's Large 2 not listed on @lmsysorg ? I just used it today and it was the only model that could refactor part of complex code into a separate function
4
0
44
@casper_hansen_
Casper Hansen
8 months
These results (2.66x higher throughput) were unexpected at first. But it kind of makes sense given that vLLM uses blocked KV-cache, meaning it has to process context before it can continue decoding.
3
9
43
@casper_hansen_
Casper Hansen
2 months
@Teknium1 That coincides with the ICML talk that @soumithchintala will be giving at 9am
Tweet media one
3
1
36
@casper_hansen_
Casper Hansen
2 months
Llama 3 405B AWQ: ASCII chart of vLLM performance with the new kernels on 4x H100. We can gracefully process so many tokens/second thanks to hyper-optimized kernels from @neuralmagic .
Tweet media one
4
6
36
@casper_hansen_
Casper Hansen
5 months
Llama 3 MMLU:
- 70B: 82% (beating Gemini Pro 1.5, Claude 3 Sonnet)
- 8B: 68.4% (beating Gemma 7B, Mistral 7B)
Tweet media one
5
3
33
@casper_hansen_
Casper Hansen
1 month
@ziv_ravid It's actually very simple. First you train the dense model, then you do sparse upcycling to get an MoE. So the next step for Mistral is to train 8x123B.
3
1
30
@casper_hansen_
Casper Hansen
5 months
@JeremyNguyenPhD Part of me feels this should be normalized to the count of papers published
2
0
31
@casper_hansen_
Casper Hansen
3 months
AutoAWQ now supports CPU inference (x86). This was contributed directly by Intel. Speed may vary - Intel pushed tokens/s to 18 on the Mistral 7B model on a high-end CPU. High clock speed and high memory bandwidth are key to good performance on CPUs.
Tweet media one
3
4
29
@casper_hansen_
Casper Hansen
6 months
Mistral Large was just released along with Le Chat. MMLU: 81.2% (GPT-4: 86.4%). I think this is incredibly close and I bet I will find their models more useful for coding.
Tweet media one
5
3
26
@casper_hansen_
Casper Hansen
6 months
It seems likely to me that Mistral was trained on 4T tokens, based on:
- Gemma was trained on 6T tokens
- Yi was trained on 3T tokens
Tweet media one
2
0
28
@casper_hansen_
Casper Hansen
7 months
The new AWQ kernel from SqueezeBits, "QUICK", demonstrates up to 1.94x higher throughput in vLLM.
Tweet media one
1
0
28
@casper_hansen_
Casper Hansen
4 months
This is the first good model from China that I have seen released under the Apache 2.0 license! Is this a mistake or the new way going forward? The Yi team also smashed the evaluations; their models are very performant.
@_philschmid
Philipp Schmid
4 months
First Apache 2.0 release of YI! 😍 @01AI_Yi 1.5 is a continuously pre-trained version of YI 1.0 on 500B tokens to improve coding, reasoning, and instruction-following capability. It comes in 3 sizes, with the biggest Yi 1.5 34B almost matching Meta Llama 3 70B on benchmarks! 🤯
Tweet media one
12
65
253
3
0
27
@casper_hansen_
Casper Hansen
5 months
Databricks released a model better than Mixtral. Highlights:
- 73.7% MMLU, beats Mixtral, loses to Claude Haiku
- MoE with 132B params, 36B active
- Trained on 12T tokens
Tweet media one
1
3
26
@casper_hansen_
Casper Hansen
5 months
This is absolutely huge. The open-source world has needed this dataset for a long time and now it’s here!
@gui_penedo
Guilherme Penedo
5 months
We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!
Tweet media one
40
345
2K
0
2
26
@casper_hansen_
Casper Hansen
5 months
Llama 3 8B Instruct - quantized with AWQ.
2
2
27
@casper_hansen_
Casper Hansen
2 months
Quantization makes incredible sense as models scale in parameters. DeepSeek V2 (236B) requires 2x80GB in 4 bits. Or a single AMD MI300X GPU with 192GB VRAM. @TensorWaveCloud and @runpod_io seem to be great options for AMD hardware as you can avoid running a full 8x A100.
1
3
26
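The back-of-the-envelope arithmetic behind that sizing, counting weights only (KV cache, activations, and quantization metadata add more on top):

```python
# Weight-only memory estimate for DeepSeek V2 (236B parameters).
params = 236e9
for bits in (16, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{gb:.0f} GB")
# 16-bit: ~472 GB (many GPUs); 4-bit: ~118 GB, which fits in 2x80 GB
# with room for KV cache, or on a single 192 GB MI300X.
```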
@casper_hansen_
Casper Hansen
4 months
@JustinLin610 The general problem is custom licenses because they need to be evaluated by legal departments. It would be easier if models adopted a fairly standard license like Apache 2.0 or MIT because they are generally recognized as safe/usable.
1
0
26
@casper_hansen_
Casper Hansen
9 days
@huybery I think you need... a vision-language model... to decode this blurred image 😇
1
0
24
@casper_hansen_
Casper Hansen
6 months
@francoisfleuret This one is recent: MIT licensed and beats GPT-4V
0
0
23
@casper_hansen_
Casper Hansen
6 months
I wonder how long it will be before someone cashes in the $1200 bounty on Mixtral by importing vLLM's MoE Triton kernels into axolotl (and implementing the backward pass). @jeremyphoward @winglian Any thoughts on what is needed here to run a backward pass?
2
2
21
@casper_hansen_
Casper Hansen
2 months
@natolambert I agree with you here. Just exciting to see Apple out of everyone contributing :)
2
0
23
@casper_hansen_
Casper Hansen
5 months
The new AWQ kernels are so much faster at decoding. This is a simple benchmark on an RTX 4090 using the AutoAWQ/transformers backend. It scales incredibly well - 64% faster at batch size 32 according to my numbers.
Tweet media one
3
1
22
@casper_hansen_
Casper Hansen
1 month
Sharing for awareness: AWQ models may have produced lower-accuracy responses when using the Marlin kernels due to a precision issue (which is now being fixed). For example, GSM8K went from 58% to 73% accuracy after the fix by neuralmagic.
Tweet media one
3
3
23
@casper_hansen_
Casper Hansen
7 months
Realization: With the AMD MI300's 5.2 TB/s of memory bandwidth, you can get Groq-level speed with a fully optimized engine. Now begins the period where AMD slowly eats into Nvidia's inference market share. Inference is already compatible with vLLM.
4
2
22
@casper_hansen_
Casper Hansen
3 months
Releasing Llama 3 400B will be such a power move from Meta business-wise. If the model is as good as advertised, which we have good reason to believe it is, then Meta will weaken OpenAI, Anthropic, Google, and Microsoft. I wonder if they will release specialized quantization with it.
5
4
22
@casper_hansen_
Casper Hansen
6 months
Llama 3 in July, they had a year to train it. My expectations are that they beat GPT-4 but that we will have GPT-5 by that time.
3
0
22
@casper_hansen_
Casper Hansen
2 months
@intrstllrninja You can also find this linked on the model page in the collection for DCLM
0
2
22
@casper_hansen_
Casper Hansen
7 months
A more detailed view of the speedup (2.84x if you have models with very long inputs 👀).
Tweet media one
2
3
21
@casper_hansen_
Casper Hansen
7 months
@erhartford @OpenAI @Google @MistralAI @AIatMeta @xai @elonmusk Wouldn't it be embarrassing to be unable to drop a better 7B model than Mistral though? They would get absolutely slaughtered if they couldn't. One scenario is that GPT-5 drops and a small version of GPT-3.5 is open-sourced (it's not making them money anyway).
3
0
21
@casper_hansen_
Casper Hansen
5 months
8B and 70B AWQ are out
4
4
20
@casper_hansen_
Casper Hansen
2 months
After almost a year of working on quantizing large language models, I look back both proud and amazed that so much could happen during my time in the open-source world:
- Google: offers models quantized using AutoAWQ on Google Cloud. (1/2)
Tweet media one
2
1
20
@casper_hansen_
Casper Hansen
6 months
Deduplication is sort of well-known by now. But where do we find implementations of this data pipeline, specifically before and after deduplication?
Tweet media one
2
3
19
@casper_hansen_
Casper Hansen
1 month
Honestly amazing numbers for AWQ since the Llama 3.1 release. Seems most people are trying out the 70B, but a good number of people are also going for the 405B, which is impressive!
- 8B AWQ: 4676 downloads (28.7%)
- 70B AWQ: 9735 downloads (59.8%)
- 405B AWQ: 1872 downloads (11.5%)
1
2
19
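The percentages are simply each size's share of the total AWQ downloads; a quick check:

```python
# Each size's share of the total AWQ downloads from the tweet above.
downloads = {"8B": 4676, "70B": 9735, "405B": 1872}
total = sum(downloads.values())
for size, n in downloads.items():
    print(f"{size} AWQ: {n} downloads ({n / total:.1%})")
# -> 28.7%, 59.8%, 11.5% of 16,283 total AWQ downloads
```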
@casper_hansen_
Casper Hansen
4 months
Full evaluation of GPT-4o: 88.7% MMLU, 90.2% HumanEval, 53.6% GPQA Can you feel the steamroll?
Tweet media one
5
6
18
@casper_hansen_
Casper Hansen
2 months
It's merged! Now we're waiting on a release, and I will put in a PR to update the vLLM docs on how to properly use it.
@casper_hansen_
Casper Hansen
2 months
AWQ will soon run up to 2.59x faster in vLLM on large models thanks to @neuralmagic
Tweet media one
5
16
107
1
0
19
@casper_hansen_
Casper Hansen
1 month
Most people don't know this about me... I used to blog about machine learning basics with millions of readers. Just like the nugget below, but vastly expanded with code and walkthroughs of exactly how to do it from scratch. Now that blogging is dead, X feels like a good medium.
@casper_hansen_
Casper Hansen
1 month
It's not widely understood how @MistralAI make their MoE models. Just putting this out there:
- train dense model, e.g. 7B, 22B, 123B
- use sparse upcycling to create MoE from dense model
- train MoE
Benefits include stable training of the MoE.
Tweet media one
7
17
142
3
1
19
@casper_hansen_
Casper Hansen
1 month
Insane statistics on Llama 3.1 downloads. Growing adoption of quantization is showing in the numbers!
- 70B: 119k (24% were AWQ)
- 405B: 29k (11.5% were AWQ)
This is without subtracting anything related to training or testing, just raw numbers on the instruct models.
Tweet media one
2
2
18
@casper_hansen_
Casper Hansen
2 months
@lmsysorg i am highly skeptical that gpt-4o mini is actually this good for coding.
Tweet media one
0
0
18
@casper_hansen_
Casper Hansen
4 months
@BasedBeffJezos The US is not ahead of China in AI. That’s an illusion that we like to believe. The reality is that China is beating the US in AI modeling every day of the week, but they focus on Chinese and not English.
7
3
17
@casper_hansen_
Casper Hansen
2 months
AMD just acquired the largest AI lab in Europe. I worked with them before; it's essentially a consultancy. They trained an LLM, Viking 13B, a Scandinavian model trained on 4096 AMD MI250X GPUs.
0
4
18
@casper_hansen_
Casper Hansen
6 months
@Carnage4Life I mean Apple is kind of right. They are giving Spotify a full ecosystem and a lot of support, but see no monetary gain from it.
14
0
18
@casper_hansen_
Casper Hansen
5 months
@erhartford Having tested both models, I feel like DBRX is sort of an unenthusiastic instruct model. It should have huge potential for finetuning to create a model that's actually interesting though - let's not forget it saw 12T tokens
1
1
18
@casper_hansen_
Casper Hansen
5 months
Another Phi model that looks great on benchmarks but feels “meh”. I think a lot of people will be disappointed, especially after the claims of beating Llama 3, Mixtral, and others.
5
2
17
@casper_hansen_
Casper Hansen
7 months
Over the last couple of months, many new features have been introduced to AutoAWQ 🚀 I am gearing up for a v0.2.0 release with the following features:
- PEFT training
- 60% faster prefilling for GEMM
- ExLlamaV2 + AMD support
- GGUF export
3
2
16
@casper_hansen_
Casper Hansen
4 months
Why are people still using RoPE or PoSE when we have infini-attention?
4
2
16
@casper_hansen_
Casper Hansen
5 years
Explaining Feedforward, Backpropagation and Optimization: The Math Explained Clearly with Visualizations.
1
5
13
@casper_hansen_
Casper Hansen
1 month
AWQ Mistral Large Instruct (123B) is in the pipeline with compute support from @runpod_io . It will be automatically uploaded to HuggingFace once the quant is finished. Should be under 24 hours! 🤞
Tweet media one
2
0
17
@casper_hansen_
Casper Hansen
6 months
Remember the Google "We Have No Moat, And Neither Does OpenAI" article? As it turns out, Google is the company with the real moat: compute. Google is increasing its compute with TPUs exponentially, whereas OpenAI's growth is sublinear.
6
2
16
@casper_hansen_
Casper Hansen
7 months
A good system prompt is all you need
0
6
16
@casper_hansen_
Casper Hansen
1 month
LMSys releases SGLang v2.0 and claims big speedups.
- Incredibly efficient batch scheduler
- Can continue to scale the throughput with larger batch sizes
- Other systems cannot scale their throughput or batch sizes due to OOM, missing extensive manual tuning, or other overheads
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
16
@casper_hansen_
Casper Hansen
3 months
Karpathy's new LLM101n syllabus includes a chapter on quantization. Which algorithm do you think he will cover? Will he simply go for FP8?
Tweet media one
1
1
16
@casper_hansen_
Casper Hansen
5 months
3/5 CodeAct is open-source: dataset, model, code, UI. Observe, Plan, Act, Iterate. CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions.
Tweet media one
1
2
15
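As a mental model of the Observe, Plan, Act, Iterate loop described above, here is a generic sketch (my illustration, not CodeAct's actual implementation); `llm` and `execute_code` are assumed callables for the model and a sandboxed interpreter.

```python
# Generic observe-plan-act-iterate loop in the spirit of CodeAct (a sketch only):
# the LLM emits code as its "action", the environment executes it, and the output
# is fed back as the next observation.
def run_agent(llm, execute_code, task: str, max_turns: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = llm(history)                      # plan + act: code block or final answer
        if not action.strip().startswith("```"):   # convention: non-code reply means done
            return action
        observation = execute_code(action)          # act: run the emitted code in a sandbox
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "Stopped after max_turns without a final answer."
```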
@casper_hansen_
Casper Hansen
6 months
Grok-1 is a significant release and should not be underestimated. Other companies will save millions of $ in pretraining since Grok is likely undertrained and needs continued pretraining.
1
1
14
@casper_hansen_
Casper Hansen
5 months
@_philschmid @OpenAI Feels a little strange to only report MT-Bench
1
0
14
@casper_hansen_
Casper Hansen
1 month
@kimmonismus Some details you missed on Quiet-STaR:
- it has a DQN reward in its loss function
- Quiet-STaR finds the best generation path; you can think of each step in the generation path as a state
- the loss function is negative log-likelihood + REINFORCE (borrowing from STaR)
Tweet media one
1
1
15
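Schematically, and as my own paraphrase rather than the paper's exact notation, the combined objective looks something like the following, where the reward r_j measures how much a sampled thought improves the likelihood of the next few ground-truth tokens relative to a baseline:

```latex
% Schematic only -- a paraphrase of the Quiet-STaR objective, not the paper's notation.
% L_NLL is the usual next-token loss; the REINFORCE term rewards sampled thoughts t_j
% by how much they improve the likelihood of the next n ground-truth tokens.
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{NLL}}(\theta)
  + \lambda\, \mathbb{E}_{t_j \sim p_\theta}\!\bigl[-\, r_j \log p_\theta(t_j \mid x_{\le j})\bigr],
\qquad
r_j = \log p_\theta\!\left(x_{j+1:j+n} \mid x_{\le j}, t_j\right)
    - \overline{\log p_\theta\!\left(x_{j+1:j+n} \mid x_{\le j}\right)}.
```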
@casper_hansen_
Casper Hansen
2 months
So how is that supposed to work? What is going to prevent someone from downloading the weights from Europe?
@AndrewCurran_
Andrew Curran
2 months
From this: - Llama 4 started training in June - Llama 4 will be fully multimodal, including audio - Llama 3 405b will still release in the EU - Llama 4, and beyond, will not be released in the EU unless they allow META to use EU Facebook and Instagram posts as training data.
Tweet media one
37
139
863
3
0
15
@casper_hansen_
Casper Hansen
8 months
Thanks to @erhartford , @ldjconfirmed and @jon_durbin for building true open source, each in a unique way. 📘 Finetuning is a crowded space. You guys make a huge difference with open sourced code, models, datasets, and weights.🟢
3
2
14
@casper_hansen_
Casper Hansen
4 months
Almost 9k downloads in 9 days. Thanks to everyone using AWQ to maximize speed & quality!
Tweet media one
2
0
14
@casper_hansen_
Casper Hansen
5 months
Each time Huggingface is down, it’s a reminder of how dependent we are on them. Huggingface is great but we need more decentralization that is plug and play like the HF ecosystem.
1
0
13
@casper_hansen_
Casper Hansen
6 months
Claude 3 Opus is getting incredibly close to a PhD level of knowledge
@idavidrein
david rein
6 months
Claude 3 gets ~60% accuracy on GPQA. It's hard for me to understate how hard these questions are—literal PhDs (in different domains from the questions) with access to the internet get 34%. PhDs *in the same domain* (also with internet access!) get 65% - 75% accuracy.
Tweet media one
23
215
1K
2
1
14
@casper_hansen_
Casper Hansen
5 years
@iamtrask Lesson learned, now I know how to write titles for my blogposts :)
0
0
13