Woosuk Kwon Profile
Woosuk Kwon

@woosuk_k

2,694 Followers
442 Following
2 Media
148 Statuses

PhD student at @Berkeley_EECS building @vllm_project

Joined April 2023
@woosuk_k
Woosuk Kwon
1 month
Developing @vllm_project taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU. Over the past month, the vLLM community conducted an in-depth study and made key optimizations, leading to significant
@vllm_project
vLLM
1 month
A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and 5x lower output latency on Llama 8B, and 1.8x higher throughput and 2x lower latency on Llama 70B, on H100s.
14
69
379
8
24
253
@woosuk_k
Woosuk Kwon
11 months
We’ve just released a new blog post comparing vLLM with DeepSpeed-FastGen. While we are happy to see the open-source advancements from the DeepSpeed team, our more extensive performance benchmarks gave different results: vLLM is actually faster than DeepSpeed in
3
30
208
@woosuk_k
Woosuk Kwon
1 year
Exciting news! 🎉Our PagedAttention paper is now up on arXiv! Dive in to learn why it's an indispensable technique for all major LLM serving frameworks. @zhuohan123 and I will present it at @sospconf next month. Blog post: Paper:
2
34
186
@woosuk_k
Woosuk Kwon
2 months
This Wednesday (8/21) I will be speaking about the diverse hardware support in vLLM, with a focus on AMD GPUs and Google TPUs. Sign up to learn more about vLLM!
2
18
183
@woosuk_k
Woosuk Kwon
4 months
Gemma 2 is also available in vLLM! 🎉 Check out the update in the main branch and stay tuned for the next release coming soon
@GoogleDeepMind
Google DeepMind
4 months
We're excited to unveil Gemma 2. 🛠️ Available in both 9B and 27B parameters, it delivers the best performance for its size - unlocking more possibilities for developers to build and deploy with AI. →
30
222
1K
3
23
139
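For reference, a minimal sketch of loading Gemma 2 through vLLM's offline Python API. The model ID, sampling settings, and the FlashInfer attention-backend hint are assumptions for illustration (the backend requirement is only mentioned in a later reply further down this feed), not part of the announcement above.

import os

# Assumption: the FlashInfer backend is recommended for Gemma 2's logits soft-capping.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")  # hypothetical choice of checkpoint
params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate(["Explain KV-cache paging in one sentence."], params)[0].outputs[0].text)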
@woosuk_k
Woosuk Kwon
10 months
In vLLM v0.2.6, we've introduced CUDA/HIP graphs for faster model execution, and added GPTQ support (finally!). More optimizations and features are coming... so stay tuned!
2
17
125
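A rough sketch of what the new GPTQ support looks like from the user side, assuming a GPTQ-quantized checkpoint from the Hugging Face Hub (the model ID below is a placeholder, not from the tweet):

from vllm import LLM, SamplingParams

# quantization="gptq" loads GPTQ-quantized weights; CUDA/HIP graphs are used by
# default for decoding (pass enforce_eager=True to fall back to eager execution).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)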
@woosuk_k
Woosuk Kwon
10 months
vLLM + AMD MI300X = Blazingly-fast LLM serving! 🚀🚀🚀
@AMD
AMD
10 months
Update: Let's look at some new inference performance data on AMD Instinct MI300X
8
50
249
0
8
84
@woosuk_k
Woosuk Kwon
10 months
Check out Mistral's official inference code in vLLM!
@zhuohan123
Zhuohan Li
10 months
Excited to have first-hand official support for the Mixtral MoE model in vLLM from @MistralAI ! Get started with Mixtral on the latest vLLM now: . Be sure to check out their announcement blog: Joint work with @woosuk_k @PierreStock
0
5
73
4
4
66
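A minimal sketch of running Mixtral with vLLM once this support landed; the two-GPU tensor-parallel setting and the prompt are illustrative assumptions, not from the tweets above.

from vllm import LLM, SamplingParams

# Mixtral-8x7B is a sparse MoE model; tensor_parallel_size shards it across GPUs.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
outputs = llm.generate(["[INST] What is PagedAttention? [/INST]"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)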
@woosuk_k
Woosuk Kwon
8 months
The new vLLM release includes some optimizations for Gemma and Mixtral, and finally supports 8-bit GPTQ. Please give it a try!
@simon_mo_
Simon Mo
8 months
vLLM v0.3.3 is released with Starcoder2 @BigCodeProject and Inferentia @awscloud support. I'm also excited about the addition of guided decoding* (JSON, regex) in the server, leveraging @OutlinesOSS . *Experimental: the schema takes some time to compile but will be cached.
2
24
92
0
8
66
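To make the guided-decoding mention above concrete, here is a hedged sketch using the extra_body fields of vLLM's OpenAI-compatible server; the server address, model, and JSON schema are assumptions for illustration.

from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Invent a fictional person as JSON."}],
    extra_body={"guided_json": schema},  # guided_regex is also supported
)
print(resp.choices[0].message.content)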
@woosuk_k
Woosuk Kwon
3 months
We've partnered with @AIatMeta to support the 405B model from Day 1. Enjoy!
@vllm_project
vLLM
3 months
🚀 Exciting news! In partnership with @AIatMeta , vLLM officially supports Llama 3.1! 🦙✨ For Llama 3.1 405B, vLLM supports FP8 quantization on a single machine and pipeline parallelism for multi-node serving. Learn more in our latest blog post:
1
18
84
0
2
48
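A hedged single-node sketch of the FP8 path described above, assuming an 8-GPU H100 machine; multi-node serving would additionally use pipeline parallelism via the serving frontend. The GPU count and prompt are assumptions.

from vllm import LLM, SamplingParams

# Assumption: the FP8 checkpoint carries its quantization config, so vLLM picks it up
# automatically; tensor_parallel_size=8 shards the 405B model across one 8-GPU node.
llm = LLM(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)
print(llm.generate(["Summarize PagedAttention in one line."], SamplingParams(max_tokens=64))[0].outputs[0].text)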
@woosuk_k
Woosuk Kwon
3 months
The Linux Foundation is home to many important open source projects like Linux and PyTorch. Today we are excited to announce that @vllm_project is joining @LFAIDataFdn , an AI-focused sub-foundation under the Linux Foundation. vLLM will stay open and trusted!
@vllm_project
vLLM
3 months
Two exciting updates! * vLLM is already widely adopted, and we want to ensure it has open governance and longevity. We are beginning the process of joining @LFAIDataFdn ! * We are doubling down on performance. Please check out our roadmap.
0
14
88
3
4
44
@woosuk_k
Woosuk Kwon
7 months
We finally made a Twitter account for vLLM @vllm_project ! Please follow this account for the latest updates on vLLM!
@cdnamz
Cade Daniel 🇺🇸
7 months
vLLM made a Twitter account! Go give them a follow: @vllm_project . And a fun vLLM meetup, btw! Thanks for hosting, @Roblox
3
8
30
2
5
40
@woosuk_k
Woosuk Kwon
10 months
We've just released v0.2.5, which includes this performance improvement (contributed by Antoni at @anyscalecompute ). Please try it out!
@mattshumer_
Matt Shumer
10 months
Looks like Mixtral on vLLM is about to get a LOT faster
5
22
226
0
4
38
@woosuk_k
Woosuk Kwon
10 months
Great news! Phi-2 also works with vLLM and greatly benefits from our newest integration with CUDA graphs. Give it a try on vLLM!
@ClementDelangue
clem 🤗
10 months
Phi-2 by @MicrosoftAI is now the #1 trending model on @huggingface (). 2024 will be the year of smoll AI models!
12
63
429
0
3
34
@woosuk_k
Woosuk Kwon
2 months
Always happy to see this kind of open source release. Thanks for the huge contributions to the community!
@hsu_byron
Byron Hsu
2 months
(1/n) Training LLMs can be hindered by out-of-memory errors when scaling batch size and sequence length. Add one line to boost multi-GPU training throughput by 20% and reduce memory usage by 60%. Introducing Liger-Kernel: Efficient Triton Kernels for LLM Training.
21
170
972
2
1
30
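For context on the "add one line" claim in the quoted tweet, a hedged sketch of how Liger-Kernel is typically patched into a Hugging Face Llama training setup; the patch-function name reflects Liger-Kernel's documented API as I understand it, and the model ID is illustrative.

from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# The one-line patch: swaps Liger's fused Triton kernels (RMSNorm, RoPE, SwiGLU,
# fused cross-entropy, ...) into Llama-architecture models loaded afterwards.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ... continue with the usual Trainer / training loop.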
@woosuk_k
Woosuk Kwon
9 months
Thanks for your continued contributions to vLLM! Really happy to have you in our community :)
@casper_hansen_
Casper Hansen
9 months
AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗
9
25
193
0
3
29
@woosuk_k
Woosuk Kwon
6 months
We are still actively looking for new reviewers! Please feel free to join us!
@vllm_project
vLLM
6 months
We are doubling our committer base for vLLM to ensure it is best-in-class and truly a community effort. This is just a start. Let's welcome @KaichaoYou , @pcmoritz , @nickhill33 , @rogerw0108 , @cdnamz , @robertshaw21 as committers and thank you for your great work! 👏
2
4
32
0
0
27
@woosuk_k
Woosuk Kwon
1 year
Please come see us at Ray Summit! @zhuohan123 and I will talk about vLLM, a state-of-the-art open-source LLM inference engine, which actually uses #Ray for distributed inference. Don't miss it!
@robertnishihara
Robert Nishihara
1 year
Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in
8
45
207
1
1
24
@woosuk_k
Woosuk Kwon
9 months
Super cool! This seems like the ultimate library everyone has wanted to have. Great work @ye_combinator !!
@ye_combinator
Zihao Ye
9 months
(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include: - Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page
2
82
298
2
0
21
@woosuk_k
Woosuk Kwon
1 year
Make LLM serving easy and fast with vLLM!
@zhuohan123
Zhuohan Li
1 year
🌟 Thrilled to introduce vLLM with @woosuk_k ! 🚀 vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena. Github: Blog:
20
254
1K
0
5
22
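For readers landing here from the announcement, a minimal sketch of vLLM's offline batched-inference API, mirroring the project's getting-started example; the prompts and sampling settings are illustrative.

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# vLLM batches these requests together and schedules them with continuous batching
# on top of PagedAttention-managed KV-cache blocks.
llm = LLM(model="facebook/opt-125m")
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)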
@woosuk_k
Woosuk Kwon
1 month
The vLLM meetup with NVIDIA is happening now! Join us and learn about our latest updates!
@NVIDIAAIDev
NVIDIA AI Developer
2 months
Dive into the latest in #AI with expert talks, updates, and networking. Join us for the @vllm_project & NVIDIA Triton User Meetup. 📅 Mon, Sept 9, 4-9 PM 📍Gallery 308, Fort Mason, SF. Secure your spot now ➡️ ✨
1
11
49
0
0
18
@woosuk_k
Woosuk Kwon
7 months
Our 3rd meetup is happening on April 2nd. Come join us!
@simon_mo_
Simon Mo
7 months
The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will be discussing feature updates and hearing from you! We thank @Roblox for hosting the event!
0
7
27
2
1
18
@woosuk_k
Woosuk Kwon
5 months
Join us for our 4th meetup on June 11th. We hope to see you there!
@vllm_project
vLLM
5 months
We are holding the 4th vLLM meetup at @Cloudflare with @bentomlai on June 11. Join us to discuss what's next in production LLM serving! Register at
0
8
25
1
2
17
@woosuk_k
Woosuk Kwon
1 year
Excited to see that our PagedAttention algorithm is being adopted! Looking forward to it!
@NVIDIAAIDev
NVIDIA AI Developer
1 year
Just announced - NVIDIA TensorRT-LLM supercharges large language model #inference on NVIDIA H100 Tensor Core GPUs. #LLM
15
48
156
0
3
16
@woosuk_k
Woosuk Kwon
1 year
Getting and managing GPUs in the cloud has grown increasingly challenging. Check out our latest blog post with @skypilot_org to discover how SkyPilot makes the development and deployment of vLLM easier!
@skypilot_org
SkyPilot
1 year
UC Berkeley's vLLM + SkyPilot speeds up LLM serving by 24x 🤩 Our user blog post covers how SkyPilot combated GPU availability issues for #vLLM , allowing the team to focus on AI and not infra. (Also includes a 1-click guide to run it on your own cloud account!)
1
18
80
0
0
16
@woosuk_k
Woosuk Kwon
2 months
The next vLLM Meetup will be hosted with NVIDIA! Please join us on September 9th!
@vllm_project
vLLM
2 months
Join us on Monday, September 9 at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with NVIDIA Triton team.
0
4
28
0
0
15
@woosuk_k
Woosuk Kwon
1 year
Please join our first meetup and share your experience with vLLM!
0
1
13
@woosuk_k
Woosuk Kwon
4 months
Check out the improved multi-modality support in @vllm_project by @rogerw0108 !
@rogerw0108
Roger Wang
4 months
More exciting news around multi-modality in the upcoming v0.5.1 @vllm_project release! With a much simpler interface, vLLM will now support dynamic image embedding shapes for models such as LLaVA-NeXT and Phi-3-Vision!
4
12
51
0
1
13
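A hedged sketch of the simplified multi-modal interface the quoted tweet describes, assuming a LLaVA-NeXT checkpoint and vLLM's multi_modal_data input; the exact prompt template depends on the model and is an assumption here.

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf")  # illustrative LLaVA-NeXT checkpoint
image = Image.open("example.jpg")  # hypothetical local image

outputs = llm.generate(
    {
        "prompt": "[INST] <image>\nWhat is shown in this picture? [/INST]",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)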
@woosuk_k
Woosuk Kwon
1 month
Try out Pixtral with vLLM!
@vllm_project
vLLM
1 month
🖼️ pip install -U vLLM
vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens 16384
5
33
320
0
0
13
@woosuk_k
Woosuk Kwon
1 year
Check out this wonderful article from @anyscalecompute ! It shows vLLM delivers a 23x speedup on OPT-13B!
@cdnamz
Cade Daniel 🇺🇸
1 year
I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at
2
52
245
0
2
13
@woosuk_k
Woosuk Kwon
3 months
We will have the 5th vLLM meetup with AWS next Wednesday! Please join us and learn more about our recent progress!
@vllm_project
vLLM
3 months
We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress on FP8, pipeline parallelism, and various performance work. Space is limited to 150, so please register ASAP!
1
8
18
0
1
13
@woosuk_k
Woosuk Kwon
6 months
Congrats!!!
@Michaelvll1
Zhanghao Wu
6 months
I am honored to share that our recent paper won the Outstanding Paper Award at NSDI’24! The paper explores the policy design behind @skypilot_org 's managed spot instances: Can’t Be Late: Optimizing Spot Instance Savings under Deadlines. It would not be possible, if it were not
10
6
90
0
0
13
@woosuk_k
Woosuk Kwon
9 months
Please check out our second meetup on Jan 31st!!
@simon_mo_
Simon Mo
9 months
We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us.
3
9
36
1
1
11
@woosuk_k
Woosuk Kwon
7 months
@HamelHusain Why not use vLLM? :)
2
1
10
@woosuk_k
Woosuk Kwon
10 months
Congratulations to everyone in the second batch! 🔥🔥🔥
@BornsteinMatt
Matt Bornstein
10 months
We're announcing the second batch of @a16z open source AI grants today This cohort focuses on: ▶️ tools for LLM training/ hosting/ evals ▶️ visual AI models & communities Thank you to the grantees for your contributions! More info in the linked post
15
41
240
0
0
7
@woosuk_k
Woosuk Kwon
2 months
Welcome!
@neuralmagic
Neural Magic
2 months
🎉 Exciting news! Tyler Smith, one of our many talented engineers, is now Neural Magic's 3rd vLLM project committer! Check out Tyler's contributions: . We’re proud to be a leading contributor to @vllm_project . 🚀 Cheers to Tyler and the team!
2
5
26
0
0
5
@woosuk_k
Woosuk Kwon
11 months
For those who cannot access the link, please try this one instead:
1
0
5
@woosuk_k
Woosuk Kwon
6 months
Check out @luo_michael1234 's awesome work on picking the best LoRA adapters to create crisp images!
@luo_michael1234
Michael Luo
6 months
[1/5] Introducing Stylus 🖌️ - an #AI tool that automatically finds and adds the best adapters (LoRAs, Textual Inversions, Hypernetworks) to #StableDiffusion based on your prompt. 🗞️ Paper: 🌎 Project Page:
7
34
92
0
0
5
@woosuk_k
Woosuk Kwon
10 months
@andrey_cheptsov Yes! `pip install vllm megablocks` is all you need. 🚀🚀🚀
0
0
2
@woosuk_k
Woosuk Kwon
3 months
@vsreekanti @RunLLM @vllm_project It's so useful! It not only covers the docs, but also links to our GitHub issues! Thanks for adding it to vLLM!
0
1
3
@woosuk_k
Woosuk Kwon
1 year
@Teknium1 @zhuohan123 You are right. The majority of vLLM's speedup comes from batching more prompts in every run. However, you can get some speedup even in a single-user environment, because vLLM also includes some other optimizations orthogonal to PagedAttention.
0
0
3
@woosuk_k
Woosuk Kwon
1 year
@b_arbaretier @zhuohan123 @lmsysorg Thanks for your interest and great question! Currently, we don't compile the models. We're exploring whether torch.compile is suitable for us, or whether we can enhance performance by optimizing the code for individual models.
1
0
3
@woosuk_k
Woosuk Kwon
11 months
Amazing!! 🤣🤣
@rajko_rad
Rajko Radovanović
11 months
Image conditioning on @ideogram_ai is awesome! 3D render of re-imagined vLLM logo below :) cc @zhuohan123 @woosuk_k
0
2
10
0
0
3
@woosuk_k
Woosuk Kwon
2 months
@natolambert @mgoin_ @vllm_project You can pip install the nightly version!
export VLLM_VERSION=0.5.3.post1
export VLLM_COMMIT=16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8
pip install ${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
1
0
2
@woosuk_k
Woosuk Kwon
3 months
@natolambert Could you please share more about the error you ran into? FlashInfer is actually required for Gemma 2. You can install it with `pip install `
1
0
2
@woosuk_k
Woosuk Kwon
6 months
@ziming_mao @anuragk_ Amazing!! Congrats!! 👊👊
0
0
1
@woosuk_k
Woosuk Kwon
28 days
@KaichaoYou Congrats!!!
0
0
1