Woosuk Kwon @woosuk_k Twitter profile

Last Seen Profiles

@ConchitaWFans

@___syafiq

@enbmyeon

@BeeEssLOVE64

@padul_ana24821

@lukekarmali

@namino_nico

@cbcbvbcb

@JulietteMlt3

@ScharlauLeather

@body_kaka99268

@dylbarl031

@enayardb

@azif_offl2

@Chiaki048735730

@cammyclark97

@OorBilly1872

@galery_basah10

@Icatcam

@JohnStossel

@enamharisekawan

@ArmstrongA48081

@tammywcho

@TohsakaPsycho

@en0rma_stits

@jsalcedo2702

@O14Arm

@sophia_fffff

@KaiMADAOZen

@peaceandreason

@JasonFitz1

@BobYk95

@emteedlol

@alvefagh

@csplgbtqia

@NIESRorg

Woosuk Kwon

@woosuk_k

1 month

Developing @vllm_project taught me a tough lesson: to keep the GPU fully utilized, we need to pay close attention to everything happening on the CPU. Over the past month, the vLLM community conducted an in-depth study and made key optimizations, leading to significant

vLLM

@vllm_project

1 month

A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s.

14

69

379

8

24

253

Woosuk Kwon

@woosuk_k

11 months

We’ve just released a new blog post comparing vLLM with DeepSpeed-FastGen. While we are happy to see the open-source technology advancements from the DeepSpeed team, we’ve got different results with more extensive performance benchmarks. vLLM is actually faster than DeepSpeed in

3

30

208

Woosuk Kwon

@woosuk_k

1 year

Exciting news! 🎉Our PagedAttention paper is now up on arXiv! Dive in to learn why it's an indispensable technique for all major LLM serving frameworks. @zhuohan123 and I will present it at @sospconf next month. Blog post: Paper:

Efficient Memory Management for Large Language Model Serving with...

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for...

arxiv.org

2

34

186

Woosuk Kwon

@woosuk_k

2 months

This Wednesday (8/21) I will be speaking about the diverse hardware support in vLLM, with a focus on AMD GPUs and Google TPUs. Sign up to learn more about vLLM!

2

18

183

Woosuk Kwon

@woosuk_k

4 months

Gemma 2 is also available in vLLM! 🎉 Check out the update in the main branch and stay tuned for the next release coming soon

[Model] Add Gemma 2 by WoosukKwon · Pull Request #5908 · vllm-project/vllm

This PR adds Gemma 2, a new family of open LLMs from Google. Two major issues to note: Attention logit soft-capping: Gemma 2 models soft-cap the attention logits. This requires changes to all the ...

github.com

Google DeepMind

@GoogleDeepMind

4 months

We're excited to unveil Gemma 2. 🛠️ Available in both 9B and 27B parameters, it delivers the best performance for its size - unlocking more possibilities for developers to build and deploy with AI. →

30

222

1K

3

23

139

Woosuk Kwon

@woosuk_k

10 months

In vLLM v0.2.6, we've introduced CUDA/HIP graph for faster model execution, and added GPTQ support (finally!). More optimizations and feature are coming... so stay tuned!

Release v0.2.6 · vllm-project/vllm

Major changes Fast model execution with CUDA/HIP graph W4A16 GPTQ support (thanks to @chu-tianxiang) Fix memory profiling with tensor parallelism Fix *.bin weight loading for Mixtral models What&...

github.com

2

17

125

Woosuk Kwon

@woosuk_k

10 months

vLLM + AMD MI300X = Blazingly-fast LLM serving! 🚀🚀🚀

AMD

@AMD

10 months

Update: Let's look at some new inference performance data on AMD Instinct MI300X

8

50

249

0

8

84

Woosuk Kwon

@woosuk_k

10 months

Check out the Mistral's official inference code at vLLM!

GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for...

A high-throughput and memory-efficient inference and serving engine for LLMs - vllm-project/vllm

github.com

Zhuohan Li

@zhuohan123

10 months

Excited to have first-hand official support of the Mixtral MoE model in vLLM from @MistralAI ! Getting started with Mixtral with the latest vLLM now: . Be sure to check their announcing blog: Joint with @woosuk_k @PierreStock

0

5

73

4

66

Woosuk Kwon

@woosuk_k

8 months

The new vLLM release includes some optimizations for Gemma and Mixtral, and finally supports 8-bit GPTQ. Please give it a try!

Simon Mo

@simon_mo_

8 months

vLLM v0.3.3 is released with Starcoder2 @BigCodeProject and Inferentia @awscloud support. I'm also excited about the addition of guided decoding* (JSON, regex) in server leveraging @OutlinesOSS . *experimental, the schema take some time to compile but will be cached.

2

24

92

0

8

66

Woosuk Kwon

@woosuk_k

3 months

We've partnered with @AIatMeta to support the 405B model from Day 1. Enjoy!

vLLM

@vllm_project

3 months

🚀 Exciting news! In partnership with @AIatMeta , vLLM officially supports Llama 3.1! 🦙✨ For Llama 3.1 405B, vLLM supports FP8 quantization on single machine and pipeline parallelism for multi-node serving. Learn more in our latest blog post:

1

18

84

0

2

48

Woosuk Kwon

@woosuk_k

3 months

Linux Foundation is home to many important open source projects like Linux and PyTorch. Today we are excited to announce that @vllm_project is joining @LFAIDataFdn , an AI-focused sub-foundation under the Linux Foundation. vLLM will keep open and trusted!

vLLM

@vllm_project

3 months

Two exciting updates! * vLLM is already widely adopted, and we want to ensure it has open governance and longevity. We are starting to join @LFAIDataFdn ! * We are doubling down in performance. Please checkout our roadmap.

0

14

88

3

4

44

Woosuk Kwon

@woosuk_k

7 months

We finally made a Twitter account for vLLM @vllm_project ! Please follow this account for the latest updates on vLLM!

Cade Daniel 🇺🇸

@cdnamz

7 months

vLLM made a Twitter! Go give them a follow @vllm_project And fun vLLM meetup btw! Thanks for hosting @Roblox

3

8

30

2

5

40

Woosuk Kwon

@woosuk_k

10 months

We've just released v0.2.5 which includes this performance improvement (contributed by Antoni at @anyscalecompute ). Please try it out!

Matt Shumer

@mattshumer_

10 months

Looks like Mixtral on vLLM is about to get a LOT faster

5

22

226

0

4

38

Woosuk Kwon

@woosuk_k

10 months

Great news! Phi-2 also works with vLLM and greatly benefits from our newest integration with CUDA graphs. Give it a try on vLLM!

clem 🤗

@ClementDelangue

10 months

Phi-2 by @MicrosoftAI is now the #1 trending model on @huggingface (). 2024 will be the year of smoll AI models!

12

63

429

0

3

34

Woosuk Kwon

@woosuk_k

2 months

Always happy to see this kind of open source release. Thanks for the huge contributions to the community!

Byron Hsu

@hsu_byron

2 months

(1/n) Training LLMs can be hindered by out-of-memory, scaling batch size, and seq length. Add one line to boost multi-GPU training throughput by 20% and reduce memory usage by 60%. Introducing Liger-Kernel: Efficient Triton Kernels for LLM Training.

21

170

972

2

1

30

Woosuk Kwon

@woosuk_k

9 months

Thanks for your continuous contribution to vLLM! Really happy to have you in our community :)

Casper Hansen

@casper_hansen_

9 months

AWQ now runs up to 2.66x faster in vLLM after my PR was merged. Thanks to @woosuk_k for a quick review and merge 🥳 I hope we can bring more of these optimizations to the community so we can run models on smaller budgets!🤗

9

25

193

0

3

29

Woosuk Kwon

@woosuk_k

6 months

We are still actively looking for new reviewers! Please feel free to join us!

vLLM

@vllm_project

6 months

We are doubling our committer base for vLLM to ensure it is best-in-class and a truly community effort. This is just a start. Let's welcome @KaichaoYou , @pcmoritz , @nickhill33 , @rogerw0108 , @cdnamz , @robertshaw21 as committers and thank you for your great work! 👏

2

4

32

0

27

Woosuk Kwon

@woosuk_k

1 year

Please come see us at Ray Summit! @zhuohan123 and I will talk about vLLM, a state-of-the-art open-source LLM inference engine, which actually uses #Ray for distributed inference. Don't miss it!

Robert Nishihara

@robertnishihara

1 year

Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in

8

45

207

1

24

Woosuk Kwon

@woosuk_k

9 months

Super cool! This seems like the ultimate library everyone has wanted to have. Great work @ye_combinator !!

Zihao Ye

@ye_combinator

9 months

(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include: - Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page

2

82

298

2

0

21

Woosuk Kwon

@woosuk_k

1 year

Make LLM serving easy and fast with vLLM!

Zhuohan Li

@zhuohan123

1 year

🌟 Thrilled to introduce vLLM with @woosuk_k ! 🚀 vLLM is an open-source LLM inference and serving library that accelerates HuggingFace Transformers by 24x and powers @lmsysorg Vicuna and Chatbot Arena. Github: Blog:

20

254

1K

0

5

22

Woosuk Kwon

@woosuk_k

1 month

The vLLM meetup with NVIDIA is happening now! Join us and learn about our latest updates!

NVIDIA AI Developer

@NVIDIAAIDev

2 months

Dive into the latest in #AI with expert talks, updates, and networking. Join us for the @vllm_project & NVIDIA Triton User Meetup. 📅 Mon, Sept 9, 4-9 PM 📍Gallery 308, Fort Mason, SF. Secure your spot now ➡️ ✨

1

11

49

0

18

Woosuk Kwon

@woosuk_k

7 months

Our 3rd meetup is happening on April 2nd. Come join us!

Simon Mo

@simon_mo_

7 months

The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will be discussing feature updates and hear from you! We thank @Roblox for hosting the event!

0

7

27

2

1

18

Woosuk Kwon

@woosuk_k

1 year

@chrisatgradient Thanks for reporting the bug! The paged kernel is now fixed by

[BugFix] Fix NaN errors in paged attention kernel by WoosukKwon · Pull Request #936 · vllm-projec...

Fixes #641 This PR fixes the paged attention kernel. Currently, the kernel computes attn_weight * value for all tokens in a value block, even if some of them are not included in the context. It is ...

github.com

1

0

17

Woosuk Kwon

@woosuk_k

5 months

Join us for our 4th meetup on June 11th. We hope to see you there!

vLLM

@vllm_project

5 months

We are holding the 4th vLLM meetup at @Cloudflare with @bentomlai on June 11. Join us to discuss what's next in production LLM serving! Register at

0

8

25

1

2

17

Woosuk Kwon

@woosuk_k

1 year

Excited to see that our paged attention algorithm is adopted! Looking forward to it!

NVIDIA AI Developer

@NVIDIAAIDev

1 year

Just announced - NVIDIA TensorRT-LLM supercharges large language model #inference on NVIDIA H100 Tensor Core GPUs. #LLM

15

48

156

0

3

16

Woosuk Kwon

@woosuk_k

1 year

Getting and managing GPUs on cloud has grown increasingly challenging nowadays. Check out our latest blog post with @skypilot_org to discover how SkyPilot is making the development and deployment of vLLM easier!

SkyPilot

@skypilot_org

1 year

UC Berkeley's vLLM + SkyPilot speeds up LLM serving by 24x 🤩 Our user blog post on how SkyPilot combated GPU availability for #vLLM , allowing them to focus on AI and not infra. (Also includes a 1-click guide to run it on your own cloud account!)

1

18

80

0

16

Woosuk Kwon

@woosuk_k

2 months

The next vLLM Meetup will be hosted with NVIDIA! Please join us on September 9th!

vLLM

@vllm_project

2 months

Join us on Monday, September 9 at Fort Mason to discuss recent performance enhancements in vLLM. This is a collaboration with NVIDIA Triton team.

0

4

28

0

15

Woosuk Kwon

@woosuk_k

1 year

Please join our first meetup and share your experience with vLLM!

0

1

13

Woosuk Kwon

@woosuk_k

4 months

Check out the improved multi-modality support in @vllm_project by @rogerw0108 !

Roger Wang

@rogerw0108

4 months

More exciting news around multi-modality in the upcoming v0.5.1 @vllm_project release! With a much simpler interface, vLLM will now support dynamic image embedding shapes for models such as LLaVA-NeXT and Phi-3-Vision!

4

12

51

0

1

13

Woosuk Kwon

@woosuk_k

1 month

Try out Pixtral with vLLM!

vLLM

@vllm_project

1 month

🖼️ pip install -U vLLM vllm serve mistralai/Pixtral-12B-2409 --tokenizer_mode mistral --limit_mm_per_prompt 'image=4' --max_num_batched_tokens 16384

5

33

320

0

13

Woosuk Kwon

@woosuk_k

1 year

Check out this wonderful article from @anyscalecompute ! It shows vLLM delivers 23x speedup on OPT-13B!

Cade Daniel 🇺🇸

@cdnamz

1 year

I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at

2

52

245

0

2

13

Woosuk Kwon

@woosuk_k

3 months

We will have the 5th vLLM meetup with AWS next Wednesday! Please join us and learn more about our recent progress!

vLLM

@vllm_project

3 months

We are excited to invite everyone to our 5th meetup with AWS on July 24 in SF (next Wed!). The team will share recent progress in FP8, pipeline parallel, and various work on perf. The space is limited to 150 so plz register ASAP!

1

8

18

0

1

13

Woosuk Kwon

@woosuk_k

6 months

Congrats!!!

Zhanghao Wu

@Michaelvll1

6 months

I am honored to share that our recent paper won the Outstanding Paper Award in NSDI’24! The paper explores the policy design of our SkyPilot managed spot for @skypilot_org : Can’t Be Late: Optimizing Spot Instance Savings under Deadlines It would not be possible, if it were not

10

6

90

0

13

Woosuk Kwon

@woosuk_k

9 months

Please check out our second meetup on Jan 31st!!

Simon Mo

@simon_mo_

9 months

We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us.

3

9

36

1

11

Woosuk Kwon

@woosuk_k

7 months

@HamelHusain Why not use vLLM? :)

2

1

10

Woosuk Kwon

@woosuk_k

10 months

Congratulations to everyone in the second batch! 🔥🔥🔥

Matt Bornstein

@BornsteinMatt

10 months

We're announcing the second batch of @a16z open source AI grants today This cohort focuses on: ▶️ tools for LLM training/ hosting/ evals ▶️ visual AI models & communities Thank you to the grantees for your contributions! More info in the linked post

15

41

240

0

7

Woosuk Kwon

@woosuk_k

2 months

Welcome!

Neural Magic

@neuralmagic

2 months

🎉 Exciting news! Tyler Smith, one of our many talented engineers, is now Neural Magic's 3rd vLLM project committer! Check out Tyler's contributions: . We’re proud to be a leading contributor to @vllm_project . 🚀 Cheers to Tyler and the team!

2

5

26

0

5

Woosuk Kwon

@woosuk_k

11 months

For those who cannot access the link, please try this one instead:

Notes on vLLM v.s. DeepSpeed-FastGen

TL;DR:

blog.vllm.ai

1

0

5

Woosuk Kwon

@woosuk_k

6 months

Check out @luo_michael1234 's awesome work on picking the best LoRA adapters to create crisp images!

Michael Luo

@luo_michael1234

6 months

[1/5] Introducing Stylus 🖌️ - an #AI tool that automatically finds and adds the best adapters (LoRAs, Textual Inversions, Hypernetworks) to #StableDiffusion based on your prompt. 🗞️ Paper: 🌎 Project Page:

7

34

92

0

5

Woosuk Kwon

@woosuk_k

10 months

@andrey_cheptsov Yes! `pip install vllm megablocks` is all you need. 🚀🚀🚀

0

2

Woosuk Kwon

@woosuk_k

3 months

@vsreekanti @RunLLM @vllm_project It's so useful! It not only covers the docs, but also links to our Github issues! Thanks for adding it to vLLM!

0

1

3

Woosuk Kwon

@woosuk_k

1 year

@Teknium1 @zhuohan123 You are right. The majority of the vLLM’s speedup comes by batching more prompts every run. However, you can get some speedup even in a single user env, because vLLM also includes some other optimizations orthogonal to PagedAttention.

0

3

Woosuk Kwon

@woosuk_k

1 year

@b_arbaretier @zhuohan123 @lmsysorg @b_arbaretier Thanks for your interest and great question! Currently, we don't compile the models. We're currently exploring if torch.compile is suitable for us, or if we can enhance performance by optimizing the code for individual models.

1

0

3

Woosuk Kwon

@woosuk_k

11 months

Amazing!! 🤣🤣

Rajko Radovanović

@rajko_rad

11 months

Image conditioning on @ideogram_ai is awesome! 3D render of re-imagined vLLM logo below :) cc @zhuohan123 @woosuk_k

0

2

10

0

3

Woosuk Kwon

@woosuk_k

2 months

@natolambert @mgoin_ @vllm_project You can pip install the nightly version! export VLLM_VERSION=0.5.3.post1 export VLLM_COMMIT=16a1cc9bb2b4bba82d78f329e5a89b44a5523ac8 pip install ${VLLM_COMMIT}/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl

1

0

2

Woosuk Kwon

@woosuk_k

3 months

@natolambert @natolambert Could you please share more about the error you faced with? FlashInfer is actually required for Gemma2. You can install it with `pip install `

1

0

2

Woosuk Kwon

@woosuk_k

6 months

@ziming_mao @anuragk_ Amazing!! Congrats!! 👊👊

0

1

Woosuk Kwon

@woosuk_k

3 months