Excited to share some life updates 🥳📢:
I'll be starting as an Assistant Professor @CarnegieMellon @CMU_ECE in Fall 2023. Until then, I'll be a visiting researcher at @Meta @MetaAI.
I'm heading to #ICML2022 tmr!!! DM if you want to catch up 😃☕️🍱...
📢 Announcing our new speculative decoding framework Sequoia ❗️❗️❗️
It can now serve Llama2-70B on one RTX4090 with half-second/token latency (exact❗️no approximation)
🤔Sounds slow as a sloth 🦥🦥🦥???
Fun fact😛:
DeepSpeed -> 5.3s / token;
8 x A100: 25ms / token (costs 8 x
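For intuition, here's a minimal greedy draft-and-verify loop in PyTorch. This is NOT Sequoia's tree-based algorithm; draft_model/target_model are assumed HF-style causal LMs and greedy acceptance is a simplification. The point is that one target forward pass scores k drafted tokens at once:

import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    # 1) Draft: the small model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    # 2) Verify: ONE forward pass of the big model scores all drafted positions.
    target_logits = target_model(draft_ids).logits
    n = input_ids.shape[1]
    out = input_ids
    for i in range(k):
        target_tok = target_logits[:, n + i - 1, :].argmax(-1, keepdim=True)
        out = torch.cat([out, target_tok], dim=-1)   # match -> free token; mismatch -> corrected token
        if not torch.equal(target_tok, draft_ids[:, n + i : n + i + 1]):
            break                                    # stop at the first disagreement
    return out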
Can sparse training achieve a wall-clock speedup on GPUs?
Yes! Simple and static #sparsity -> 2.5x faster🚀 training of MLP-Mixer, ViT, and GPT-2 medium from scratch with NO drop in accuracy. (#NeurIPS2021) [1/6]
❓Wanna host a Llama2-7B-128K (14GB weight + 64GB KV cache) at home🤔
📢 Introducing TriForce! 🚀Lossless Ultra-Fast Long Seq Generation — training-free Spec Dec! 🌟
🔥 TriForce serves at 0.1s/token on 2 RTX4090s + CPU – only 2x slower than on an A100 (~55ms on chip), 8x faster
📢My group at @CMU_ECE is looking for Ph.D. students in #Algorithms #MLSys (ddl Dec 15)!
Let’s shed new light on classical algorithms, make ML more accessible to the general community, and advance interdisciplinary research (science?!) together!
🙏Plz help spread the word.
Did you know the KV cache can easily take 160GB on Llama2-70B (e.g., 8K seqlen + batch size 64), even though it uses grouped-query attention?
Come and see our preliminary work on how a super simple cache eviction policy can reduce this bottleneck! There are huge opportunities in this space 🫵🏻
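Back-of-the-envelope math behind the 160GB, assuming fp16 and Llama2-70B's published shape (80 layers, 8 KV heads after GQA, head dim 128):

layers, kv_heads, head_dim = 80, 8, 128                 # Llama2-70B with grouped-query attention
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, fp16 (2 bytes each) -> ~320 KB/token
total_bytes = bytes_per_token * 8192 * 64               # 8K seqlen x batch size 64
print(total_bytes / 2**30)                              # ~160 GiB of KV cache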
📢 Our new work LESS leverages the observation that attention in pretrained LLMs has an intrinsically sparse + low-rank structure. ☝️So at inference time, we can decompose the KV cache into a constant-size sparse cache plus an RNN state (because low-rank attention is an RNN).
This also explains why the recent
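Rough intuition for "low-rank attention is an RNN": a generic linear-attention sketch (not the LESS kernel itself; phi is a placeholder feature map) where the growing KV cache is replaced by fixed-size running states:

import torch

def linear_attention_step(q_t, k_t, v_t, S, z):
    # S: (d_feat, d_v) running sum of phi(k) v^T;  z: (d_feat,) running sum of phi(k).
    phi = lambda x: torch.nn.functional.elu(x) + 1      # placeholder low-rank feature map
    S = S + torch.outer(phi(k_t), v_t)                  # constant-size state update: no KV growth
    z = z + phi(k_t)
    out = (phi(q_t) @ S) / (phi(q_t) @ z + 1e-6)        # attention output read from the RNN state
    return out, S, z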
Upgrade your LLM KV cache eviction policy with LESS, our method to retain local and global information during generation with pretrained LLMs! Excited to share this at ICML!
Paper:
w/ @Xinyu2ML, @KyriectionZhang, Zhangyang Wang, Yuejie Chi, @BeidiChen
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank
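A minimal sketch of the gradient low-rank projection idea from my reading of the abstract; the rank, the momentum-only optimizer, and how P is refreshed are illustrative assumptions, not the released GaLore code:

import torch

def galore_like_step(param, grad, P, exp_avg, lr=1e-3, beta=0.9):
    # P: (d_out, r) basis, e.g. top-r left singular vectors of a recent gradient (refreshed periodically).
    # exp_avg: (r, d_in) optimizer state kept in the rank-r space -- this is where memory is saved.
    g_low = P.T @ grad                                 # project the full gradient down to rank r
    exp_avg.mul_(beta).add_(g_low, alpha=1 - beta)     # momentum accumulates in the low-rank space
    param.data.add_(P @ exp_avg, alpha=-lr)            # project back up only for the weight update
    return exp_avg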
📢 #ICML2023 23-30th🌴🌺
Please come and say #hi at our oral talks, poster sessions, workshop, or if you saw someone wearing #BLACKPINK... hair on the 🏖️
Let's chat about #MLSys #LLMs #Efficiency, new model arch, data selection or maybe hair color?!!!
Congrats team🎉 it's been really exciting to tackle the efficiency problem along the lines of long-sequence generation for LLMs! More insights coming soon 👻
This is the first time we've seen a new architecture make an 🍎-to-🍎 comparison at scale with Llama-7B trained on the same 2T tokens and win (unlimited context length, lower ppl, constant KV at inference, ...)! Very excited to be part of the team! Thanks for the lead @violet_zct
How to enjoy the best of both worlds of efficient training (less communication and computation) and inference (constant KV-cache)?
We introduce a new efficient architecture for long-context modeling – Megalodon that supports unlimited context length. In a controlled head-to-head
Want to know how we exploit sparsity without finetuning the LLM to do inference faster in wall-clock time?
We will present Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time at #ICML.
Come chat with us at the 2pm poster session today and Oral C3 on Thursday at 3pm.
📢We're thrilled to announce that Kurt Keutzer will give the keynote speech for the MLSys 2024 Young Professionals Symposium. Come join us for exciting invited talks by @Azaliamirh, Xupeng Miao, @jiawzhao, @ying11231, @tri_dao on cutting-edge MLSys research!
The full
#ICML2024 🥳 Will be at the MoE tutorial panel today, present 6 papers about efficient LLM training and inference Tue-Thurs, give invited talks at the Long-context Modeling and Memory-efficient Training workshops, and co-host two Fri-Sat. Excited to meet people @icmlconf! DM/Email or
🧑🤝🧑 Introducing MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models
🚀 MATES significantly elevates the scaling curve by selecting the data based on the model's evolving needs.
Paper:
Code:
🧵[1/n]
Come and join the discussion on long sequence generation of #LLMs (10am EDT). I'll talk about recent work on efficient LLM inference, e.g., H2O, StreamingLLM, Deja Vu, from different perspectives: 1) Efficiency: reduce KV cache & weight IO bottlenecks 2) New ability: interesting
📢 Join us for the @LightOn #AI Meetup on Oct 27, 4-5 PM CEST! Dive into the latest in large language models. Highlight: Talk by @BeidiChen, Assistant Professor at Carnegie Mellon University and Visiting Research Scientist at FAIR, Meta.⏰:
Update: After 4 years at NVIDIA, I recently joined Adobe Research and will be working remotely from Pittsburgh. If you're a student interested in multimodal content creation and seeking a research internship or collaboration, feel free to DM or email me!
Hongyi is an awesome MLSys candidate! He’s leveraged sparsity and low rank properties of activation / weight matrices in deep learning models for (communication) efficient learning.
1/ I am currently on the academic job market, applying for Assistant Professor positions in any field related to CS!
My research focuses on ML & Systems, specifically on computation- and communication-efficient distributed ML, efficient computing in LLMs, and federated learning.
@tri_dao will present our #ICML2022 work Monarch: Expressive Structured Matrices for Efficient and Accurate Training at Ballroom #1 at 2pm! Come and join us in our poster session today as well. Super thrilled that we won an *outstanding paper award*!!! 🚀
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
paper page:
Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at
Excited to be involved in this upcoming innovative conference that will serve as a fresh platform for ML, signal processing, optimization, and neuroscience researchers focusing on "sparsity"! Can't wait for it to kick off!
Announcing Conference on Parsimony and Learning (CPAL), a new annual conference for researchers in ML, signal processing, optimization, etc. who study parsimonious, low dimensional structures! (1/5)
Check out this awesome work that considers both expressiveness of the network architecture and hardware utilization!
📣 Btw Dan’s on the academic market this year!! You wouldn’t want to miss this amazing MLSys candidate who leverages math, ML, and systems in every single work!
Excited about models that are sub-quadratic in sequence length and model dimension? Our Monarch Mixer paper is now on arXiv -- and super excited to present it as an oral at #NeurIPS2023!
Let's dive into what's new with the paper and the new goodies from this release:
Monarch
Introducing lookahead decoding:
- a parallel decoding algo to accelerate LLM inference
- w/o the need for a draft model or a data store
- linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.
Blog:
Code:
🚀Exciting news! Join us at MLSys 2024 Young Professionals Symposium on May 13th in Santa Clara. 🎓Dive into discussions on large model training, industry vs. academia, entrepreneurship, and more. Don’t miss this chance to connect with experts & peers in the field!
#MLSys2024
🔥
Ideally: Sparse models use less compute & memory while retaining the generalization benefits of overparameterized models.
Challenge 1: Finding the right sparsity pattern (NP-Hard)
Insight: sparse and low-rank are complementary
Approach: static sparsity + low-rank approx. [3/6]
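A toy version of the [3/6] recipe: a dense layer replaced by a fixed-mask sparse matrix plus a low-rank correction. The random mask and rank are placeholders; the paper picks hardware-friendly patterns, and a real implementation would use block-sparse kernels instead of masking a dense weight:

import torch, torch.nn as nn

class SparseLowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, rank=16, density=0.1):
        super().__init__()
        self.register_buffer("mask", (torch.rand(d_out, d_in) < density).float())  # static sparsity, fixed at init
        self.w_sparse = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.u = nn.Parameter(torch.randn(d_out, rank) * 0.02)                     # low-rank factors
        self.v = nn.Parameter(torch.randn(rank, d_in) * 0.02)

    def forward(self, x):
        return x @ (self.w_sparse * self.mask).T + (x @ self.v.T) @ self.u.T       # sparse part + low-rank part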
Three key advantages make Sequoia outstanding:
1) Scalable: possible to leverage large speculation budgets, adapting to hardware development trends;
2) Robust: suitable for commercial serving to accommodate various LLM applications;
3) Hardware-Aware: automatically adapts to
Challenge 2: Achieving a wall-clock speedup (sparsity is not hardware-friendly)
Insight: butterfly matrices 🦋 can represent ANY sparse matrices
Fixed and block sparsity is hardware-friendly!
Approach: flat, block butterfly matrices + low-rank [4/6]
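Why block sparsity maps well to GPUs, as a toy: a (block-diagonal, permute, block-diagonal) product computed entirely with dense batched matmuls. Block sizes and the permutation are arbitrary here; the actual flat block-butterfly construction is in the paper:

import torch

def block_butterfly_matmul(x, blocks1, blocks2, perm):
    # x: (B, n); blocks1/blocks2: (n_blocks, bs, bs) block-diagonal factors; perm: a permutation of range(n).
    B, n = x.shape
    nb, bs, _ = blocks1.shape
    y = torch.einsum("bkc,kcd->bkd", x.view(B, nb, bs), blocks1).reshape(B, n)  # block-diagonal matmul = batched GEMM
    y = y[:, perm]                                                              # cheap index permutation
    return torch.einsum("bkc,kcd->bkd", y.view(B, nb, bs), blocks2).reshape(B, n)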
The MLSys Seminar is back this week with our very own @BeidiChen! Tune in Thursday, 1:30 PM on YouTube to hear about her great work on sparsity in deep learning.
Livestream link:
#Stanford #MachineLearning
Excited to share our latest work on understanding the SGD training dynamics of 1-layer Transformer (). We open the black box of 1-layer Transformer (self-attention + decoder) in a mathematically rigorous way.
Our findings? 🧐 The training has two distinct
@ggerganov @EvMill The blog about Softmax+1 played a very important role when we were trying to identify the root cause of the sink. @Guangxuan_Xiao can comment more!
Since this #sparsity can also represent the FFT and more transforms, we show interesting results on #mri reconstruction and #pde solving (inspired by #FNO) besides NLP/CV applications.
Replace dense layers with (permute+block-sparse)*2 layers and get ~2x improvement across the board. One thing I really enjoyed in this work is the experimentation on all 3 fronts: (1) Sparse training (2) Dense2Sparse (3) Sparse2Dense(!)
Paper:
@BeidiChen
@tri_dao
Welcome to my talk "Angular Visual Hardness" at 2:00PM today at the Deep Phenomena workshop. I will talk about joint work with @animesh_garg, @Anshumali_ and @AnimaAnandkumar on how we bridge the gap between the perception of hardness in human visual systems and CNNs.
#MLSys2024 Student Travel Grants were just announced. The deadline for applications is 4/24/24. Check out the Young Professionals Symposium chaired by @BeidiChen and @guanh01! See for further details.
@Anshumali_
Thank you so much for being an incredible advisor and guiding me over the years! I will try my best to mentor and support my future students in the same way 😃
🚨MLSys 2023 workshop proposal deadline in ~3 weeks🚨 () Gennady Pekhimenko, @tqchenml, @mcarbin, and I look forward to your submissions!
Key Dates:
- Application Deadline, Dec 16, 2022 4pm ET
- Acceptance notification: Jan 6, 2023
Today we're joined by @BeidiChen of @RiceUniversity to discuss her work on the paper SLIDE: In Defense of Smart Algorithms Over Hardware Acceleration for Large-Scale Deep Learning Systems.
@gneubig @Guangxuan_Xiao Thanks @gneubig for the great suggestion! This is precisely at the top of our list. We're planning to evaluate a few methods that can compress the KV states, including StreamingLLM, H2O, retrieval-based, etc., on long-doc/context tasks — and see what we are missing 😉
@giffmana
@tri_dao
We’re excited you like our work!!! Patch-based models are amazing🥳!
We expect bigger benefits for larger models, because we observed a trend of more speedup going from ViT/Mixer-S->B->L and GPT2-small->medium.
DINOv2+registers=♥️
We are releasing code and checkpoints for DINOv2 augmented with registers and a slightly better training recipe. No more of those pesky artifacts!
Simple one-liner, try it out:
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
Simulating Maxwell's equations is slow. Is closed-form possible? Yes! Our work CZP () gives an accurate & sample-efficient surrogate model that predicts the frequency response of a linear PDE. Via RL search, it finds 2D antenna designs verified by commercial software.
@giffmana
@tri_dao
🤣Zhao and Atri have been long-time collaborators of ours. They’re experts in sketching and structured matrices (the core ingredients of our method). Kaizhao and Jiaming have been a huge help 💪 in systems and deep learning theory.
@cHHillee
I believe the absolute speed we got with 7B on an A100 is 5.9ms/token (which is 4x HF and 2x FasterTransformer?). It's based on HF code 🥹 so there's room to improve further~
Why are existing Spec Dec algorithms not appropriate for the long-sequence regime?
😂Training a small draft model with 128K context for speculation sounds hard
🧐Speculating with a normal small model + StreamingLLM doesn't work
😉 Wait! We're no longer dealing with a weight bottleneck, but a KV one!
Insights:
(1) Attention is naturally sparse ➡️ you don't need all KV for each generated token
(2) Spec Dec requires full KV for verification anyway ➡️ there's hope for a better KV selection algorithm than H2O and StreamingLLM
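To picture insight (1), a toy chunk-level KV selection: keep only the chunks most relevant to the current query for the draft pass, while verification still sees the full cache. Chunk size, top-k, and mean-key scoring are my simplifications, not TriForce's exact retrieval cache:

import torch

def select_kv_chunks(q, K, V, chunk_size=64, top_k=8):
    # q: (d,); K, V: (seq, d). Returns a small retrieved KV cache for drafting.
    n_chunks = K.shape[0] // chunk_size
    chunk_keys = K[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1).mean(dim=1)  # one summary key per chunk
    top = (chunk_keys @ q).topk(min(top_k, n_chunks)).indices                           # chunks the query attends to most
    keep = torch.cat([torch.arange(i * chunk_size, (i + 1) * chunk_size) for i in top.tolist()])
    return K[keep], V[keep]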
@activewarp
@Guangxuan_Xiao
We discovered that using one extra token is enough for 160m pretraining case, but you might be right! Larger models might need more 🤣
@AlberFuen
You could! FlexGen and DeepSpeed support that. Theoretically we can run Spec Dec on top of these — it requires a bit of infrastructure tweaking ~
@main_horse
@_akhaliq
I totally agree with this point!! But DejaVu and FlexGen were ICML publications — meaning they were done before Llama etc. came out 🤣. If you're interested in sparsity in LLMs — check out our more recent sparsity work H2O () & StreamingLLM
TriForce is a scalable hierarchical speculative decoding system for long sequence generation: 68m model+streamingllm ➡️Llama2 ➕ retrieved sparse KV cache ➡️ Llama2-128K.
Three core strengths of TriForce:
Training-Free: no need for additional long-context draft model training
Hierarchical Speculation: tackle the two memory bottlenecks sequentially using different draft models
Scalability and Robustness: outstanding scalability for long contexts
Exciting results:
(1) Off-loading: 8x faster than DeepSpeed for Llama2-7B-128K on two RTX 4090s (0.1s/token), and 5x faster on a single RTX 4090
(2) On-chip: 2.31x faster on an A100
(3) It's compatible with Decoding Tree (our own Sequoia )
(4) It can scale to
I'm looking for researchers with experience and a strong passion for large-scale image-text models to join our research team in CA. Strong knowledge of diffusion models, contrastive learning, or data curation is preferred. Team-work first, extremely hard-core, and perfection-driven.
After a short hiatus, the Stanford MLSys Seminar is coming back this quarter with a special series of episodes on foundation models!
Our first talk (ep 67!!) will be @tri_dao, who'll be talking about FlashAttention. Catch us *TOMORROW* at 3:30 PT:
If you're attending ICML 2024, join my 2-hour tutorial on Monday July 22 to explore the Physics of Language Model - all 6 parts. Visit: and it will be live-streamed on Zoom. BONUS: this is the premiere of Part 2.1 + 2.2, don't miss out!
#ICML2024
#MetaAI
TriForce optimizes across memory hierarchies for efficient long sequence generation on consumer devices and can potentially extend its capabilities to robots, enhancing their interaction with long-context conversations.
@SambaNovaAI
@GroqInc
@Apple
@intel
@AMD
@PuduRobotics
Tired of battling with the wild west of large language model prompting frameworks and APIs?! We’re excited to introduce Manifest, our python framework that makes prompt programming simple, interactive, and reproducible.
💻:
@ITsol4u
@tri_dao
Some of the core hashing and sketching techniques we used have been widely adopted for high dimensional sparse data, e.g. locality sensitive hashing 🤩