Say hello to Grok-1's new PyTorch + Hugging Face edition! 314 billion parameters, 3.8x faster inference. Easy to use, open source, and optimized by Colossal-AI. Dive in: #Grok1 #ColossalAI
Download Now:
Exciting News from Open-Sora! They've just made the ENTIRE suite of their video-generation model open source! Dive into the world of cutting-edge AI with access to model weights, comprehensive training source code, and detailed architecture insights. Start building your dream
Build your own video generation model like #Sora! Experience the power of replication without the price tag! Open-Sora delivers a low-cost implementation of Sora, cutting costs by a staggering 46%. Expand your sequences to nearly a million with this innovative open-source
Want to train a model like #Sora? Check out our new project #OpenDiT!
OpenDiT is an easy-to-use, fast, and memory-efficient system for training and deploying DiT models, which are the foundation of models like Sora.
With OpenDiT, you can achieve:
* Up to 80% faster training
Speed up Open-Sora's training by 3x and inference by 2x with our novel DSP (Dynamic Sequence Parallelism)! For 10s 512x512 videos, Open-Sora's inference time:
1xH800: 106s
8xH800: 45s
8xH800+DSP: 22s
DSP can be seamlessly adapted to all multi-dimensional transformers, unlocking
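The general idea behind this kind of dynamic re-sharding can be illustrated with a toy single-process sketch (numpy arrays stand in for real devices and collectives here; the actual DSP implementation is more involved):

```python
import numpy as np

def all_to_all(shards, axis_split, axis_concat):
    """Simulated all-to-all collective: re-shard per-device chunks
    from one tensor dimension onto another."""
    full = np.concatenate(shards, axis=axis_concat)  # gather current shards
    return np.array_split(full, len(shards), axis=axis_split)

# A toy video activation tensor: (time, space, hidden)
x = np.arange(8 * 6 * 4, dtype=float).reshape(8, 6, 4)
devices = 2

# Shard over the spatial dim so each device sees the full time axis
# and can run temporal attention locally ...
shards = np.array_split(x, devices, axis=1)
assert shards[0].shape == (8, 3, 4)

# ... then dynamically re-shard over the time dim for spatial attention.
shards = all_to_all(shards, axis_split=0, axis_concat=1)
assert shards[0].shape == (4, 6, 4)
```

Re-sharding like this lets each attention step run over a full, unsplit dimension without any device ever holding the whole tensor.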
I am happy to share that our paper has been accepted by ICLR as an ORAL paper (1.2% acceptance rate).
InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning
InfoBatch randomly prunes a portion of less informative samples based on the
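A minimal sketch of the idea (the mean-loss threshold and `prune_ratio` below are illustrative choices, not the paper's exact hyperparameters): samples with below-average loss are dropped with some probability, and the survivors are up-weighted so the expected gradient stays unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def infobatch_prune(losses, prune_ratio=0.5):
    """Drop each below-average-loss sample with prob. `prune_ratio`;
    re-weight the kept low-loss samples by 1/(1 - prune_ratio) so the
    expected gradient over the batch is unchanged."""
    losses = np.asarray(losses, dtype=float)
    low = losses < losses.mean()                      # "less informative"
    drop = low & (rng.random(losses.shape) < prune_ratio)
    keep = ~drop
    weights = np.where(low, 1.0 / (1.0 - prune_ratio), 1.0)
    return keep, weights[keep]

keep, w = infobatch_prune([0.1, 0.2, 1.5, 2.0, 0.05, 1.8])
print(int(keep.sum()), "of 6 samples kept")
```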
I'm grateful to graduate from
@Berkeley_EECS
with the Lotfi A. Zadeh Prize. I'm excited to announce that I will join the National University of Singapore as a tenure-track assistant professor at the Department of Computer Science in
@NUSComputing
I am happy to share that our paper won the Outstanding Paper Award of ACL.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in LLM training.
Get ready for cinematic magic with Open-Sora! It generates 16s & 720p video. Say hello to seamless storytelling, where your vivid imagination comes to life in high-definition with just a prompt! Open-Sora's bucket strategy redefines efficiency, with only 64 GPUs,
Students of @UCBerkeley usually get the Ph.D. lollipop when they submit their dissertations. I could not do that because of COVID-19. However, @GradDivision mailed it to me from 13,590.66 km away! What a great tradition! What a big surprise! Thanks a lot!
@GradDivision
@Berkeley_EECS
Our paper was published on May 26, 2021, and was also accepted by ACL.
We clearly named the method ring self-attention (i.e. ring attention).
I did not find any substantial difference between ring self-attention and ring attention.
To my knowledge, our work is the first
New paper w/ @matei_zaharia @pabbeel on transformers with large context size.
We propose RingAttention, which allows training on sequences that are device-count times longer than the prior state of the art, without attention approximations or additional overhead.
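A toy single-process sketch of what this means (array blocks stand in for devices and a loop stands in for the ring communication; the real system overlaps block transfers with compute):

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: exact softmax attention over the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(-1, keepdims=True))
    return (p / p.sum(-1, keepdims=True)) @ v

def ring_attention(q, k, v, devices=4):
    """Each 'device' keeps its query block and accumulates exact attention
    with an online (log-sum-exp) update as K/V blocks rotate around the
    ring, so no device ever holds the full sequence. No approximation."""
    d = q.shape[-1]
    qs, ks, vs = (np.array_split(t, devices) for t in (q, k, v))
    outs = []
    for i, qi in enumerate(qs):
        m = np.full(qi.shape[0], -np.inf)          # running row max
        l = np.zeros(qi.shape[0])                  # running normalizer
        acc = np.zeros((qi.shape[0], v.shape[-1])) # running weighted sum
        for step in range(devices):                # ring order of K/V blocks
            j = (i + step) % devices
            s = qi @ ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(-1))
            scale = np.exp(m - m_new)
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(-1)
            acc = acc * scale[:, None] + p @ vs[j]
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(ring_attention(q, k, v), full_attention(q, k, v))
```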
Colossal-AI team just released SwiftInfer - a TensorRT-based implementation of StreamingLLM, boosting inference performance by a whopping 46%! In the scenario of long-text multi-round conversations, StreamingLLM can improve the ability of an LLM to understand and remember context,
I'll attend NeurIPS: please let me know if you want to chat or grab a coffee (or watch the FIFA World Cup)! DMs are open.
Excited to be finally back at an in-person NeurIPS after 3 years!
#NeurIPS2022
Would you like to accelerate AI model training by 10x? Do you want an easy-to-use system that abstracts away all the repetitive nonsense from under the hood?
Fret not, Colossal-AI is now open-source!
All of our 5 paper submissions have been accepted by
#CVPR2022
Congrats to my students! Hopefully, see you in New Orleans! More details can be found here:
How can we use Adversarial Learning to speed up the training process of AI models? I am happy to share our new paper, which was recently accepted by ICLR'22
I'm happy to share that my students and I recently built a tech startup
@HPCAITech
We are working on AI systems. We have raised 4.7 million USD in just 3 months :-)
The former premier of China passed away. He was a visionary leader who dedicated himself to the progress and well-being of his nation. Rising from humble origins, he ascended through his exceptional talent and wisdom to the nation's highest echelons of leadership. Tasked with
Exciting news in AI! A 20% enhancement in training efficiency for LLaMA3 8B and 70B! Colossal-AI offers tailored solutions for LLaMA3 models, significantly boosting training efficiency and setting new standards with exceptional performance. Check out the open-source project on
NUS computer science Ph.D. program (full scholarship) has a Spring intake. The deadline is June 15th. Here is the application information: My research group's information can be found at
I'd like to introduce the Colossal-AI system, which can potentially help you train/deploy super-large AI models quickly without changing your code.
GitHub:
Paper:
I am happy to see our LAMB optimizer was included in MLPerf's BERT implementation.
Google finished BERT training in 24 seconds based on MLPerf.
However, MLPerf used its own convergence metric, which is different from Mr. Jacob Devlin's baseline.
It is my pleasure to be the session chair for ML: Optimization at
#AAAI23
Our session will cover the latest techniques in machine learning optimization.
If you are interested in improving the efficiency of ChatGPT, Stable Diffusion, DALL·E 2, and AlphaFold 2, come talk to us!
Getting Chatbot Arena model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6), simply by mixing the off-the-shelf benchmarks!
Introducing our MixEval, a revolutionary #LLMs evaluation paradigm that's fast, cheap, and precise! By blending real-world
How to get Chatbot Arena model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6)?
Maybe simply mix the classic benchmarks.
Introducing MixEval, a new gold-standard LLM evaluation paradigm standing on the shoulders of giants (classic benchmarks).
I will be speaking at the 37th AAAI Conference on Artificial Intelligence on Feb 7th and 8th! I'll be discussing how to efficiently train large AI models like GPT-3 and Stable Diffusion. See you there.
#AAAI23
#AAAI
#AI
#ArtificialIntelligence
@RealAAAI
We are actively seeking talented postdoctoral researchers specializing in LLM and MLSys. If you have a passion for these fields, please click on the links below for more information and to apply.
Berkeley was ranked No. 1 by Forbes on the top US colleges list! Berkeley is the first public university to win Forbes' top ranking. That's amazing! I miss CAL :-)
Congratulations to Prof. Jack Dongarra for winning the Turing Award! Well deserved! BTW, I want to mention that my advisor Prof. James Demmel @Berkeley_EECS also made significant contributions to HPC and numerical libraries. This picture can tell us something :-)
Congratulations to Bill Gropp, who was recently elected IEEE Computer Society president! Bill was the host of my faculty job interview at the UIUC CS department. He gave me a good piece of career advice. He is a very nice person. I wish him the best of luck in the new job!
To have a happy life, we should find more people who love us, instead of minimizing the number of people who hate us. The number of people who hate us really does not matter, but the number of people who love us determines how far we can go :-)
Our two tutorials have been accepted by
@RealAAAI
. It is my privilege to teach AI to top AI experts in the world. See you in Washington DC!
#AAAI23
Tutorial 1: Colossal-AI: Scaling AI Models in Big Model Era
Tutorial 2: Large-scale Deep Learning Optimization Techniques
Based on current techniques, an LLM query will be more expensive than a search engine query.
LLM inference mainly uses matrix-matrix multiplies. A search engine (e.g. the PageRank algorithm) is based on matrix-vector multiplies. Each database query is just a matching operation.
Matrix-matrix
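As a back-of-the-envelope illustration of that gap (FLOP counts only, ignoring memory traffic and batching):

```python
# Multiplying an n x n matrix by another n x n matrix costs ~2*n^3 FLOPs,
# while multiplying it by a single vector costs ~2*n^2 FLOPs, so each
# matrix-matrix operation is roughly n times more work.
def matmat_flops(n):
    return 2 * n ** 3

def matvec_flops(n):
    return 2 * n ** 2

n = 4096
print(matmat_flops(n) / matvec_flops(n))  # -> 4096.0
```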
I'd like to share a paper recently published by Google: "Exploring the Limits of Concurrency in ML Training on Google TPUs". It shows how Google finishes the training of large deep learning models within one minute.
A peaceful protest at
@UCBerkeley
. I am happy to see they drew lines on the ground to implement social distancing. I found that many of them are actually elderly people, who are vulnerable to COVID-19 and violence. I want to thank them for what they did for the community.
Excited to kick off the new semester! There's nothing quite like teaching in a bustling classroom packed with so much talent. Here's to a great term ahead!
Our new work with @quocleix and @tanmingxing.
People can now finish ImageNet training in 1 minute. However, a 75.9% convergence accuracy is probably too low to be practical. We achieve 83% ImageNet top-1 accuracy in 1 hour, which is a world speed record.
Excited to share our
#ICCV2023
paper: Fine-tuning Vision-Language Models without Zero-Shot Transfer Degradation (ZSCL). ZSCL outperforms the pre-trained model on downstream tasks and maintains its zero-shot transferability to other tasks.
paper:
blog:
Introducing
#ICLR2022
Concurrent Adversarial Learning for Large-Batch Training
Motivation: Large-batch training has become a widely used technique when training neural networks with a large number of GPU/TPU processors.
Check our new paper at
#NeurIPS2022
Random Sharpness-Aware Minimization
We propose a novel random smoothing-based SAM (R-SAM) algorithm. R-SAM essentially smooths the loss landscape and improves the approximation of the inner maximization.
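A minimal sketch of the general recipe on a toy quadratic loss (my paraphrase of the idea, not the paper's exact algorithm; `rho`, `sigma`, and the learning rate here are illustrative):

```python
import numpy as np

def loss(w):
    return 0.5 * (w ** 2).sum()   # toy quadratic loss

def grad(w):
    return w                      # its analytic gradient

def rsam_step(w, rng, lr=0.1, rho=0.05, sigma=0.01):
    """One R-SAM-style step: add a small random perturbation to the
    weights (smoothing), then take the SAM inner ascent step and
    descend from that perturbed point."""
    w_s = w + sigma * rng.standard_normal(w.shape)    # random smoothing
    g = grad(w_s)
    eps_adv = rho * g / (np.linalg.norm(g) + 1e-12)   # inner maximization (approx.)
    return w - lr * grad(w_s + eps_adv)

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
for _ in range(100):
    w = rsam_step(w, rng)
print(loss(w))  # far below the initial loss of 2.5
```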
Excited to introduce our
#ICCV2023
paper Dataset Quantization (DQ). DQ achieves lossless training performance with a 2% data keep ratio on language tasks and a 60% data keep ratio on vision tasks. Just check out our paper and project:
The major source of the energy cost for training AI models comes from moving the data?
Communication costs 10x more energy than computation. Please correct me if I'm wrong :-)
For GPT-3:
The communication energy cost is 4.7e+26 PJ.
The computation energy cost is 3.6e+25 PJ.
PyTorch implementation of LARS for ImageNet:
PyTorch implementation of LAMB for ImageNet:
Both of them can achieve at least 76.7% accuracy in 90 epochs for both large batch sizes and small batch sizes.
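For reference, the layer-wise LARS update can be sketched like this (a simplified single-layer step without momentum; `trust_coef` and the epsilon are illustrative):

```python
import numpy as np

def lars_update(w, g, lr=0.1, weight_decay=1e-4, trust_coef=0.001):
    """One LARS step for a single layer. The trust ratio rescales the
    learning rate by ||w|| / ||g||, so every layer's step size is
    proportional to its own weight norm -- the key to stable
    large-batch training."""
    g = g + weight_decay * w                     # weight decay folded into g
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    local_lr = trust_coef * w_norm / (g_norm + 1e-12) if w_norm > 0 else 1.0
    return w - lr * local_lr * g

w_new = lars_update(np.ones(4), np.full(4, 0.5))
print(w_new)
```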
Our new paper: ONES automatically manages the elasticity of each AI job based on the workload to maximize GPU utilization and improve scheduling efficiency. Experiments on 64 GPUs show great results. This paper will appear at
@Supercomputing
#SC21
I just published Embedding Training With 1% GPU Memory and 10 Times Less Budget, an Open Source Solution for Super-Large Recommendation Model Training on a Single GPU
FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders
abs:
Compared to the previous SOTA, FaceMAE consistently reduces the error rate by at least 50% on LFW, CFP-FP, and AgeDB
Chinese Academy of Sciences released a benchmark for fast AI training. They are not the first team to do this. MLPerf is already a huge success. But they have a good summary of how researchers reduced the ImageNet training time from 29 hours to 1 minute.
Neural Network Diffusion
Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a
Thrilled to share that I will be joining ETH Zurich (
@ETH_en
) as an assistant professor in the CS department (
@CSatETH
). Super excited to move to Switzerland this autumn and work with the amazing students and faculty.