Robert Nishihara Profile
Robert Nishihara

@robertnishihara

6,597 Followers · 660 Following · 96 Media · 1,376 Statuses

Co-founder @anyscalecompute . Co-creator of @raydistributed . Previously PhD ML at Berkeley.

Joined March 2009
@robertnishihara
Robert Nishihara
2 years
I remember in 2016 when @ApacheSpark set the record for sorting 100TB in the most cost-efficient way ($144 in 2016, $115 in today's prices). Today, @raydistributed broke the $1 / TB barrier and set the world record at $97! 🔥📈🥳🎂🎗️🥂
6
70
515
@robertnishihara
Robert Nishihara
9 months
Function calling has been a massive gap in the open source ecosystem (and the most common feature request). We benchmarked function calling on a variety of open and proprietary models. Impressively, Mistral-7B performs on par with GPT-3.5. Here's how they stack up 🤯🤯
@anyscalecompute
Anyscale
9 months
We're announcing new features and models today. 🔵 JSON mode ⚫️ function calling Try them out with our API.
4
13
168
19
70
501
@robertnishihara
Robert Nishihara
1 year
This in-depth case study sheds light on when you can achieve GPT-4 level performance with a fine-tuned 7B parameter model. Take SQL generation as an example. Accuracy 🧿 Llama-2-7B: 3% 🧿 GPT-4: 79% 🧿 Llama-2-7B (fine-tuned): 86% Out of the box, GPT-4 crushes Llama-2
@CyrusHakha
kourosh hakhamaneshi
1 year
🚀 Exploring Llama-2’s Quality: Can we replace generalist GPT-4 endpoints with specialized OSS models? Dive deep with our technical blogpost to understand the nuances and insights of fine-tuning OSS models. 🔗 🧵 Thread 1/N👇
16
117
539
7
94
465
@robertnishihara
Robert Nishihara
2 years
While everyone's talking about training giant models, companies like @Instacart are quietly achieving 10x performance improvements by training and deploying many smaller models. Here's how they're doing it. 🔥🔥
4
69
420
@robertnishihara
Robert Nishihara
1 year
We've built a ton of #LLM applications recently. Reasoning about performance & feasibility is painful without reference points. Here are the reference points we use to anchor our intuition (inspired by @JeffDean 's "Numbers every engineer should know").
14
94
391
@robertnishihara
Robert Nishihara
11 months
An important systems bottleneck when working with LLMs is model loading time, but if you get the details right, you can speed up standard implementations by around 20x (over 10 minutes down to around 35 seconds for Llama-2-70B). There are a few bottleneck numbers to think
@cdnamz
Cade Daniel 🇺🇸
11 months
How long does it take to download Llama2 70B? On the 4x 25 Gbps NICs that aws.p4de's have, it should take ~10s. Yet in production we've observed much higher times, which makes autoscaling less responsive + more expensive. This blog post shows how we've reduced download & init
1
37
231
2
60
382
@robertnishihara
Robert Nishihara
9 months
Faster Mixtral? Much more to come here. We make deep investments in open source AI. If you'd like to help build open source AI or optimize LLM performance, join us at @anyscalecompute . DM me 🚢
@woosuk_k
Woosuk Kwon
9 months
We've just released v0.2.5 which includes this performance improvement (contributed by Antoni at @anyscalecompute ). Please try it out!
0
4
38
4
25
268
@robertnishihara
Robert Nishihara
2 years
One of our goals with @raydistributed has been to provide a great off-the-shelf experience for beginners as well as the performance and flexibility required by power users. @OpenAI is on the "power users" end of the spectrum.
2
49
303
@robertnishihara
Robert Nishihara
2 years
Exciting to see Quokka at the top of Hacker News (written by Ziheng Wang). In ~1000 lines of Python, Quokka is a high performance fault-tolerant query engine built on 1⃣ Ray ( @raydistributed ) - distributed execution 2⃣ Polars - fast dataframes 3⃣ Arrow ( @ApacheArrow ) - fast I/O
7
38
295
@robertnishihara
Robert Nishihara
10 months
OpenRLHF is a high-performance RLHF training framework based on @raydistributed and DeepSpeed.
3
62
275
@robertnishihara
Robert Nishihara
4 months
Lots of things 😆 There have been a handful of rewrites from scratch over the years with @raydistributed (actually from scratch). Some things that come to mind 💻 Programming languages: First prototype was in Rust. Then C++ (heavy multithreading with gRPC as the core RPC
@pschafhalter
Peter Schafhalter
4 months
@robertnishihara @mrry Hi Robert, considering all the lessons learned building Ray, are there any changes you would have made back when you first started the project? Personally, I always wondered whether the flexibility of dynamic task graphs would eventually lead to performance bottlenecks.
0
0
5
7
41
238
@robertnishihara
Robert Nishihara
1 year
I was surprised by how Llama-2 stacks up against GPT-3.5 and GPT-4 on getting the facts right. We investigated factuality in the context of summarization. Summarization is one of the most immediately practical applications of LLMs. Good summaries have a few key properties: 1⃣
@waleedk
Waleed Kadous
1 year
📊 Case study of Llama-2’s capability in summarization tasks. 📌 TLDR: In summarization; Llama-2-70b is as factual as GPT-4 while being 30x cheaper 🧵 👇
9
64
306
5
27
210
@robertnishihara
Robert Nishihara
1 year
Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in
8
45
207
@robertnishihara
Robert Nishihara
9 months
Some folks noticed an interesting extension to @OpenAI 's JSON capabilities. With @anyscalecompute , not only can you force the model to generate JSON, you can also specify the exact schema. Specify the types, include arrays, nest objects together, ... Super powerful for making
@jxnlco
jason liu
9 months
Landing a PR to support @anyscalecompute 's new json schema mode
2
2
36
9
24
195
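As a rough illustration of what schema-constrained JSON mode looks like from the client side: the sketch below only assembles a request body offline. The `response_format` field layout and the model name are assumptions modeled on OpenAI-compatible JSON-mode APIs of the time, not a documented Anyscale spec.

```python
import json

# Hypothetical schema: typed fields, an array, and a nested object,
# matching the capabilities described in the tweet.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "address": {  # nested object
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
    "required": ["name", "tags"],
}

# Assumed request shape for an OpenAI-compatible endpoint (field names
# are illustrative, not an exact API reference).
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "Extract the entity as JSON."}],
    "response_format": {"type": "json_object", "schema": schema},
}

body = json.dumps(payload)  # this is what would be POSTed to the endpoint
```

The server would then constrain decoding so that the returned message is guaranteed to parse against `schema`.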
@robertnishihara
Robert Nishihara
2 years
#RaySummit is happening in 1 week! If you want to learn how companies like @OpenAI , @Uber , @Cruise , @Shopify , @lyft , @Spotify , and @Instacart are building their next generation ML infrastructure, join us!
#RaySummit is almost here! Don’t miss out on: 🌁 In-person networking in SF 🎒 3 in-depth Ray training sessions ⚙️ 40+ technical sessions and lightning talks 🎤 Speakers from @MetaAI , @Spotify , @IBM & more ...and much more!
1
8
19
16
64
174
@robertnishihara
Robert Nishihara
1 year
Just tried out Code Llama (34B) on Anyscale Endpoints. Impressive work from @MetaAI , and I'm proud to see our team at @anyscalecompute ship this the same day. Try it out:
8
39
182
@robertnishihara
Robert Nishihara
4 years
We began @raydistributed several years ago at @UCBerkeley (in @ucbrise , @amplab , and @berkeley_ai ). Today there are 300+ contributors from 75+ companies. Learn how companies use Ray to build scalable AI applications in production at #raysummit !
5
63
168
@robertnishihara
Robert Nishihara
1 year
Nearly all LLMs will be multi-modal. Multi-modality is another 10x UX improvement in the same way that chat was. But multi-modality is hard to do, and it's expensive. This article by @BytedanceTalk gives a taste of where things are headed (and how they're used in TikTok).
@BytedanceTalk , the company behind TikTok, uses Ray for fast & cheap offline inference with multi-modal #LLMs . They generate embeddings for a staggering 200 TB of image and text data using a model with >10B parameters. 🧵 Thread below 👇
2
31
119
1
29
172
@robertnishihara
Robert Nishihara
1 year
As far as I know, #Alpa + #Ray is the most performant and scalable way to train LLMs with JAX (tested on up to 1024 A100 GPUs). For OPT-175B, the throughput is 21% higher than @Meta 's original work (179 TFLOPs / GPU versus 147). That's *without* manual partitioning. 🤯🤯
ICYM our blogs on Ray and Generative AI. We have a three-part series on how to use Ray to productionize common generative AI model workloads. Here are parts 1 and 2: 👉 👉 #Ray for #GenerativeAI #workloads
0
12
56
3
29
171
@robertnishihara
Robert Nishihara
11 months
I'm so proud of what we launched last week at Ray Summit 🤩🤩🤩🤩🤩🤩🤩 🎉 Anyscale Endpoints supports fine-tuning 🎉 Our LLM fine-tuning API. Fine-tuning is essential for cost reduction. Fine-tuning can enable task-specific performance superior to GPT-4 at 1/300th the cost 🤯
13
30
167
@robertnishihara
Robert Nishihara
9 months
Today we're releasing new @anyscalecompute features. ☑️ JSON mode ☑️ function calling ☑️ new models This has been a massive gap in the open source LLM ecosystem. 📜 JSON mode: Outputs valid JSON based on your schema requirements. 📞 Function calling: Lets the LLM choose a
6
19
141
@robertnishihara
Robert Nishihara
1 year
Amazon released LightGPT on May 24. We deployed it in Aviary in 5 minutes. So what? Productionizing AI requires models + serving infra. Progress on open source *models* has been astonishing, but open source *serving infrastructure* has lagged behind.
4
30
145
@robertnishihara
Robert Nishihara
1 year
We build a lot of our own RAG-based LLM applications internally. One is a chatbot designed to answer questions about @raydistributed (which we've actually used to improve our own documentation 😲). This is a thorough guide to building and productionizing RAG applications that
@GokuMohandas
Goku Mohandas
1 year
Excited to share our production guide for building RAG-based LLM applications where we bridge the gap between OSS and closed-source LLMs. - 💻 Develop a retrieval augmented generation (RAG) based LLM application from scratch. - 🚀 Scale the major workloads (load, chunk, embed,
33
266
1K
3
18
147
@robertnishihara
Robert Nishihara
9 months
Thrilled to announce some new features on Anyscale Endpoints! @anyscalecompute 🕸️Embedding endpoints: Use gte-large via an OpenAI compatible API to build your RAG applications at $0.05/M tokens - HALF the cost of OpenAI & Cohere. 🎛️ Llama2 70B fine-tuning: Now you can also fine
5
17
134
@robertnishihara
Robert Nishihara
9 months
If you want to compare OpenAI side by side with open models (Llama 2, Mistral, Zephyr, ...), check out Anyscale Endpoints. We provide an OpenAI-compatible API (for inference and fine-tuning).
@YirenLu
Yiren Lu
9 months
One consequence of the OpenAI drama is that companies will be taking a much closer look at fine-tuning open-source models as a replacement for third-party APIs. It used to be a cost/performance argument, but if you can't trust the API provider to have employees in a week...
3
4
37
2
19
120
@robertnishihara
Robert Nishihara
10 months
One of the most common asks we get is for public (and reproducible) performance benchmarks. LLM inference performance benchmarks are subtle, and this is a rapidly evolving space, so numbers quickly become stale. But to make comparisons, we need to be talking about the same
4
32
121
@robertnishihara
Robert Nishihara
7 months
This is the first hands-on, intensive, two-day bootcamp for learning to build RAG applications. Cohosted by @pinecone and @anyscalecompute (also featuring lessons from experts at @LangChainAI , @vercel , and others). Nearly every AI application will be a RAG application, and
@anyscalecompute
Anyscale
7 months
Ready to hear from #RAG experts at @LangChainAI @vercel @pinecone @anyscalecompute and get hands-on with intensive guided trainings? The 2-day RAG Developer Bootcamp is for you! Learn more & register now 👉 #llm #ml #rag #ai #vectordatabase #ray #pinecone
4
16
56
10
24
116
@robertnishihara
Robert Nishihara
1 month
This migration began 4 years ago 🫢 Not our typical Ray use case, but so impressive and it illustrates Ray's versatility. Also, it was worth it because they're saving over $100 million annually 😇 Some fascinating excerpts. 2016: Amazon aims to remove all dependencies on
We don't hear the term *exabyte* too frequently. This is an impressive use case.
0
12
52
5
15
117
@robertnishihara
Robert Nishihara
29 days
This is a HUGE update for us! I’ve spent a ton of time with @KeertiMelkote over the past few months and the energy is through the roof. Very very few founders have done what he’s done (starting a business in his garage and scaling it to over $5B in revenue). AI is in its
@anyscalecompute
Anyscale
29 days
Today, we’re welcoming @KeertiMelkote as CEO of Anyscale!
0
2
46
12
7
103
@robertnishihara
Robert Nishihara
10 months
Fine-tuning is here to stay. Chatbots are the most common LLM application today, but we are going to inject AI into every nook and cranny, and this will mean *many* LLM calls working in concert to power applications (some of our debugging features on Anyscale involve composing
@adithyan_ai
Adithyan
10 months
I burned 🔥 $2000 in fine-tuning so you don't have to. I fine-tuned models with @OpenAI and @anyscalecompute API endpoints on 50 million tokens. Here are the results I wish I knew before getting into fine-tuning. If you just want a quick snapshot, look at the figure. A longer
32
78
691
1
14
94
@robertnishihara
Robert Nishihara
3 months
One of @vllm_project 's strengths is that it exposes the ability to trade off latency and throughput. However, higher qps regimes cause significant latency degradation. The underlying reason has to do with inference taking place in two stages: prefilling (processing the input
@anyscalecompute
Anyscale
3 months
Recently, we’ve contributed chunked prefill to @vllm_project , leading to up to 2x speedup for higher QPS regimes! In vLLM, prefilling, which fills the KV cache, and decoding, which outputs new tokens, can interfere with each other, resulting in latency degradation. 1/n
4
23
97
2
14
89
@robertnishihara
Robert Nishihara
4 months
Ray originally started with just the "task" API for executing Python functions asynchronously (with some resemblance to systems like Dask, Celery, PySpark, etc). Actually, the system most closely resembling Ray's task API is CIEL (built by @mrry ). That
@raydistributed
ray
4 months
Ray operates at two levels: Ray Core, which scales Python functions and classes with tasks and actors, and its libraries, offering easy-to-use abstractions tailored for ML workloads. #Ray #ML #DistributedComputing
0
9
53
2
7
84
@robertnishihara
Robert Nishihara
9 months
Open models have made astounding progress in 2023. Llama, Mistral, Zephyr, ... much more to come in 2024. Try them side by side with OpenAI.
@martin_casado
martin_casado
9 months
OS AI has never been more important. Ever.
33
102
647
0
15
78
@robertnishihara
Robert Nishihara
10 months
If you are investing in LLM infrastructure and wondering what the most cost efficient way to run open LLMs is, check out Anyscale Private Endpoints! It's like OpenAI, but 🎯 for open models (Llama-2, Mistral, ...) 🎯 it runs in your cloud, private for your business 🎯 it's
5
10
76
@robertnishihara
Robert Nishihara
11 months
Great LLM inference survey from @huggingface and I'm delighted to see that Anyscale Endpoints is the lowest price point on the market for Llama-2-70B. 🤗
@_philschmid
Philipp Schmid
11 months
Yesterday, @awscloud released Bedrock as GA! Amazon Bedrock is a new AWS service that gives you access to Foundation Models ( @Anthropic , @cohere ,…) with a token-based pricing. 🆕 Lets compare the pricing to @OpenAI @Google and others. 🧶
8
62
226
2
16
73
@robertnishihara
Robert Nishihara
2 years
We are *beyond excited* about this and are so so lucky to be working with @DynamicWebPaige . This is an amazing development for @raydistributed ! Not many people understand ML and developers like she does.
@DynamicWebPaige
👩‍💻 Paige Bailey
2 years
❤️ Am beyond excited to share that this week begins a new adventure, leading developer experience for @RayDistributed at @AnyscaleCompute . Ray is an open-source project that gives users the ability to scale, and serve, *any* compute-intensive #Python workload.
30
13
265
5
6
71
@robertnishihara
Robert Nishihara
9 months
The pace of progress in the open source community is astounding. Impressive work from the Mistral team. The smaller 7B Mistral model surprised us in a lot of ways (e.g., matching GPT-3.5 in function calling quality). I'm excited to see what people build with the new 8x7B model.
@anyscalecompute
Anyscale
9 months
We’re excited to announce the official @MistralAI Mixtral 8x7B model on Anyscale Endpoints, offering the best price on the market with an OpenAI compatible API. 💸 Pricing: $0.5 / million tokens 📆 Coming soon: JSON mode and function calling Try out Mixtral on Anyscale
32
75
684
1
13
69
@robertnishihara
Robert Nishihara
10 months
Nice to see @raydistributed in Amazon's (JARK) stack for generative model serving. - Jupyter - Argo - Ray - Kubernetes
0
15
67
@robertnishihara
Robert Nishihara
9 months
The Llama Guard model is now available on Anyscale Endpoints. Get started here: Example:
@AIatMeta
AI at Meta
9 months
At release, Purple Llama includes: - CyberSecEval - Llama Guard model - Tools for insecure code detection & testing for cyber attack compliance We're also publishing two new whitepapers outlining this work. Get Purple Llama ➡️
5
17
63
2
22
58
@robertnishihara
Robert Nishihara
2 years
Pretty astounding talk from @DhruvMadeka at @NeurIPSConf about how @amazon was able to use @raydistributed to optimize inventory. This appears to enable a 12% reduction in inventory across Amazon 😮😮🤯
4
9
56
@robertnishihara
Robert Nishihara
3 months
One of the worst bugs that @pcmoritz and I debugged was a deadlock involving multiple processes across multiple machines each waiting on each other. We sat in a conference room and shared our screen on a massive projector. We had 6 terminal windows open, each one ssh'ed to some
@raydistributed
ray
3 months
🚀 Announcing the Ray Distributed Debugger! 🚀 An integrated debugging experience within VSCode. 1⃣ Set breakpoints to pause tasks and inspect variables. 2⃣ Post-mortem debugging: Analyze state after an error. More:
0
10
39
3
5
54
@robertnishihara
Robert Nishihara
1 year
This is very impressive. "Introducing Ray Serve dramatically improved our production ML pipeline performance, equating to a ~50% reduction in total ML inferencing cost." 💰💰 @Samsara will be speaking in depth about how they scale AI in a cost-efficient manner at #RaySummit .
1
9
51
@robertnishihara
Robert Nishihara
4 months
Ray is often compared to systems like Apache Spark, but Spark is more analogous to Ray Data, Ray's data processing library. Ray is architected as an ecosystem of libraries (for data, training, inference, etc) built on top of a flexible core system.
@raydistributed
ray
4 months
Ray is emerging as a standard for AI workloads, powering AI at companies like OpenAI, Uber, and Netflix. What sets Ray apart is its rich ecosystem of libraries tailored for various distributed computing tasks across the AI lifecycle.
0
8
31
1
8
51
@robertnishihara
Robert Nishihara
10 months
We updated our production RAG application guide with a number of new sections: ☑️ When to fine-tune embeddings ☑️ When to augment vector-based retrieval with traditional lexical search ☑️ When to rerank retrieved context ☑️ How to update & reindex as data changes Importantly,
@GokuMohandas
Goku Mohandas
10 months
Added some new components (fine-tuning embeddings, lexical search, reranking, etc.) to our production guide for building RAG-based LLM applications. Combination of these yielded significant retrieval and quality score boosts (evals included). Blog:
7
50
211
0
14
49
@robertnishihara
Robert Nishihara
1 year
A big roadblock with adopting LLMs is just "getting started". A lot of businesses are rethinking ML infrastructure that wasn't designed to support LLMs. Ray 2.4 makes it straightforward (copy & paste) to get started with LLMs - fine-tuning / training - serving - batch inference
Announcing Ray 2.4.0: Infrastructure for LLM training, tuning, inference, and serving. 🧠 LLM features 💽 Ray data for ease of use & stability 📊 Serve observability 🤖 RLlib’s module for custom reinforcement learning 🏢Ray scalability for large clusters
0
40
166
2
6
47
@robertnishihara
Robert Nishihara
4 months
This is a fantastic read on Uber's 8-year AI journey. From (1) predictive ML on tabular data, to (2) adopting deep learning, to (3) venturing into generative AI. It's amazing to see that @raydistributed has played a role in enabling deep learning and LLM training at Uber.
@UberEng
Uber Engineering
4 months
Learn about @Uber 's journey from predictive to generative AI, all while supporting 10 million real-time predictions per second at peak. Read more: #UberEngineering #UberEng
1
10
32
2
9
46
@robertnishihara
Robert Nishihara
4 months
Data ingest is a bottleneck for training (as you scale). @metaai wrote a great blog a while back outlining these challenges and how they solved them (by scaling data ingestion and training independently). As far as I know, Ray is the only open source framework that enables
@raydistributed
ray
4 months
Here is part 2, zooming way in on data preparation (along with runnable code).
2
6
19
0
10
45
@robertnishihara
Robert Nishihara
11 months
Open source models will dominate. We've been betting on open source infrastructure at @anyscalecompute from the start.
@ylecun
Yann LeCun
11 months
Open source AI models will soon become unbeatable. Period.
146
500
3K
0
3
45
@robertnishihara
Robert Nishihara
11 months
Ray started out primarily with training workloads. Since then, serving workloads have taken off, especially as cost efficiency for LLM inference is top of mind for everyone 🪙🪙
@raydistributed
ray
11 months
🎉 Announcing Ray Serve and Anyscale Services general availability! Teams at @LinkedIn , @Samsara , @AntGroup + many more have been using Ray to serve LLMs & multi-modal applications in a flexible, performant and scalable way. Read more about the GA release and how companies have
1
17
55
1
4
44
@robertnishihara
Robert Nishihara
1 year
Aviary is an open source project that makes it easy to deploy and manage multiple LLMs, especially any @huggingface LLM. Adding a new model (like LightGPT) takes 5 minutes. And new models can be contributed by anyone in the open source community!
1
9
42
@robertnishihara
Robert Nishihara
4 months
This blog covers some of the low level details of optimizing training performance (in this case for stable diffusion models, though the lessons are broader). 💽 Mixed hardware setup (A100 and A10g GPUs) 💰 Decouple encoders from U-Net 🏗️ EFA? torch.compile? FSDP? NCCL plugins?
@raydistributed
ray
4 months
We pretrained a stable diffusion model on 2 billion images for under $40K. Here's what we learned.
1
16
51
1
6
43
@robertnishihara
Robert Nishihara
3 months
The core Ray API has remained the same since we first built it (with roughly two major API additions), but the library ecosystem on top has pivoted wildly. The vision was always somewhat analogous to the Python ecosystem: (1) build flexible lower level primitives (like functions
@anyscalecompute
Anyscale
3 months
Ray was originally envisioned as a scalable version of Python. The analogy applies at two levels. 1⃣ Python’s core primitives are functions and classes. Ray’s core primitives are tasks and actors, which map these concepts into the distributed setting. 2⃣ Python’s strength is in
0
5
19
0
18
40
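The Python analogy above can be sketched on a single machine with just the standard library. To be clear, this is not Ray code: `ThreadPoolExecutor` futures stand in for Ray tasks, and an object owned by a one-worker executor stands in for an actor, whose method calls run serially against private state. Ray's `@ray.remote` generalizes exactly this pattern to a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def square(x):          # "task": a stateless function
    return x * x

class Counter:          # "actor": a stateful class
    def __init__(self):
        self.n = 0
    def incr(self):
        self.n += 1
        return self.n

# Submit tasks and gather results (analogous to square.remote(i) / ray.get).
futures = [pool.submit(square, i) for i in range(4)]
results = [f.result() for f in futures]

# A single-worker executor serializes calls, like an actor's mailbox.
actor_pool = ThreadPoolExecutor(max_workers=1)
counter = Counter()
counts = [actor_pool.submit(counter.incr).result() for _ in range(3)]
```

Here `results` is `[0, 1, 4, 9]` and `counts` is `[1, 2, 3]`; the actor's state persists across calls, which is the property tasks alone don't give you.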
@robertnishihara
Robert Nishihara
6 months
Saturday morning is the best time to learn about RAG.
@anyscalecompute
Anyscale
6 months
🚀Join @Marwan1112 for a training on evaluation-driven development! ✅ Develop a basic #RAG app w/ Python ✅ Create an evaluation dataset ✅ Evaluate retrieval & overall quality ✅ Explore trade-offs between quality & cost Space is limited - sign up now
0
1
2
2
3
40
@robertnishihara
Robert Nishihara
8 months
More function calling! This time with Mixtral.
@anyscalecompute
Anyscale
8 months
🔥 Mixtral-8x7B JSON Mode and Function Calling API is now available on Anyscale Endpoints! Empirically, we observed noticeable improvements in response to tool messages by Mixtral MoE, compared @MistralAI 7B. 🚀 👇 Try it out:
5
12
104
1
3
41
@robertnishihara
Robert Nishihara
8 months
Curious how LLM providers compare on performance (e.g., AWS Bedrock, Fireworks, Replicate, Together, Anyscale)? Two key metrics: 🚅 Time to first token 🚢 Inter-token latency And of course, end-to-end latency can be derived from these two numbers. Importantly, the code and
@anyscalecompute
Anyscale
8 months
📈We’re excited to introduce the LLMPerf leaderboard: the first public and open source leaderboard for benchmarking performance of various LLM inference providers in the market. Our goal with this leaderboard is to equip users and developers with a clear understanding of the
10
39
161
3
13
38
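The derivation mentioned in that tweet is simple arithmetic. Under the assumption that, after the first token, output tokens arrive at a constant inter-token latency, end-to-end latency falls out of the two leaderboard metrics directly:

```python
def end_to_end_latency(ttft_s, itl_s, output_tokens):
    """End-to-end latency from the two benchmark metrics.

    Simple model: the first token arrives after the time-to-first-token
    (TTFT), and each subsequent token after one inter-token latency (ITL).
    """
    return ttft_s + itl_s * (output_tokens - 1)

# Illustrative numbers: 0.5 s TTFT, 20 ms/token ITL, 151 output tokens
# -> 0.5 + 0.02 * 150 = 3.5 s end to end.
latency = end_to_end_latency(0.5, 0.02, 151)
```

This is why the leaderboard only needs to publish TTFT and ITL: any end-to-end figure for a given output length is derivable from them.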
@robertnishihara
Robert Nishihara
3 months
For people who started doing ML over ten years ago, some of these trends are big shifts. One of the most interesting is the role of AI in dataset preparation. In training, the quality of the resulting model often hinges on the quality of the training data, which is why a ton of
@robertnishihara
Robert Nishihara
3 months
The core Ray API has remained the same since we first built it (with roughly two major API additions), but the library ecosystem on top has pivoted wildly. The vision was always somewhat analogous to the Python ecosystem: (1) build flexible lower level primitives (like functions
0
18
40
2
5
39
@robertnishihara
Robert Nishihara
5 years
Our startup, @anyscalecompute , recently announced our initial round of funding. We're working in the ML and distributed systems space and looking for a wide variety of roles! If you're at all interested in working together, please send me a message!
4
2
39
@robertnishihara
Robert Nishihara
3 months
We're adding a new vLLM track to Ray Summit. The contributor community around vLLM has exploded recently, and vLLM is one of the frameworks most commonly used along with @raydistributed . Submit talk proposals here:
@anyscalecompute
Anyscale
3 months
There has been so much excitement and activity around this topic, that we are adding a vLLM track to the Ray Summit! If you contribute to or use @vllm_project , we want to hear from you.
1
12
32
0
11
38
@robertnishihara
Robert Nishihara
10 months
@HamelHusain It's an impressive launch, but this is just the start 😀 One year from now, all of the LLMs we have today will appear to have been prohibitively expensive (and slow). From our view at @anyscalecompute , there's a massive amount of energy going into the open source ecosystem, and
3
0
36
@robertnishihara
Robert Nishihara
1 year
I still can't believe how lucky we are to be working with @GokuMohandas 😍🥳 He's one of the best educators out there, and @MadeWithML has been the entry point for so many into AI 🧑‍🏫🎗️ If you're looking to get started with AI, this course is perhaps the only one out there that
@GokuMohandas
Goku Mohandas
1 year
Beyond excited to share the biggest update yet to @MadeWithML -- based on 8+ years of helping machine learning teams get to production: - 📈 Scaling ML (+ LLMs) - 🔗 MLOps integrations - 🚀 Dev → Prod (fast) - ✅ open-source 🧵 Detailed thread below 👇
14
89
318
2
6
37
@robertnishihara
Robert Nishihara
2 months
FP8 for @vllm_project cuts latency nearly in half.
@anyscalecompute
Anyscale
2 months
We’ve recently contributed FP8 support to the @vllm_project in collaboration with @neuralmagic . With this feature, you can see up to a 1.8x reduction in inter-token latency, with >99% accuracy preservation! 1/n
2
35
105
0
9
38
@robertnishihara
Robert Nishihara
11 months
The typical path is - S3 -> disk (streamed via CPU memory) - disk -> CPU memory - CPU memory -> GPU memory There are a couple of performance issues here. 1. The detour through disk is an unnecessary hop. 2. These stages are typically done sequentially (one completes before the next
2
6
38
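The fix implied above, skipping the disk hop and overlapping the remaining stages, can be sketched with a bounded queue between a producer and a consumer thread. `download_chunks` and `upload_to_gpu` are hypothetical stand-ins for the real S3 read and host-to-device copy; only the pipeline shape is the point.

```python
import queue
import threading

def download_chunks(n_chunks, out_q):
    # S3 -> CPU memory directly, no disk detour.
    for i in range(n_chunks):
        out_q.put(f"chunk-{i}")
    out_q.put(None)                    # sentinel: download finished

def upload_to_gpu(in_q, loaded):
    # CPU memory -> GPU memory, running concurrently with the download.
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        loaded.append(chunk)

q = queue.Queue(maxsize=4)             # bounds CPU-side buffering
loaded = []
t1 = threading.Thread(target=download_chunks, args=(8, q))
t2 = threading.Thread(target=upload_to_gpu, args=(q, loaded))
t1.start(); t2.start(); t1.join(); t2.join()
```

With the stages overlapped, total load time approaches the slower of the two stages rather than their sum, which is where speedups of the magnitude quoted above come from.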
@robertnishihara
Robert Nishihara
5 months
The vast majority of businesses are in the "exploration" phase of adopting generative AI. @canva is one of a few companies that have shipped and scaled multiple generative AI products. Amazing to see the work they are doing 🤩
@raydistributed
ray
5 months
Very impressive to see how @canva is using LLMs and image generation to transform the design world.
1
7
20
0
7
37
@robertnishihara
Robert Nishihara
11 months
Congratulations to the @MistralAI team! This is a huge win for open source LLMs, especially given the Apache 2 license 🥰 For anyone looking to do inference or fine-tuning with open LLMs, check out Anyscale Endpoints (we serve Llama-2-70B for $1 / million tokens).
@GuillaumeLample
Guillaume Lample @ ICLR 2024
11 months
Mistral 7B is out. It outperforms Llama 2 13B on every benchmark we tried. It is also superior to LLaMA 1 34B in code, math, and reasoning, and is released under the Apache 2.0 licence.
52
481
3K
0
2
35
@robertnishihara
Robert Nishihara
6 months
The future of marine transit 😍 When electric hydrofoiling boats take off, eventually people will move closer to the water. Imagine commuting from Sausalito to Alameda in 10 min. @navierboat
3
3
34
@robertnishihara
Robert Nishihara
1 year
LLM throughput improvements translate directly to cost reductions, so 23x is huge (and we have a lot more in the pipeline). The key ingredients here are continuous batching and PagedAttention.
@cdnamz
Cade Daniel 🇺🇸
1 year
I wrote about a 23x improvement (!) in LLM live-inference throughput, measured on OPT-13B on A100. There are 2 new innovations which make this possible: Continuous batching & PagedAttention. Short thread below; see writeup, experiments, and results at
2
52
244
0
6
35
@robertnishihara
Robert Nishihara
2 months
Lots of deep lessons from the AI infra journey at Pinterest going back to Q1 2023. Some impressive stats - 5000+ training jobs / month - 300+ batch inference jobs / month
@PinterestEng
Pinterest Engineering
2 months
Part 2 is here! 🥳 Learn about ‘Ray Infrastructure at Pinterest’ written by Chia-Wei Chen, Raymond Lee, Alex Wang, Saurabh Vishwas Joshi, Karthik Anantha Padmanabhan and Se Won Jang. 📌
0
19
53
0
6
33
@robertnishihara
Robert Nishihara
1 year
Did you know you can fine-tune and deploy the 70B parameter Llama-2 model *without thinking about compute infrastructure* 🤯🤯🤯 Would be 100x harder without the incredible platform we've been building with @raydistributed and @anyscalecompute 😍
@CyrusHakha
kourosh hakhamaneshi
1 year
My typical schedule when a new kick-ass OSS LLM gets released ( #LLama -2): Drop everything non-customer related and go all in playing with it. Here is the result of me playing with it: On this PR, I have a single script that uses #Ray Train and #Deepspeed
3
20
74
2
5
34
@robertnishihara
Robert Nishihara
1 year
Very very excited that @RichardSocher will be speaking at #RaySummit next week! AI search is moving incredibly quickly, and @YouSearchEngine is a leader in this area. This will be a good place to learn about some of the technology powering 💡 how AI
@robertnishihara
Robert Nishihara
1 year
Ray Summit this month will be 🔥🔥 🤯 ChatGPT creator @johnschulman2 🧙‍♀️ @bhorowitz on the AI landscape 🦹‍♂️ @hwchase17 on LangChain 🧑‍🚀 @jerryjliu0 on LlamaIndex 👨‍🎤 @zhuohan123 and @woosuk_k on vLLM 🧜 @zongheng_yang on SkyPilot 🧑‍🔧 @MetaAI on Llama-2 🧚‍♂️ @Adobe on Generative AI in
8
45
207
0
8
33
@robertnishihara
Robert Nishihara
8 months
One of the lesser known challenges with building RAG applications is figuring out how to compute the embeddings. Embeddings of text, images, and other data modalities are at the heart of a RAG app. Producing ~1B embeddings can take weeks and cost tens of thousands of dollars
@anyscalecompute
Anyscale
8 months
Producing ~1B embeddings can take weeks and cost tens of thousands of dollars ($60K with OpenAI in the example below). We are thrilled to partner with @pinecone on the launch of their new serverless offering! Anyscale + Pinecone reduce the cost of computing these embeddings by
3
8
59
2
6
32
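For intuition on the price tag: a back-of-envelope calculation, with assumed numbers (roughly $0.10 per 1M tokens and ~600 tokens per document, neither taken from the thread), lands in the same ballpark as the $60K figure:

```python
# Back-of-envelope cost of embedding ~1B documents with a paid API.
# Assumed inputs (not from the tweet): $0.10 per 1M tokens and
# ~600 tokens per document; adjust both for your own workload.

def embedding_cost_usd(n_docs, tokens_per_doc, usd_per_million_tokens):
    total_tokens = n_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

cost = embedding_cost_usd(1_000_000_000, 600, 0.10)
print(f"${cost:,.0f}")  # $60,000
```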
@robertnishihara
Robert Nishihara
2 months
I've come across very few open source projects which ship weekly releases (though they do exist). Ironically, while it takes a lot of work to get to a constantly releasable state, it's less work to maintain. Previously, we were following a six-week release cycle. Every release
@raydistributed
ray
2 months
We recently moved to weekly Ray releases to ship features to our community faster. 🚀 Doing so required us to fix flaky tests and completely revamp our release process. 👊 Read more here:
1
3
10
0
7
32
@robertnishihara
Robert Nishihara
16 days
Something we're doing differently this time around: we added a #vLLM track to #RaySummit ! @vllm_project is one of the most popular inference engines, and it is often used together with @raydistributed for scaling LLM inference. Can't wait to hear from these companies about how
1
8
32
@robertnishihara
Robert Nishihara
1 year
Our team has worked incredibly hard to make Ray the easiest way to scale LLMs and generative AI. A sample of recent developments:
✅ Streaming support for real-time LLM inference
✅ Distributed checkpointing for large model training
✅ Tutorials for LLM fine-tuning and serving
Ray 2.6.1 was released with:
🎏 Streaming responses in Serve for real-time capabilities
📀🏃‍♀️ Ray Data streaming integration w/ Train
🏃‍♀️☁️ Distributed training & tuning sync with cloud storage persistence
🤖 Alpha release of the Multi-GPU Learner API
📙 Ray Gallery examples 👇
1
3
22
2
7
32
@robertnishihara
Robert Nishihara
1 year
Curious how LoRA compares to full-parameter fine-tuning? The principal trade-off with LoRA is straightforward: you may give up some model quality, but you gain the ability to serve many models more efficiently. The question is how much quality. The answer is nuanced. 👇
@CyrusHakha
kourosh hakhamaneshi
1 year
🤔 Fine-tuning LLMs: LoRA or Full-Parameter? Which should you choose? Uncover the insights in our latest technical blog. 🔗 Link: 🧵 Thread (1/N) 👇
4
44
212
1
1
32
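For readers unfamiliar with the mechanics behind that trade-off: LoRA freezes the pretrained weight W and learns a low-rank update, which is why many fine-tunes can be served cheaply — only the small factors differ per model. A minimal, framework-free sketch (illustrative sizes and values, not from the blog post):

```python
# Minimal LoRA sketch (pure Python, no frameworks): instead of updating a
# full d_out x d_in weight matrix W, train two low-rank factors
# B (d_out x r) and A (r x d_in), and use W_eff = W + (alpha / r) * B @ A.
# Serving many fine-tunes then means swapping only A and B per task.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    r = len(A)                      # rank = number of rows of A
    scale = alpha / r
    BA = matmul(B, A)               # low-rank update, shape of W
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

# Tiny 2x2 example with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]        # frozen pretrained weight
A = [[0.5, 0.5]]                    # trainable, shape (r, d_in)
B = [[0.0], [2.0]]                  # trainable, shape (d_out, r)
print(lora_effective_weight(W, A, B, alpha=1))  # [[1.0, 0.0], [1.0, 2.0]]

# The efficiency side of the trade-off, e.g. one 4096x4096 layer at rank 8:
d, r = 4096, 8
print(d * d, "->", r * (d + d))     # 16777216 -> 65536 trainable parameters
```

The quality side of the trade-off (how much accuracy the low-rank constraint costs) is exactly what the linked blog post measures.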
@robertnishihara
Robert Nishihara
9 days
1000 contributors!!! 🤯🤯 One of the earliest contributors to Ray was @AntGroup / @Alipay . They were the first serious Ray user, and they contributed a lot to the hardening of Ray in production. Today, they use Ray for a huge range of workloads, from batch inference to model
🚀Ray just hit 1000 contributors!  Our community is thriving 🙌—all thanks to you! Let’s keep pushing the boundaries of AI together!🤝🌟 #opensource #AI
1
5
17
1
5
32
@robertnishihara
Robert Nishihara
1 year
@Instacart just revamped their ML infrastructure. 🏆 The primary use cases are:
1⃣ Training thousands of small to mid-sized models
2⃣ Data parallel deep learning with large datasets
3⃣ Scalable batch inference
1
9
32
@robertnishihara
Robert Nishihara
3 months
We published a 3-part series on Stable Diffusion:
Part 1: Cost reduction
Part 2: Preprocessing billions of images
Part 3: Training at scale
@raydistributed
ray
3 months
See how we optimized large-scale ML training in Part 3 of our Stable Diffusion series! We used Ray Train, Ray Data, and PyTorch Lightning to train on 2B images with fault tolerance, data streaming, and advanced strategies like FSDP and DDP. Read more:
1
10
30
0
9
30
@robertnishihara
Robert Nishihara
2 years
I hate watching myself talk, but I'm immensely proud of this demo that our team put together at the #RaySummit showing how to build an ML application with @raydistributed and @anyscalecompute . 😍
1
3
28
@robertnishihara
Robert Nishihara
2 years
Step-by-step instructions for building an ML platform using @raydistributed and @kubeflow on @googlecloud . Ray scales compute for data ingest / preprocessing, training, serving, etc. Kubeflow provides notebooks, orchestration, authentication, and more.
1
3
30
@robertnishihara
Robert Nishihara
8 months
@ocolegro HF is a great choice to host the data. As for running the embedding computations (or training or inference), we do this regularly with @raydistributed and @anyscalecompute . Here's an example where ByteDance runs embedding computations and inference on 200TB of data.
1
7
30
@robertnishihara
Robert Nishihara
1 year
The Llama team has changed the AI landscape in a very short period of time. It's remarkable work! 🤯
The team @MetaAI has done a tremendous amount to move the field forward with the Llama models. We're thrilled to collaborate to help grow the Llama ecosystem.
2
18
87
1
4
29
@robertnishihara
Robert Nishihara
8 months
@soumithchintala @anyscalecompute Thanks @soumithchintala , I agree with a lot of your feedback, and we're going to address it! A few concrete things:
⚫️ We will be adding cost as a metric (that's extremely important).
⚫️ We will be measuring latency and reliability over time. This one is a big deal. As you
1
0
28
@robertnishihara
Robert Nishihara
25 days
Many of the companies you'll hear from at #RaySummit have gone through a 5-10 year AI infrastructure journey (even longer in some cases). They've managed the migration from classical ML to deep learning, then from deep learning to generative AI, and they are gearing up for the
2
9
28
@robertnishihara
Robert Nishihara
3 years
This is a nice walkthrough of how to scale applications on Kubernetes with Ray. Blog post by Vishnu Deva from @MadStreetDen .
1
9
27
@robertnishihara
Robert Nishihara
2 months
Gen-3 Alpha is mind-blowing. Absolutely cannot wait to hear from @agermanidis about the process of building this model and the hard-earned lessons at Ray Summit 2024.
@runwayml
Runway
2 months
Introducing Gen-3 Alpha: Runway’s new base model for video generation. Gen-3 Alpha can create highly detailed videos with complex scene changes, a wide range of cinematic choices, and detailed art directions. (1/10)
254
930
4K
0
6
28
@robertnishihara
Robert Nishihara
1 month
GPU fragmentation is common when deploying multiple models on a shared pool of resources (in particular, when the number of replicas of each model scale up and down independently).
@anyscalecompute
Anyscale
1 month
4/ ↪️ With Replica Compaction, Anyscale will automatically migrate replicas into fewer nodes in order to optimize resource use and reduce costs. It does this with zero downtime to ensure applications have no interruption in traffic.
1
1
4
0
7
27
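The core of compaction is a bin-packing problem: given the GPU fraction each replica needs, how few nodes can hold them all? A toy first-fit-decreasing sketch (illustrative only; the actual Anyscale feature also live-migrates replicas with zero downtime):

```python
# Toy replica compaction: after replicas scale up and down independently,
# they end up spread thinly across nodes. A greedy first-fit-decreasing
# bin-pack shows how many nodes are actually needed.

def compact(replica_gpu_fractions, node_gpus=1.0):
    """Pack replica GPU requests onto the fewest nodes (first-fit decreasing)."""
    nodes = []  # remaining free capacity per node
    for need in sorted(replica_gpu_fractions, reverse=True):
        for i, free in enumerate(nodes):
            if need <= free + 1e-9:     # fits on an existing node
                nodes[i] = free - need
                break
        else:                           # no node had room: open a new one
            nodes.append(node_gpus - need)
    return len(nodes)

# Fragmented state after independent autoscaling: six 0.3-GPU replicas,
# possibly sitting on six different nodes.
replicas = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
print(compact(replicas))  # 2 (three 0.3-GPU replicas fit per 1-GPU node)
```

First-fit decreasing is not optimal in general, but it is a standard, simple heuristic for exactly this kind of fragmentation cleanup.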
@robertnishihara
Robert Nishihara
9 months
In addition to today's launch of @AIatMeta 's new Llama model (Llama Guard), Anyscale Endpoints also recently launched open embeddings (gte-large). Lots more in the coming weeks...
0
3
25
@robertnishihara
Robert Nishihara
1 year
1⃣ Cost-efficiency at scale means using smaller specialized models (versus massive general-purpose models).
2⃣ High response quality with smaller models means customization (e.g., fine-tuning).
3⃣ Customization is best done with open models.
4⃣ The best open models are the Llama2
1
8
26
@robertnishihara
Robert Nishihara
4 months
When scaling training on GPUs, it's common to become bottlenecked by data ingest and preprocessing. When training stable diffusion models, that "preprocessing" may include running other models to embed the text & image inputs (a pre-trained VAE and a text encoder, OpenCLIP-ViT/H)
@raydistributed
ray
4 months
We pretrained a stable diffusion model on 2 billion images for under $40K. Here's what we learned.
1
16
51
1
10
26
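The standard fix for this bottleneck is to stream: overlap preprocessing on CPU workers with training so the accelerator never waits for a full preprocessing pass. A single-machine toy sketch with a bounded queue (illustrative only; Ray Data does this across a cluster):

```python
import queue
import threading
import time

# Producer/consumer sketch of streaming preprocessing: a CPU worker prepares
# batches while the trainer consumes them, so the "GPU" never waits for a
# full preprocessing pass over the dataset.

def preprocess(batch_id):
    time.sleep(0.001)           # stand-in for tokenization / VAE / text encoder
    return f"batch-{batch_id}"

def producer(q, n_batches):
    for i in range(n_batches):
        q.put(preprocess(i))    # blocks when the buffer is full (backpressure)
    q.put(None)                 # sentinel: no more data

def train(q):
    seen = []
    while (batch := q.get()) is not None:
        seen.append(batch)      # stand-in for a training step
    return seen

buffer = queue.Queue(maxsize=4)  # bounded buffer caps memory use
t = threading.Thread(target=producer, args=(buffer, 8))
t.start()
batches = train(buffer)
t.join()
print(len(batches))  # 8
```

The bounded queue is the key design choice: it lets preprocessing run ahead of training without ever buffering more than a few batches in memory.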
@robertnishihara
Robert Nishihara
15 days
Running hands-on training sessions is something we started in the very early days of Ray. We got this idea, of course, from the @ApacheSpark community. Here are two photos from the early days:
1. The first Ray meetup, hosted at @OpenAI .
2. A tutorial we ran at an @OReillyMedia
🚨The Ray Summit Training Guide is here!🚨 Check it out and make the most out of your experience. 🎓 Explore all the classes available, 🍎 Discover personalized learning paths (LLMs, Ray Core, and more), 🎯 Reserve your spot early to avoid missing out! 🚀 #RaySummit Training
0
7
11
2
7
26
@robertnishihara
Robert Nishihara
1 year
Please submit questions you’d like to ask @bhorowitz ! We’re doing a fireside chat at #RaySummit . Topics include AI, the chip shortage, the role of open source, management, and policy / regulation. Just reply to this tweet!
5
9
26
@robertnishihara
Robert Nishihara
4 years
Congrats to the team! We're hiring *across the board*:
- engineering (including ML, infrastructure, frontend, etc.)
- site reliability engineering
- product management
- product design
- marketing
- engineering management
- solutions architect
More details at .
@anyscalecompute
Anyscale
4 years
We’re thrilled to announce our $40M Series B, led by @NEA to continue growing the ecosystem around @raydistributed . This wouldn’t have been possible without the Ray community!
2
12
45
2
6
26
@robertnishihara
Robert Nishihara
3 years
Really excited to announce GA for Anyscale as well as our Series C (with @a16z and Addition)! Our goal is to enable every developer and every team to succeed with AI without needing to worry about building and managing infrastructure.
@anyscalecompute
Anyscale
3 years
🎉 We’re thrilled to announce our $100M Series C led by @a16z & Addition + general availability of the Anyscale managed @raydistributed platform! Both are big steps forward in our mission to accelerate the scaling and productionization of #AI apps.
4
16
89
3
3
26
@robertnishihara
Robert Nishihara
29 days
Amazing work! We've been thrilled to partner with @lmsysorg and @Meta to host these models for the leaderboard on @anyscalecompute 😀
@lmsysorg
lmsys.org
30 days
Exciting news! @metaai 's Llama-3.1 results are here🔥 The Llama-3.1 series, extensively tested over the past week, has gathered over 10K community votes. Now, Llama-3.1-405B has climbed to #3 on the Overall Arena leaderboard, marking the first time an open model has ranked in
32
117
704
3
6
25
@robertnishihara
Robert Nishihara
10 months
Exactly our approach with Anyscale Endpoints. Go deep on cost and performance. Squeeze performance out of every layer of the stack.
@Suhail
Suhail
10 months
Super simple startup idea: take open source models, speed them up relentlessly, make an api, have the cheapest possible price, great uptime. In the long-run, you’ll build economies of scale w GPUs + have a process power of optimization.
70
43
800
0
3
24
@robertnishihara
Robert Nishihara
2 months
Collaboration with @lmsysorg . Step-by-step instructions for building your own model router. Key steps:
1. Generate labeled data
2. Fine-tune an LLM-based classifier
3. Run offline evals
The whole thing takes about 120 minutes. The overall goal is to direct "simple" queries to
@anyscalecompute
Anyscale
2 months
1/ 🚀 Introducing RouteLLM: a routing framework based on human preference data for routing queries between powerful proprietary LLMs and cost-effective LLMs, developed in collaboration with @lmsysorg . By intelligently selecting the best model for each query, our router models
1
34
136
0
11
25
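The shape of the resulting router is simple: a scorer estimates whether a query needs the strong model, and a threshold trades cost against quality. A toy sketch where the classifier is a crude stand-in heuristic (in RouteLLM the scorer is a model fine-tuned on human preference data, not a keyword check):

```python
# Minimal LLM-router sketch: score a query, then pick a model based on a
# cost/quality threshold. The scorer below is a hypothetical stand-in for
# the learned classifier described in the RouteLLM blog post.

def win_rate_score(query: str) -> float:
    """Stand-in for a learned classifier: crude proxy for query difficulty."""
    hard_markers = ("prove", "derive", "optimize", "why")
    score = 0.2 + 0.2 * sum(m in query.lower() for m in hard_markers)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    """Send likely-hard queries to the strong model, the rest to the cheap one."""
    return "strong-model" if win_rate_score(query) >= threshold else "cheap-model"

print(route("What's the capital of France?"))         # cheap-model
print(route("Prove why this algorithm is optimal."))  # strong-model
```

Lowering the threshold shifts traffic toward the strong model (higher quality, higher cost); raising it does the opposite, which is the knob the offline evals in step 3 are meant to calibrate.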
@robertnishihara
Robert Nishihara
1 year
Building AI applications is hard (but getting easier). @jerryjliu0 and @llama_index have done a huge amount to move AI tooling forward by solving some of the core challenges around interfacing LLMs with your data.
@llama_index
LlamaIndex 🦙
1 year
Building a production-ready LLM app is hard:
📄 How to load, parse, embed thousands of docs?
⚙️ How to deploy to prod?
We're incredibly excited to collab with @anyscalecompute : Ray can make LlamaIndex 10x faster + easily deployable to a prod server ⚡️
6
52
193
0
4
25