As someone who’s worked on the infrastructure to accelerate autonomy, it was still surreal to see everything come together in one coherent and clear vision of the future.
Looking forward to bringing this closer to reality!
Reflecting on the We, Robot event, we had:
- 19 Cybercabs and 29 Model Ys driving themselves
- 1,300 trips transporting over 2,000 guests
- Continuous operation over the full 3.5 hours
- And every trip was perfectly safe!
Optimus can now sort objects autonomously 🤖
Its neural network is trained fully end-to-end: video in, controls out.
Come join to help develop Optimus (& improve its yoga routine 🧘)
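Purely as illustration of what “video in, controls out” means, here’s a minimal, hypothetical sketch (the module, frame count, and joint count are all made up; this is not Tesla’s actual network):

```python
import torch
import torch.nn as nn

class VideoToControls(nn.Module):
    """Toy end-to-end policy: stacked video frames in, joint commands out."""
    def __init__(self, frames=8, num_joints=28):  # hypothetical sizes
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * frames, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, num_joints)
        )

    def forward(self, video):  # video: (batch, 3*frames, H, W)
        return self.policy(self.encoder(video))  # (batch, num_joints) controls
```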
Tesla AI is building next-generation autonomy on a single foundation video network that directly drives the car
Join the team and build state-of-the-art end-to-end models using massive fleet data on one of the world's largest training clusters
@Tesla
Prolly the only company with all the pieces (AI training infra, efficient edge inference, software, hardware, and batteries) “under one roof” to make humanoid robots happen.
Excited for the future!
2023 has been awesome for Optimus.
We’ve moved from an exploratory prototype (Bumblebee/Bumble-C) to a more stable, Tesla-designed platform (Optimus Gen-1).
We’ve improved our locomotion stack, frequently walking off-gantry without falls and with a faster, increasingly more
The variability and complexity of real-world driving demand lots of data, compute, and a powerful model.
Come build the infrastructure for scaling the data, compute and model!
Lots of hard work and many late nights went into the making of FSD 12.5, from across the team.
Many ideas were simplified and re-worked from first principles.
Hope everyone has a chance to try it out. It's a release we're proud of.
MLX from
@Apple
is an interesting mix of ideas/learnings from NumPy/JAX/PyTorch. Things that stood out to me:
- very simple and clean implementation, can easily look under the hood and see how things work
- lazy eval of the compute graph -> leaves the door open for all kinds of graph optimizations
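A tiny illustration of that lazy evaluation (toy example using MLX’s public API):

```python
import mlx.core as mx

a = mx.random.normal((4, 4))
b = (a @ a + 1.0).sum()  # builds the compute graph; nothing is computed yet
mx.eval(b)               # evaluation happens here, all at once
```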
The 2023 Holiday Update rolls out next week
Here’s what’s coming...
Custom Lock Sounds
Replace your vehicle’s lock sound with another sound, like a screaming goat 🐐
LAN Party on Wheels
Play your favorite games on the rear touchscreen 🎮
Rear Screen Bluetooth
Also come work with me!
“Platonic Ideal” of Machine Learning, Software Engineering and AI Accelerators.
We dive deep on all layers of the stack to run networks on car and bot with high accuracy and low latency.
The rapid progress we’ve seen in neural reconstruction is quite impressive!
Good job
@xiuming_zhang
@philduan
and team, one of the coolest features shipped this year!
Lots of exciting work to be done towards a bright and autonomous future!
If you’re excited about the intersection of ML, Systems and Infrastructure, consider applying
@Tesla_AI
Tesla 2023 recap
Made possible by the hard work of our amazing teams around the world, and each of our owners & supporters.
Thank you for helping us continue to accelerate the transition to sustainable energy!
We put a lot of effort into accurately and efficiently modeling the quantization of our int8 accelerator during fp16 training.
We also develop novel techniques/tools/evals to identify and mitigate the sensitivity of our networks to inference-time quantization.
If you’re excited by
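For flavor, the standard trick in this space is fake quantization with a straight-through estimator. A generic sketch (not Tesla’s actual implementation):

```python
import torch

def fake_quantize(x, scale):
    # Simulate int8 inference inside the float training graph:
    # round to the int8 grid, clamp, then dequantize back to float.
    q = torch.clamp(torch.round(x / scale), -128, 127)
    # Straight-through estimator: gradients flow as if quantization were identity.
    return x + (q * scale - x).detach()
```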
@Scobleizer
An accurate assessment.
What is also mindblowing is that the inference compute power needed for 8 cameras running at 36FPS is only about 100W on the Tesla-designed AI computer. This puny amount of power is enough to achieve superhuman driving!
It makes a big difference that we
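Back-of-envelope on that power figure, using only the numbers in the tweet:

```python
cameras, fps, watts = 8, 36, 100
frames_per_second = cameras * fps          # 288 camera frames per second
joules_per_frame = watts / frames_per_second
print(joules_per_frame)                    # ~0.35 J of inference energy per frame
```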
3 points in the AI Accelerator design space: GPU, TPU-like and spatial dataflow. My best attempt to compare them:
GPU (
@nvidia
@AMD
): most common; lots of HW threads + a HW thread scheduler for zero-cost context switching. High HW complexity, which enables lots of
Had so much fun chatting with my friends
@TrentonBricken
and
@_sholtodouglas.
No way to summarize it, except:
This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them.
You would be
@tim_zaman
Thanks for everything Tim! You took AI Infra to new heights at Tesla.
It was a pleasure working with you, I certainly learned a lot. Looking forward to what you do next!
Cool to see model visualization and debugging tools being open sourced!
We (and I suspect many others) have built similar tools for better understanding quantization/numerics, latency, memory, and many other details of our models as part of the deployment process.
Presenting Model Explorer, a novel graph visualization tool that streamlines the deployment of large models to on-device platforms, such as mobile phones and browsers, where visualizing conversion, quantization, and optimization data is especially useful.
A fermi estimate of AI Inference compute deployed by
@Apple
in a year:
Around 150M iPhones (13/14/Pro/Pro Max) with A15/A16 chips averaging 16 TOPS of (FP16?) compute = 2400 Exa-Flops of inference compute.
For comparison, 2M H100s (sales estimate for 2024) = 4000 BF16 Exa-Flops.
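Spelling out the arithmetic behind the estimate (same assumptions as above):

```python
apple = 150e6 * 16e12   # 150M iPhones x 16 TOPS = 2.4e21 ops/s = 2400 Exa-Flops
nvidia = 2e6 * 2e15     # 2M H100s x ~2000 TFLOPS BF16 = 4e21 = 4000 Exa-Flops
print(nvidia / apple)   # ~1.7x, i.e. the same order of magnitude
```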
Examining neural net design/architecture through the lens of information and compression (bits in/bits out, size of feature bottlenecks, bits per param, bits per sample) is often insightful and helps eliminate undesired bottlenecks.
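A toy calculation in that spirit (all numbers hypothetical, just to show the style of reasoning):

```python
params = 7e9                 # model size (hypothetical)
bits_per_param = 16          # fp16 weights
capacity_bits = params * bits_per_param        # crude bound on storable information
tokens_seen = 1e12           # training tokens (hypothetical)
bits_per_sample = capacity_bits / tokens_seen  # ~0.11 bits of capacity per token
```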
Managing a single supercomputer is hard.
Now imagine having a lot of them: we're looking for fullstack engineers to help tie everything together into our machine learning platform (GUI, tooling, job/model observability, health, etc.). js+py. Interested? DMs open
To build great AI infrastructure you have to do some AI.
Occasionally going up abstraction layers to build datasets, train models and construct evals leads to better infrastructure.
MCM (Multi-chip Module) GPU:
2017 paper from
@nvidia
on scaling up GPUs with MCM.
We’re only now (4-6 years later) starting to see MCM-based GPUs in production.
- reticle limits and silicon yield upper-bound the practical size of a single die
- must
Would be interesting to see the distribution and magnitude of “bits” ingested/processed by sensor modality and how this has shifted and increased over HW generations.
We recently introduced our 6th-generation Waymo Driver. Today, our Vice President of Engineering, Satish Jeyachandran, shares insights into how this next-gen system will drive Waymo's business forward.
So excited that FSD customers were able to catch a glimpse of what we’ve been working on. Sincere thanks to my colleagues for the many long nights and hard work that led up to this point.
Want to build end to end with us? Join the team
@Tesla_AI!
Dug up my thesis from a “past life”
@UofT
We did research at the “bottom of the stack” on high-speed die-to-die IO.
Particularly relevant for scaling AI Accelerators using advanced packaging (MCM or Silicon Interposer) like
@nvidia
Grace Hopper,
@AMD
@itsclivetime
it was a pleasure working together these last 2 years, on data pipelines, perf optimizations, quantization, low-precision training… You’re certainly one of the smartest and most creative people I’ve ever worked with. I will miss our chats about ML/HW and everything
@soumithchintala
Agree with this, everyone serious about AI is working hard to decouple themselves from $NVDA and its 70-80% margins.
Expect good returns in the near term, but things will look different 2-3+ years out, as alternative SW stacks and internal AI accelerator efforts mature.
Introducing AlphaGeometry: an AI system that solves Olympiad geometry problems at a level approaching a human gold-medalist. 📐
It was trained solely on synthetic data and marks a breakthrough for AI in mathematical reasoning. 🧵
The LLM Surgeon
paper page:
State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it
@DrJimFan
Amazing work!
Always felt like this was one of NVIDIA’s biggest edges. What other company making a HW accelerator has a world-class research org?
Anyone interested in learning about the nuts and bolts of an AI inference framework should read
@ggerganov
Single file (15K+ LOC 😮💨) with everything: tensor memory layout, quantization, vector instructions, op implementations, dispatching and more
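For a taste of what the quantization code in there looks like, here’s a simplified Q4_0-style block quantizer in Python (the real ggml layout differs in its details; this is just the idea):

```python
import numpy as np

def quantize_q4_0_like(w: np.ndarray):
    """Block quantization in the spirit of ggml's Q4_0: 32 weights per block,
    one scale per block, weights stored as 4-bit integers.
    Assumes len(w) is a multiple of 32."""
    blocks = w.reshape(-1, 32)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    safe = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(blocks / safe), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)
```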
Awesome work by the Tesla team! Real-world deployment of learning-based approaches is very hard.
Requires tight integration and excellence across all parts of the stack both HW and SW. Tesla is in a unique position to make this dream a reality.
Join the team!
So many interesting systems problems to be solved in LLM inference.
Very reminiscent of all the distributed systems that needed to be built to run current web and search infrastructure.
How do we enjoy the best of both worlds: efficient training (less communication and computation) and efficient inference (a constant-size KV-cache)?
We introduce a new efficient architecture for long-context modeling – Megalodon that supports unlimited context length. In a controlled head-to-head
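Context for the question: in a vanilla Transformer the KV cache grows with every generated token, which is exactly what a constant-size cache would fix. A minimal illustration of the baseline behavior (toy code, not Megalodon):

```python
import torch

k_cache, v_cache = [], []

def decode_step(q, k_new, v_new):
    # Every generated token appends to the cache, so attention state is O(T).
    k_cache.append(k_new)
    v_cache.append(v_new)
    K, V = torch.stack(k_cache), torch.stack(v_cache)  # (T, d), growing each step
    w = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return w @ V
```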
This is the first pass on the new chapter for ML Engineering:
The AI Battlefield Engineering - What You Need To Know
This is a WIP and your feedback for improvement is always welcome.
With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:
- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run
Zero Bubble Pipeline Parallelism
paper page:
Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling
PyTorch 2.2 is here 🎉
Featuring:
- SDPA support of FlashAttention-2
- New ahead-of-time extension of TorchInductor
- device_mesh, a new abstraction for initializing and representing ProcessGroups
- A standardized, configurable logging mechanism called TORCH_LOGS
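A minimal usage sketch of the SDPA entry point (shapes/dtypes chosen so the FlashAttention-2 kernel can be selected on supported GPUs; purely illustrative):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); fp16 on CUDA is eligible for the flash kernel
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```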
When is
@Apple
gonna put an LLM on the iPhone, periodically fine-tuned locally (on device), with some retrieval?
It would be the ultimate personal assistant + personal search engine.
Gemini Nano is super efficient for on-device tasks. Android developers can sign up for an early access program for Gemini Nano via Android AICore, and Pixel 8 Pro users can already see it rolling out in features like Summarize in Recorder and Smart Reply in Gboard + much more
Most social media (e.g.
@Twitter
/
@Reddit
) news and content platforms should be training/fine-tuning LLMs in-house, replacing the now decades-old database technologies and indexes that currently power their search.
Very encouraging to see the steady increase in viable hardware options that can handle various AI models.
At the beginning of the year, there was only one practical option: NVIDIA. Now we see at least 3 vendors providing reasonable options: Apple, AMD, and Intel. We have been
@ysmulki
did awesome work with us!
He pushed several parts of our stack forward and laid the foundation for some new and exciting areas of exploration. A lot of impact over a short period of time 👏
excited to share that I’ll be interning at
@Tesla_AI
working on Autopilot ML infra this summer! HMU if you’re in SF or the Bay Area and want to grab a coffee
Bard is much better for querying and summarizing recent research than ChatGPT, partly due to ChatGPT’s 2021 training-data cutoff and possibly due to better retrieval in Bard. Maybe it’s more even with browsing enabled.
AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI!
The MX Alliance has released the Microscaling Formats (MX) Specification v1.0 in an open, license-free format.
Read the story here:
@karpathy
Can’t wait for most DL optimizations and perf gains to happen automagically in the compiler, which is the current state of affairs for CPU programs. Feels like we’re still in the early days.
@ylecun
First in-house AI accelerator from
@MetaAI
is an interesting design point.
Seems like it will immediately provide value on business-critical workloads while establishing the chip design and compiler foundations needed internally.
Expecting fast follow-ups here.
Modern AI is really built on a stack from NVIDIA that almost no one fully understands. Spend too long peering into all the drivers and distributed libraries and you get swallowed. Anyone know the magic env vars to chant to get SHARP working over OFED? 🥲
We just introduced PyTorch 2.0 at the
#PyTorchConference
, introducing torch.compile!
Available in the nightlies today, stable release Early March 2023.
Read the full post:
🧵below!
1/5
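The one-line API it introduces, as a minimal sketch:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
compiled = torch.compile(model)  # Dynamo captures the graph, Inductor emits fused kernels
y = compiled(torch.randn(8, 512))
```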
@itsclivetime
@Apple
Yeah, a conservative estimate; also, the A17 has 2x more compute at 35 TOPS!
Also, even for on-device inference, Apple has yet to make the Neural Engine easy to run models on. Most current inference frameworks still use the CPU/GPU.
If I were Apple, I would push hard on building a great/easy
@ezyang
Two things that come to mind for me:
- ML workloads are synchronous and read-only, whereas traditional distributed systems must critically support async updates to shared state (e.g. distributed databases).
- there isn’t a significant understanding yet of gpu/accelerator