You can tell the authors of the new room temperature superconductor paper believe in their results because they published two papers about it concurrently: one with six authors, and one with only three.
The Nobel prize can be shared by at most three people.
7/ There will be way more software engineers in the future than the present. The job will just be very different: more English, less boilerplate coding. Engineers will adjust, like they did for the transition from assembly to Python.
5/ In this new world, every engineer becomes an engineering manager. You will delegate basic tasks to coding agents, and spend more time on the higher level parts of coding: understanding the requirements, architecting systems, and deciding what to build.
1/ Models will be extraordinarily good at coding, very soon. Research labs are investing more in coding + reasoning improvements than any other domain for the next model generation. Their efforts will bear fruit.
5/ Animal trials with this new protein will start soon. Still a long way to an approved drug, but it’s exciting to see drug discovery go faster with a neural network in a Colab notebook.
6/ This will lead to an era of unprecedented Software Abundance. Software has historically been difficult and expensive to create. It will soon be 10x more accessible. We will see a proliferation of “single use software”—one-off apps and websites that are only now viable.
4/ AlphaFold unlocked a different approach: it found the 3D structure of the existing protein & receptors, which was unknown. With the structure + another ML model, they saw how binding affinities would change with different mutations. This led to an ideal candidate in 8 hours.
2/ Why? Besides general AI progress, coding specifically has a unique advantage: potential for superhuman data scaling via “self play”. Models can write code, and then run it. Or write code, write a test, and check for self-consistency.
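The "write code, write a test, check for self-consistency" loop can be sketched as a tiny harness. This is a hedged illustration, not anything from the tweet: `check_self_consistency` is a hypothetical helper that simply runs model-written code and a model-written test together and reports whether they agree, which is the automatic reward signal the thread is describing.

```python
import subprocess
import sys
import tempfile

def check_self_consistency(code: str, test: str) -> bool:
    """Run model-written code against a model-written test.

    Returns True if the test passes, giving an automatic
    training signal with no human in the loop.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test + "\n")
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True)
    return result.returncode == 0

# Toy example: a "generated" function and a "generated" test.
code = "def add(a, b):\n    return a + b"
test = "assert add(2, 3) == 5"
```

In practice the sandboxing, timeouts, and reward shaping are the hard parts; the point is only that code, unlike most domains, lets you close the loop mechanically.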
9/ For one, companies that market to developers will soon start “marketing” to coding agents as well. After all, your agent might decide what cloud you use and which database you choose. Agent-friendly UI/UX (often: a good CLI) will be prioritized.
4/ As a result, software engineering will look radically different in a few years. True coding agents, which do tasks end to end, will complement today’s AI copilots. The experience will look something like giving every engineer an army of interns.
3/ This type of automatic supervision is not possible in most domains, which are facing data walls in post-training as we approach the limits of human expertise. Code is different—it can be tested empirically & automatically.
14/ Coda: I’m excited to share that (in no small part due to these predictions), I’ve joined @cognition_labs to help build Devin. I’ve been here >3 months, and Devin, while still early, is the first true glimpse I’ve seen of what the Software Abundance era could look like.
Startups that train their own large language models are starting to look like space companies. Raising millions in seed capital to afford a single training run (rocket launch). Better get it right.
1/ Soon, all products for creators will have embedded intelligence from massive language models (think Copilot in VSCode, DALL-E 2 in Photoshop, GPT-3 in GDocs). Companies making these products will need to roll their own massive language models or pay a tax to OpenAI/Google/etc.
We're assembling an 🚨LLM Red Team🚨 at @scale_AI. Come get paid to jailbreak language models with me and @goodside. If interested, email language-models@scale.com with your favorite jailbreak prompt.
I recently left @scale_AI. I'm so thankful to the team there and for @alexandr_wang's bet to acquire our startup nearly 4 years ago.
When I joined Scale, it was a single-product company building the data engine for autonomous vehicles. It's amazing to see how far Scale has come:
10/ The bar for product quality will also rise. Half-baked or feature-incomplete MVPs are less acceptable in a world where developers can ship so much faster.
A less-obvious benefit of an on-prem ML training cluster is cultural. When you own the hardware, ML engineers are incentivized to keep utilization high; with on-demand cloud instances, the incentive is to minimize costs. The former leads to more experiments and better models.
Want to play with the GPT-4 API but don't have access yet?
GPT-4 API access and playground now available in Scale Spellbook:
No CC required, signup in <1 min. We will keep this open for the first 1,000 signups. Happy hacking!
Rap Battle - 🤯 demo. Pick any two people and it will generate a rap battle on the fly, using GPT-3 for lyrics, WaveNet for vocals, and Stable Diffusion for the avatars. Sound on!
Live demo:
10/ Instead of SEO, marketers will start maximizing the log likelihood of their content being generated by an ML model. This will have unexpected consequences, like marketing data poisoning attacks ().
6/ Governments will eventually realize that having the computational infrastructure to train the largest language models is essential to national security. Within a decade we will see a new Manhattan project for AI supercomputing, that makes existing clusters look like peanuts.
11/ Testing infrastructure will be much more important & prevalent with the rise of coding agents. Both because the coding agents will write more tests, and also because they will depend on these tests to check their work.
12/ Switching costs will decline as a moat for tech companies, as agents make migrations easier. Companies will even start bundling migration-assistant coding agents when you buy their products, to streamline your adoption.
2/ Over time, companies will become stratified into Compute Rich and Compute Poor. Many Compute Poor companies will become existentially dependent on the ML models of the Compute Rich.
Hosting a Clubhouse tonight 8pm PT with @karpathy @RichardSocher @jcjohnss on recent breakthroughs in AI. Will discuss image transformers, CLIP/DALL-E, and some other cool recent papers. Join at:
8/ Generative language models will slowly replace search. Why Google something when you can get the exact answer you need, embedded in the product you’re using? We see inklings of this already with things like Copilot (). This trend has many implications.
The new TF32 float format @NVIDIAAI announced today is a big deal. Same dynamic range as fp32, same precision as fp16, with only 19 bits total and hw-accelerated. Will be the default mode of cuDNN going forward. 6x training speedup for BERT on new Ampere GPUs with no code changes!
3/ These Compute Rich companies will be the new platform gatekeepers of the coming decade. Just like Apple or FB can expel companies dependent on their ecosystems today (Epic Games, Zynga), in the future, if you lose access to your language model, your product won't function.
1/ A friend runs a biotech startup designing drugs to fight cancer. In prior work, they found that tumor cells make a protein that binds to two receptors in the body. Binding to just one of them would inhibit the tumor’s growth, but binding to both makes the tumor grow faster.
7/ The largest public AI supercomputer project in 2022 is Facebook’s AI RSC (), at ~$1B in capex. The original Manhattan project was ~$30B, the space race ~$250B adjusted for inflation. We have orders of magnitude of scale left just from increasing spend.
3/ Before AlphaFold, finding such a protein would take ~1 month: order lots of mutated DNA sequences, insert them into cells, filter the cells which bind to one receptor but not the other, and sequence those cells’ DNA.
To predict LLM progress areas, look at the new datasets. For example: Pile of Law, a 256GB dataset of legal text published in July 2022, has made it 10x easier for LLM researchers to get legal tokens. You can expect 2023’s LLMs to be enormously better at legalese.
5/ This is also why most serious AI companies are now designing their own training chips. You can either pay NVIDIA their 65% gross margins, or have each marginal dollar go ~3x as far on your inevitable billions in capex by using in-house training chips.
11/ We will also see Sponsored Outputs for language models. Advertisers will be able to pay to condition model outputs on their products. Significant research effort will one day go into v2 AdWords, now paying for likelihood that your ad is generated instead of search placement.
One weird trick for better neural nets: 𝗽𝗼𝘀𝗶𝘁𝗶𝗼𝗻𝗮𝗹 𝗲𝗻𝗰𝗼𝗱𝗶𝗻𝗴𝘀. To model high freq data, instead of scalar input x, use [sin(πx), cos(πx), ... sin((2^N)πx), cos((2^N)πx)]. Used in Transformers and now neural rendering[1]. What a diff.
[1]
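The encoding in the tweet can be written in a few lines of NumPy. A minimal sketch (the function name and the ordering of the sin/cos components are my own choices; downstream layers don't care about component order):

```python
import numpy as np

def positional_encoding(x: np.ndarray, num_freqs: int) -> np.ndarray:
    """Map scalars to [sin(pi x), cos(pi x), ..., sin(2^N pi x), cos(2^N pi x)].

    x: shape (batch,). Returns shape (batch, 2 * (num_freqs + 1)).
    Higher frequencies let an MLP fit high-frequency detail in x.
    """
    freqs = 2.0 ** np.arange(num_freqs + 1)       # 1, 2, 4, ..., 2^N
    angles = np.pi * x[:, None] * freqs[None, :]  # (batch, num_freqs + 1)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

Feeding `positional_encoding(x, N)` to the network instead of raw `x` is the whole trick.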
New blog post: How Much Better is OpenAI's Newest GPT-3 Model?
We evaluate davinci-003 across a range of classification, summarization, and generation tasks using Scale Spellbook🪄, the platform for LLM apps. Some highlights: 🧵
4/ The most serious Compute Rich companies will aggressively secure their compute supply chains: their access to chips. Like how Tesla is buying lithium mining rights, the Compute Rich companies must also ensure they can feed the ever growing hunger of their largest models.
9/ Web properties with user-generated content will change their licensing terms to demand royalties when their data is used to train AI models. StackOverflow is valuable, but why would you visit it when your editor already knows the answer to your question?
LLMs will work way better once they're trained explicitly to attend to an external knowledge base not seen at training time, w/out fine-tuning. Memorizing Transformers and RETRO require the knowledge base at training time. RETROfit needs fine-tuning. Is anyone working on this?
Excited to share what I’ve been working on lately: Scale Spellbook — the platform for large language model apps! Some fun things I learned about LLMs while building this product: 🧵
Large language models are magical. But using them in production has been tricky, until now.
I’m excited to share ✨Spellbook🪄✨— the platform for LLM apps from
@scale_AI
. 🧵
1/ Fun fact: if you sort your model’s false positive predictions by confidence, the top results will almost always be errors in your dataset's labels. Here are some “false positive predictions” according to the labels in Berkeley DeepDrive:
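The sort-by-confidence trick is a one-liner worth having around. A hedged sketch (the helper name and its exact interface are hypothetical, not from the tweet):

```python
import numpy as np

def suspected_label_errors(scores, preds, labels, top_k=50):
    """Rank "false positives" by model confidence; the most confident
    disagreements are usually mislabeled examples, not model errors.

    scores: (n,) model confidence for its predicted class
    preds:  (n,) predicted labels
    labels: (n,) dataset labels
    Returns indices of the top_k most suspicious examples.
    """
    fp = np.where(preds != labels)[0]   # model disagrees with the label
    order = np.argsort(-scores[fp])     # most confident disagreement first
    return fp[order][:top_k]            # review these by hand
```

Running this after every evaluation and eyeballing the top results is a cheap way to audit dataset quality.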
The Token Hunters. At the frontier of LLM research, they scour the world for precious new tokens to add to the training set. They’ve already scraped the internet. They will scan books, transcribe songs, hire writers, make deals for fresh token supply. The drivers of AI progress.
1/ It pays to be paranoid. Bugs can take so long to find that it’s best to be really careful as you go. Add breakpoints to sanity check numpy tensors while you're coding; add visualizations just before your forward pass (it must be right before! otherwise errors will slip in).
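The "sanity check your tensors as you go" habit can be packaged into a helper you call right before the forward pass. A sketch under my own assumptions (the function and its thresholds are illustrative, not from the thread):

```python
import numpy as np

def sanity_check(name: str, x: np.ndarray) -> np.ndarray:
    """Cheap paranoia to run right before the forward pass."""
    assert np.isfinite(x).all(), f"{name}: contains NaN or inf"
    assert x.size > 0, f"{name}: empty tensor"
    # Catch a silently un-normalized image batch, a classic bug.
    if x.dtype == np.float32 and x.max() > 100:
        print(f"warning: {name} looks un-normalized (max={x.max():.1f})")
    return x

batch = sanity_check("images", np.random.rand(8, 3, 32, 32).astype(np.float32))
```

Because it returns its input, it can be dropped inline into an existing pipeline without restructuring anything.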
“Realistic Evaluation of Deep Semi-Supervised Learning Algorithms” — turns out that many SSL papers undertune their baselines! With equal hyperparam budget and fair validation set size, SSL gains are often smaller than claimed. #ICLR2018
@ESYudkowsky @rmcentush This play-money market has very little depth at the moment. As I write this, a single trade from a new account's free starting balance can swing the market more than 20 absolute percentage points.
What makes large companies have a strategic advantage in machine learning research & products?
Answer for the last five years: data
Next five years: compute
More & more DL research is becoming inaccessible without Google-scale datacenters.
4/ That last one is super powerful btw. It turns out to be way more impactful to make your synthetic *features* realistic than to make your synthetic *data* realistic. One older but good paper on this:
2/ It's not enough to be paranoid about code. The majority of issues are actually with the dataset. If you're lucky, the issue is so flagrant that you know something must be wrong after model training or evaluation. But most of the time you won't even notice.
(1/3) Writing great data labeling instructions is harder than programming. In programming, the edge cases are finite. In labeling, the edge cases are infinite and often completely unimaginable.
1/ Excited to share that Helia has been acquired by @scale_AI! Scale’s mission—to accelerate the development of AI applications—is an exceptional fit with Helia’s ML infrastructure tech and expertise. I’m so proud of team Helia for what we’ve accomplished in such a short time.
So stoked to be joining the @tesla autopilot team full-time in April! After looking at a dozen co's in the self-driving space, convinced Tesla is the place to be. Insane team, cars, and massive fleet learning advantage. Plus, some truly wild ideas from @karpathy & @elonmusk :)
7/ Bugs are fixed faster when iteration times are faster. Impose a hard ceiling on model training time, even if you could squeeze more gain by training a bit longer. In the long run, experiment velocity >> performance of one model.
4/ You can unit test ML models, but it's different from unit testing code. To prevent bugs from re-occurring, you have to curate scenarios of interest, then turn them into many small test sets ("unit tests") instead of one large one.
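The "many small test sets" idea can be sketched as a tiny harness. This is my own hedged illustration of the pattern (the function name, the per-scenario accuracy threshold, and the dict interface are all assumptions):

```python
import numpy as np

def run_model_unit_tests(model_fn, unit_tests, min_accuracy=0.95):
    """Evaluate a model on many small curated test sets, so a regression
    on one rare scenario can't hide behind a good aggregate metric.

    unit_tests: dict mapping scenario name -> (inputs, expected_labels)
    Returns {scenario: accuracy} for every failing scenario.
    """
    failures = {}
    for name, (inputs, expected) in unit_tests.items():
        preds = model_fn(inputs)
        acc = float(np.mean(preds == expected))
        if acc < min_accuracy:
            failures[name] = acc
    return failures  # empty dict means every scenario passes
```

Each curated scenario ("night-time pedestrians", "occluded signs", ...) becomes one entry, and CI can block a model release whenever the returned dict is non-empty.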
1/ It’s increasingly clear that language-aligned datasets are the rate limiter for AI progress in many areas. We see incredible results in text-to-image generation and image captioning in large part because the internet provides massive language<>image supervision for free.
The @krea_ai team is building the Game of Life, where each alive cell is a whimsical happy Stable Diffusion image and each dead cell is an eerie, dark Stable Diffusion image, all of which evolve over time. Built on a generative AI version of Canva they made.
Next gen voice assistant with Whisper for transcription and LLMs doing document retrieval + question answering. @mov_axbx brought his GPU workstation for extra-low-latency edge inference 🔥
4/ Imagine if AlphaZero, beyond telling you the best chess move for a position, could also explain in English *why* that move is best, and why the more natural-seeming move is wrong. With linguistic reference to patterns that are human-interpretable.
5/ Training strategies are also better understood. If you have lots of real data and are using synthetic to cover edge cases, train jointly. With very little real data, it’s best to pre-train on synthetic and fine-tune on real. Synthetic works best as a complement to real.
6/ When someone asks “why do you think that?”, you can’t articulate the true reason: the firing patterns of billions of neurons in your head that led you to your result. You project this complexity down into a low-dimensional language description.
5/ Of course, such an explanation won’t be perfectly accurate. Any explanation in language is a low-dimensional projection of what’s really happening inside AlphaZero’s torrent of matrix multiplies. But the same is true when we use language to describe our own thought processes.
3/ Software actions, work tasks, healthcare, economic data, games… think about all the domains where we do *not* have this language-aligned training data, and what would be possible if we created that data.
3/ Today, the toolbox for overcoming the reality gap is large. Beyond better rendering, we have GANs for domain adaptation (e.g. CycleGAN), domain randomization (popular in RL), adversarial losses on neural net activations to force synthetic / real feature invariance, and so on.
Fixing a cloud virtual machine bricked by the CrowdStrike outage is an ideal job for AI agents like Devin.
Nobody wants to do it, but it needs to be done — millions of times. Delegating these tasks to AI frees up engineers to do more interesting work.
Here’s Devin’s fix:
1/ Context: synthetic data has matured drastically in the past 1-2 years. It’s gone from a research niche to a production dependency of many large-scale ML pipelines, especially in computer vision.
Imagining a world of interacting LLMs that depend on others' specific embedding spaces. To upgrade one, you will have to upgrade the versions of all dependencies, or vector spaces will be incompatible. Dependency hell for LLMs incoming!
6/ Without model unit tests, you will see aggregate metrics improving in your evaluation but introduce critical regressions when you actually ship the model. Unit tests are a requirement to durably fix bugs in ML-powered products.
9/ (This is also why Scale is starting a dedicated effort to create or collect language-aligned datasets for new problem domains. If you want help collecting language alignment data for your domain, reach out: language-models@scale.com)
"Data network effects" claimed by AI startups are overhyped. Unlike most network effects, e.g. social networks where value is O(n^2) in # users, value of more data grows only O(log n). See:
Some applied deep learning principles from the Autopilot team, presented at @Pytorch dev summit:
- Every edit to the dataset is a new commit. Track carefully
- Curate hard test sets; random isn’t good enough
- Train all tasks jointly from scratch whenever possible