Super super happy to be able to talk about DIDACT, the first code LLM trained to model real software developers editing code, fixing builds, and doing code review end-to-end.
Developers don't write code in one go and neither should our models! 1/n
We've finally put out a detailed IEEE/ACM paper on
@Google
's multi-year effort to ease the burden of code review with ML. Google engineers now resolve 7.5% of all code review comments with an ML-suggested edit. But the path to that number has been a fun ML and UX journey!
Excited to see a blog post on one of the coolest projects I've worked on at Google: using LLMs to automatically resolve code-review comments for Google engineers! 1/n
This is something I've worked on for a while! You can save the activations of one LLM call and reuse them for a follow-up that overlaps with the first.
This means asking a question about a big codebase can take 30 seconds the first time and 1s after that!
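The idea in toy form (all names hypothetical, nothing here is a production API): pay the full forward pass once for the big shared prefix, and let follow-up queries that overlap reuse the cached activations, only processing their new tokens.

```python
# Toy sketch of prefix/KV caching. The "activations" here are a stand-in;
# a real system would store attention key/value tensors keyed by the prefix.
class PrefixCachingLM:
    def __init__(self):
        self._kv_cache = {}  # prefix text -> cached "activations"

    def _encode(self, text):
        # Stand-in for the expensive forward pass over `text`.
        return {"n_tokens_processed": len(text.split())}

    def generate(self, prefix, suffix):
        if prefix in self._kv_cache:
            # Cache hit: only the new suffix needs a forward pass.
            cost = len(suffix.split())
        else:
            self._kv_cache[prefix] = self._encode(prefix)
            cost = len(prefix.split()) + len(suffix.split())
        return cost  # tokens actually processed, a proxy for latency

lm = PrefixCachingLM()
codebase = " ".join(["token"] * 10_000)  # big shared context, e.g. a repo
first = lm.generate(codebase, "Where is the auth logic?")
second = lm.generate(codebase, "And where are its tests?")
```

The second call costs a tiny fraction of the first, which is where the 30s-to-1s speedup comes from.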
the hardest thing about being an AI researcher is having to smell homeless people every morning while munching a tartine croissant outside your $4k house on the way to work
Our new paper! We study how well large language models (244M-137B parameters) can write code, collaborate with humans via dialog (exciting!) and understand/execute the code they write (they don't/can't).
TLDR: exciting tech with lots of limitations and room for future work.
@jacobandreas
@_jasonwei
We found that code models get better when you prompt them with "I'm an expert Python programmer". The new Anthropic paper did something similar, prefixing the model's response with "I’ve tested this function myself so I know that it’s correct:"
Happy to share our work on discrete denoising diffusion models (D3PMs)
@NeurIPSConf
2021: . D3PMs are diffusion models for discrete data like text or (quantized) images, and they’re flexible! A thread (with code!) 1/n
This may be the most magical new developer tool we've made at Google. Nothing since code completion has felt so seamless to use: devs paste code constantly, and Smart Paste instantly fixes all the little issues: syntax errors, misnamed variables, indentation, and more 1/2
Code development often involves frequent copy & pasting of code that must be adjusted for the surrounding context. Here we describe Smart Paste, an internal tool that streamlines the code authoring workflow by automating adjustments to pasted code. More at
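One tiny slice of the problem, as a sketch (indentation only; the real tool uses a learned model and also handles renames, syntax errors, and more):

```python
# Toy illustration of one Smart Paste-style adjustment: re-indent pasted code
# to match the surrounding context. Purely illustrative, not the actual tool.
def reindent_paste(pasted, target_indent):
    lines = pasted.splitlines()
    # Strip the clipboard's original leading indent, then apply the target's.
    base = len(lines[0]) - len(lines[0].lstrip())
    return "\n".join(target_indent + line[base:] for line in lines)

pasted = "if ok:\n    do_thing()"
fixed = reindent_paste(pasted, " " * 8)
```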
Read about our recent work on ML-powered code completion models trained on the
@Google
codebase. A small but specialized LM trained on extremely high-quality data and backed by static analysis beats much larger models in production.
Learn more about how code completion is transforming the developer experience of internal
@Google
engineers! 👩‍💻
We measured an acceptance rate of 25-34% on >3% of production code, while reducing the coding iteration time by 6% (equating to hundreds of years of SWE hours saved).
GPT-4 makes big gains on coding (e.g. 48% -> 67% on HumanEval) but it's still a long way from 100% pass@1, not to mention writing a 1000-line program from scratch.
GPT-4 shows that scale won't solve everything. Models need to write and debug code iteratively, like humans do
Gemini 1.5 Pro is widely available now. Long context is great but it's also just a great model, better than GPT-4 on most of our metrics. And it's free!
We're starting to roll out API support for Gemini 1.5 Pro for developers. We're excited to see what you build with the 1M token context window!
We'll be onboarding people to the API slowly at first, and then we'll ramp it up. In the meantime, developers can try out Gemini 1.5
Full details are in our blog post here: . This was the culmination of years of work from
@dtarlow2
, Petros Maniatis, and a bunch of colleagues across Google. Please take a look!
I won’t be at ICLR this year, but it’s the 200th anniversary of the premiere of Beethoven’s 9th in Vienna and you should go! The Vienna Philharmonic and many other orchestras have concerts!
The Blueshift team has done awesome work pushing Hendrycks' MATH above 90%. MATH isn't the hardest dataset in the world but it's surprisingly tricky: some problems take me 5-10 minutes to solve. Getting an LLM to solve more than 90% feels meaningful. Try one yourself!
I'm excited about this! Our team has been working really hard to improve Gemini 1.5 capabilities significantly on multiple fronts and in particular MATH/STEM! Please see the report here:
New capabilities in Bard will help programmers and software developers with code generation, debugging and code explanation. It’s an exciting next step in how generative AI can accelerate innovation across industries.
@RichardMCNgo
I find many of these questions exhausting. I don't want to psychoanalyze, for a stranger at 3AM after a few beers, what it is about me that surprises people. Ask me 1:1 when it's appropriate.
One thing I'm proud of is how Google's gen media team has prioritized building tools for artists rather than text-to-X tools. GenAI can either replace or augment people, let's do the latter!
We put our cutting-edge video generation model Veo in the hands of filmmaker
@DonaldGlover
and his creative studio, Gilga.
Let’s take a look. ↓
#GoogleIO
FWIW I think this is how you make long-context economical. Long queries aren't all unique; they typically share the same source documents. Low-latency, low-cost full-repo completion can reuse the same KV caches
New study! We compared ChatGPT responses to people's medical questions with those of doctors. Healthcare professionals preferred ChatGPT 79% of the time, rating it as more empathetic and higher quality.
I'm excited to figure out how to use LLMs to help doctors!
*cracks knuckles*
and thus, we begin the "🌴PaLM v2" drinking game (but with coffee, tea, or your favorite caffeinated beverage of choice, as it's early! 😉)
#GoogleIO2023
#GoogleIO
Codex-style LLMs are trained on static code snapshots (GitHub files at HEAD) without history or context from the developer's environment (like their IDE or build system). We're throwing away all the data of how the software was built, and why! 2/n
UL2 is a new training objective with big implications for LLM training. UL2 combines the span corruption objective that gives T5 its exceptional finetuning ability with causal and prefix-LM objectives, which let UL2-trained LLMs outperform purely-causal LMs on few-shot tasks
Introducing UL2, a novel language pre-training paradigm that improves performance of language models across datasets and setups by using a mixture of training objectives, each with different configurations. Read more and grab model checkpoints at
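A toy sketch of the mixture-of-objectives idea (hedged: the real mixture rates, span lengths, and sentinel format follow the UL2 paper, not this):

```python
# Three denoising objectives over one token sequence; UL2 trains on a mixture.
import random

SENTINEL = "<X>"

def span_corruption(tokens, span_len=3, rng=random):
    # T5-style: mask a contiguous span; input keeps a sentinel, target is the span.
    start = rng.randrange(0, max(1, len(tokens) - span_len))
    inputs = tokens[:start] + [SENTINEL] + tokens[start + span_len:]
    targets = [SENTINEL] + tokens[start:start + span_len]
    return inputs, targets

def prefix_lm(tokens, rng=random):
    # Prefix-LM: split at a random point; predict the suffix from the prefix.
    split = rng.randrange(1, len(tokens))
    return tokens[:split], tokens[split:]

def causal_lm(tokens, rng=random):
    # Plain left-to-right: empty input, predict everything.
    return [], tokens

def ul2_example(tokens, rng=random):
    objective = rng.choice([span_corruption, prefix_lm, causal_lm])
    return objective(tokens, rng)
```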
Google developers work in a monorepo and build errors, test failures, code review comments, and resulting edits are all tracked. DIDACT models are trained on this data to build software iteratively *based on the history of a dev's work so far!* 3/n
There's so much hype around "LLMs as agents" and when building LLMs for software, i think that's exactly the right approach. Our LLMs can build software like humans, iteratively and using developer tools, and be immediately useful for real developers! 5/n
DIDACT powers a ton of cool dev tools, like our recently announced ML-powered code review tool and a bunch of others, like a tool to fix build errors, predict code review comments, and do GitHub Copilot-style completion conditioned on _your_ development history! 4/n
@EigenGender
This is absolutely not true. They could test the explosive design, the subcritical assembly, and the gun design. They could detonate the explosives and watch fast X-ray data. And then there was the Trinity test
Penzai is one of the coolest ML libraries out there. Not only can you inspect every weight matrix and attention head in a Colab, you can trivially knock out heads, skip or repeat layers, or extract intermediates with a one line change. A beautiful tool for interpretability.
Excited to share Penzai, a JAX research toolkit from
@GoogleDeepMind
for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere.
Check it out on GitHub:
@denny_zhou
If true, this highlights one of the complexities of the half-open OpenAI/GPT-3 ecosystem. I'm a fan of the API, but it's v hard to know what DaVinci-002 is, whether it had a given eval set in its training data, etc.
Code LLMs are everywhere, but making them useful to real developers is hard. We trained an LLM on data from _real_ Google developers: fixing builds, performing code review, and editing files, then deployed it within the code-review UI! 2/n
More work from Google on AI for SWE, here automatically fixing build errors! The cool thing about fixing builds is you can check if the build succeeds before showing the user the fix. Results in a measurable shortening of code submission time too!
Excited to share a new blog on ML-based repair for build errors at Google!
We found that automatically repairing build errors in the IDE increases productivity as measured by overall task completion with no detectable negative impact on code safety!
1/ In 2021, we shared next-gen language + conversation capabilities powered by our Language Model for Dialogue Applications (LaMDA). Coming soon: Bard, a new experimental conversational
#GoogleAI
service powered by LaMDA.
Happy to share our work on multilingual evals for code LLMs, led by
@GOrlanski
. We open-source BabelCode, a framework for running execution-based coding evals across >10 languages (including Rust and Julia) and study the effect of language balancing on low-resource languages 1/2
📢Measuring The Impact Of Programming Language Distribution
We present the BabelCode framework for multi-lingual code evaluation and an investigation into the impact of PL distributions in training data.
Paper:
Code:
🧵
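The core of execution-based eval, in miniature (illustrative only, not BabelCode's actual harness, which sandboxes and supports >10 languages):

```python
# A candidate solution "passes" iff it runs correctly against the test cases.
def run_candidate(candidate_src, test_cases):
    namespace = {}
    exec(candidate_src, namespace)  # trusted toy setting only; real harnesses sandbox!
    fn = namespace["solution"]
    return all(fn(*args) == expected for args, expected in test_cases)

candidate = "def solution(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
assert run_candidate(candidate, tests)
```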
A couple lessons from this:
* IDE wars are coming. Collecting data in the same dev environment you deploy in is a huge advantage.
* LLMs make great demos but it's hard to trust them at complex tasks. Reviewing code is harder than writing it. High-precision, low-recall is OK!
A huge amount of credit goes to the UX team for helping us make model edits understandable, so developers can audit the code being changed. Model calibration also became surprisingly important: we build developer trust by only showing highly confident predictions
i found Oppenheimer, like most of Christopher Nolan’s movies, lacking in emotional resonance. Nolan seems to make films about concepts that interest him (time, space, a biography he just read), without worrying about their relevance to the present moment
@_jasonwei
Cost is an important drawback: generalist models will always be outperformed by smaller task-specific models when cost and latency are factored in, except for tasks only the largest models can do. With that said, distillation is likely to play a role
2290 tons of CO2 is a lot, but it's also roughly...38 flights from NYC to London on a 737. More CO2 was probably emitted by Meta employees flying back and forth during model development
So LLaMa 3's carbon footprint is... huge? 🤯
They estimate it to be 2,290 tons of CO2eq, compared to 550t for training GPT-3 and 66t for training *all* of the BLOOM models (1B-176B) 🌬️
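The back-of-the-envelope version (hedged: ~60 t CO2 for one full transatlantic flight is a rough public estimate, not an exact figure):

```python
# Sanity-check the "~38 flights" comparison above.
llama3_tco2 = 2290       # reported training footprint, tonnes CO2eq
per_flight_tco2 = 60     # rough total for one NYC -> London flight
flights = llama3_tco2 / per_flight_tco2
```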
Interested in Reasoning with Large Language Models?
We are hiring!
Internship:
Full-Time Research Scientist:
Full-Time Research Engineer:
Learn more about Blueshift Team:
Returning from NeurIPS, I flew an hour the wrong way to Fort Worth, and then missed my flight to NYC. Now I get to experience the cozy embrace of this hard airport floor
The Gemini era is here. Thrilled to launch Gemini 1.0, our most capable & general AI model. Built to be natively multimodal, it can understand many types of info. Efficient & flexible, it comes in 3 sizes, each best-in-class & optimized for different uses
You can find the paper here: . I think it's an awesome case study in applied LLM deployment. Huge shoutout to Peter Choy, Alex Frömmgen,
@lerakharatyan
,
@gssurita
, Kevin Villela,
@dtarlow2
, Maxim Tabachnyk, really too many people to list!
Bard is now available in the US and UK, w/more countries to come. It’s great to see early
@GoogleAI
work reflected in it—advances in sequence learning, large neural nets, Transformers, responsible AI techniques, dialog systems & more.
You can try it at
@TheXeophon
Yes, we have a DSL that decomposes the process of writing a PR into actions like "<run build [target]>" or "<make edit [location] [diff]>". The goal is to represent any action a developer could take as a small, local change, instead of making the LLM somehow output a big file
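A hedged sketch of that action-DSL idea (names and serialization format are illustrative, not DIDACT's actual one): a developer's work becomes a sequence of small, local actions the model predicts one at a time, rather than one big file dump.

```python
from dataclasses import dataclass

@dataclass
class RunBuild:
    target: str            # e.g. "//server:all"

@dataclass
class MakeEdit:
    location: str          # e.g. "server/auth.py:42"
    diff: str              # a small local diff, not a whole file

def render(action):
    # Serialize an action into a tagged-token form an LM could be trained on.
    if isinstance(action, RunBuild):
        return f"<run build {action.target}>"
    if isinstance(action, MakeEdit):
        return f"<make edit {action.location} {action.diff}>"
    raise TypeError(action)

history = [
    MakeEdit("server/auth.py:42", "-return None\n+return token"),
    RunBuild("//server:all"),
]
prompt = "\n".join(render(a) for a in history)
```

Conditioning on this rendered history is what lets the model predict the next small action instead of regenerating whole files.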
To be clear, I don't mean the "scale won't solve everything" line as a criticism of scaling. I just find it implausible that LLMs can solve arbitrary problems without decomposing them or adapting to feedback from an environment
Your new coding assistant is almost here! Check out these new Colab features: natural language to code generation, code completion, and an integrated chatbot. Read all about it in the post authored by
@thechrisperry
and
@shresbm
@DrJimFan
Big +1 here. The model is implicitly trained on a mixture of p(answer | evidence) and p(answer), so it interpolates between memorizing and looking for answers in-context (see )
@lauralondon_
@moultano
Desalination plants can't prevent flooding when sea-levels rise several meters due to Antarctic ice sheets melting. Burying power lines will reduce wildfire frequency at massive cost, but it won't stop them when rising temperatures lead to ever more arid conditions.
it’s frightening walking around Williamsburg hearing tech grifters talk about their “AI for media” startups. it feels better to work upstream of that, on core tech, but it’s not obvious if my hands are cleaner
@natfriedman
Is this toolformer? Toolformer seems specifically about using prompting + log-likelihood based filtering to enable tool use. The idea of tool use in this form has been around for years
Another aspect of this work to note: it (partly) solves the "specification" problem of program synthesis: how do we tell the computer what code we want it to write?
TLDR: rather than tell a model what to do, let it learn from context what you'll want to do next. A thread 1/n
Very happy to share our work on activating Google's software dev process as an engine for ML-powered dev tools.
A multi-year effort from many across Alphabet. Special shout-out to
@jacobaustin132
@blip42
@PManzagol
@dancherp
& Petros Maniatis.
See Jacob's🧵& the blog for more.
Smart Paste highlights the core UX challenge of AI for SWEs. The more context switching is required to verify a suggestion, the less useful it is. Tools like code completion and Smart Paste that make suggestions at the cursor and are instantly verifiable are the easiest to adopt
@_jasonwei
Character can make money without "getting something right". As you point out, exploiting loneliness/insecurity is lucrative. The fact that Character shamelessly monetizes a desire for connection (where OAI/Anthro refused) speaks badly, ironically, of their character
We first talked about this project in mid-2022 in a
@GoogleAI
blog post (here's a thread at the time: ), but this paper talks in much more detail about the model and the design process we went through.
Excited to see a blog post on one of the coolest projects I've worked on at Google: using LLMs to automatically resolve code-review comments for Google engineers! 1/n
I loved people like Anthony Bourdain for this reason. You can see him grappling with both the beauty and horror of his life and his art
I wish the AI world had more of this. We cannot know if what we make is good, no matter how well-intentioned we are
To grad school applicants: the single best advice I got was that you’re generally admitted by a single faculty member who’ll bet on you, not by the department. Pick a few people and target your application to them
@docmilanfar
@jaschasd
Strongly agree, I still find this one of the clearest explanations of dynamical systems and stochastic processes, it's quite a joy to read
Today,
@scale_AI
is launching our 2 major platforms to bolster government and enterprise:
🎖 Scale Donovan, the AI copilot for defense
🏙 Scale EGP, full-stack generative AI for global enterprise
👇 See Donovan in action below
🧵 on our platforms and why they are so critical
@urialon1
@_jasonwei
Reminds me of the Python-GSM8K results from the PaLM paper or MathQA-Python. Cool to see that intermediate natural language instructions are helpful!
I think rather soon, these models will be helpful for scientists and mathematicians. An LLM doesn't have to do super advanced math to be useful, there's value (to me at least) in instantly proving little lemmas that help keep you in a flow state. More to come!
i see a lot of people calling this "goodharting" but it's sort of not goodharting, it's just leaking the test set
esp. as existing evals are translated into more languages, removing them becomes increasingly hard
How much do LLMs overfit public benchmarks? Our team at
@scale_ai
SEAL lab studied this by creating a GSM8k-equivalent eval from scratch. The resulting performance gap reveals data contamination in some model families, while GPT, Claude, and Gemini show no signs of overfitting.
Which university has the best graduates?
A new paper using an earnings-based measure of graduate quality (qⱼ) provided the answer: the top of the list is dominated by Indian universities.
What about Harvard? Rank
#26
.
Two weeks in London and I managed to make it to Wigmore Hall twice, for
@jeremydenk
playing the Bach Partitas and tonight for the Handel Players. Wigmore Hall is special, like the 92Y in NYC: small, with fantastic acoustics, intimate in the best sense.
The culmination of this week's musical mini-binge – 2nd concert today
@wigmore_hall
– felt somehow fitting after so much marvellous stuff both old and very new:
@jeremydenk
performing (the entire session from memory!) all Bach's Partitas. Magic.
Our first model had a bunch of bad habits: it made low-confidence suggestions, addressed unrelated issues, and wasn't very visible to the change author. To fix this, we improved data quality, filtered for single-comment reviews, filtered by confidence, and added synthetic data.
@amanrsanger
Training smaller models on a single language alone (e.g. Python-only) can match the performance of Codex at smaller sizes on single language evals. The open source world can't match Codex without huge investment, but there are shortcuts!
Our first version was a lightly-finetuned version of Google's software engineering foundation model DIDACT, and made very plausible suggestions. But people didn't trust it: there's a big difference between a plausible edit and what the developer really wants
I’m excited to announce our new company, MatX, started with
@MikeGunter_
. We want to make AI better, faster, and cheaper by building more powerful hardware. Read on for a short introduction, or see our full announcement here: .
@Ted_Underwood
Over-training + instruction-tuning. As
@moultano
says, OpenAI can e.g. train a 12B model for 10x the "Chinchilla-optimal" compute budget and end up with the same loss as a 10x larger model trained for less time 1/2
All in all, we ended up improving user trust in the model and addressing around 7.5% of all code review comments at Google with an ML-suggested edit, all while keeping precision high (usually around 50%) to avoid wasting engineering time!
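The high-precision, low-recall gating amounts to something very simple at serving time (threshold value is illustrative; the real bar is tuned against measured acceptance rates):

```python
# Only surface an ML-suggested edit when the model's own score clears a bar
# tuned so that roughly half of shown suggestions get accepted (~50% precision).
def should_show(suggestion_score, threshold=0.8):
    return suggestion_score >= threshold

candidates = [("fix typo", 0.95), ("rewrite function", 0.40), ("rename var", 0.85)]
shown = [text for text, score in candidates if should_show(score)]
```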
The lesson of language models for me is that noise generation with language is painfully easy. You have to look at what you write and say “does this say anything new? Could GPT-3 have written this?”
@DynamicWebPaige
@DavidSacks
Fulfilling a request (in this case, to write a slogan) isn’t necessarily at odds with political neutrality in its own answers?
As scaling LLMs becomes harder, performance gains come more and more from clever prompting, bootstrapping, and chaining multiple LLMs together. Cascades is a PPL that makes inference & optimization on chained language models easy!
Happy to release our work on Language Model Cascades. Read on to learn how we can unify existing methods for interacting models (scratchpad/chain of thought, verifiers, tool-use, …) in the language of probabilistic programming.
paper:
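One simple instance of the kind of chain a cascades PPL generalizes, as a toy sketch (names and logic illustrative): a scratchpad model proposes answers, a verifier scores them, and we keep the best sample.

```python
def scratchpad_model(question, i):
    # Stand-in for sampling a chain of thought + answer from an LLM;
    # here it just cycles through candidate answers deterministically.
    answer = [3, 4, 5][i % 3]
    return f"{question} think step by step -> {answer}", answer

def verifier(reasoning, answer):
    # Stand-in for a learned verifier; here a hard-coded correctness check.
    return 1.0 if answer == 4 else 0.0

def cascade(question, n_samples=6):
    # Sample-then-rank: one of the interaction patterns the paper unifies.
    samples = [scratchpad_model(question, i) for i in range(n_samples)]
    return max(samples, key=lambda s: verifier(*s))

reasoning, best = cascade("What is 2+2?")
```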
idea: add a carbon offset option to every gas station and airline checkout webpage.
even if only 1% of people pay an extra 20% on gas, it creates a market for carbon offsets and puts the idea front and center in people’s minds
@jxmnop
@polynoamial
I think phd students have a pretty great opportunity to publish general-purpose ideas that industry can't publish right now: write a great paper on data selection, length generalization, self-improvement, RL, etc. and include clear scaling laws up to 1B and everyone'll love it
@ben11kehoe
@forrestbrazeal
For now, it's up to the author to approve the change, and yes, then the reviewer needs to re-approve (which they'd normally have to do anyway after the author addressed a comment). We're working on the "pre-approve" UX now, so they can flag that the ML edit is right
@_jasonwei
Why is emergence a useful thing to think about? Is there reason to think "emergence" is anything more than "log likelihood dropping below some critical threshold" (i.e. a function of model quality, not of size or compute)?
@RichardMCNgo
A counterargument (which you've made yourself) is that optimal strategies in primitive or partially-observed environments may not be optimal today, e.g. avoiding pork because of disease or stoning women for adultery in a society that functions without monogamy.
@nearcyan
The alignment crowd has tried to push the term as broadly as possible. Now they reap the rewards.
But LLMs are far more likely to harm society by undermining our notions of truth and creativity than by killing us all
Here's what the UI looks like for the reviewer. The ML suggested edit auto-updates in the code review UI as the reviewer is typing, and they can try to more clearly specify their intent in the comment to guide the ML model!
@andy_matuschak
Having taken piano lessons for 15 years, I think it's just because it's hard to fit 15 pianos in a room and impossible for them to play at the same time. We do group music lessons for elementary students, but it's mostly chaos. At least you can do math silently
1/n on classical music
yesterday i heard the Pavel Haas quartet playing the Brahms A Major piano quartet at Wigmore Hall with Boris Giltburg. it's the second of Brahms' 3 piano quartets and my favorite. it's tragic and warm, rich, very full, like a Mahler symphony