LLMs are bad at returning code in JSON: they write worse code when it has to be wrapped in structured JSON responses, even with gpt-4o-2024-08-06's strict JSON mode.
LLMs make more syntax errors and also seem overwhelmed by the burden of JSON formatting.
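As a rough sketch of the two request styles being compared (the schema name and fields here are illustrative, not the actual benchmark harness):
from openai import OpenAI

client = OpenAI()
task = "Write a Python function that reverses a string."

# Plain request: the model returns code as ordinary markdown text.
plain = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": task}],
)

# Structured request: the same code must be escaped into a JSON string field.
structured = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": task}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "code_answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
                "additionalProperties": False,
            },
        },
    },
)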
Reflection 70B scored 42% on the aider code editing benchmark, well below Llama3 70B at 49%.
I modified aider to ignore the <thinking/reflection> tags. This model won't work properly with the released aider.
Aider v0.52.0 can suggest & run shell commands:
- Launch a browser to view html/css/js.
- Install dependencies.
- Run DB migrations.
- Run new tests.
- Move files & dirs.
- Etc.
Aider wrote 68% of the code in this release.
Full change log:
Aider v0.51.0
- Prompt caching for Anthropic models with --cache-prompts.
- Repo map speedups in large/mono repos.
- Improved Jupyter Notebook .ipynb file editing.
- Aider wrote 56% of the code in this release.
Full change log:
The new chatgpt-4o-latest is a bit worse at code editing than the prior 4o models. This continues the trend that each OpenAI update within a model family tends to be a bit worse than the last.
OpenAI o1-preview is SOTA at 79.7% on the aider code editing leaderboard, with "whole" edit format. With the practical "diff" format, it ranks between GPT-4o & Sonnet.
80% o1-preview whole
77% Sonnet diff
75% o1-preview diff
75% Sonnet whole
71% GPT-4o
Recently there has been a lot of speculation that Sonnet has been dumbed-down, nerfed or is otherwise performing worse. Sonnet seems as good as ever, when performing the aider code editing benchmark via the API.
Aider v0.53.0 can keep your prompt cache from expiring. It can ping the Anthropic API every 5 minutes to keep the cache warm, with --cache-keepalive-pings.
Aider wrote 59% of the code in this release.
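A possible invocation combining both caching flags (the ping count is just an example):
aider --cache-prompts --cache-keepalive-pings 12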
Aider v0.50.0
- Infinite output for DeepSeek Coder, Mistral and Anthropic models.
- DeepSeek Coder: 8k token output, new --deepseek shortcut.
- Aider wrote 66% of the code in this release.
Full change log:
First benchmark run of o1-mini has it ~tied with gpt-4o on aider's code editing benchmark.
This article will be updated as additional benchmark runs complete:
DeepSeek Coder V2 0724 is #2 on aider's leaderboard! It can efficiently edit code with SEARCH/REPLACE, unlike the prior version. This unlocks the ability to edit large files. Coder (73%) is close to Sonnet (77%) but 20-50X cheaper!
Aider scored a SOTA 26.3% on the SWE Bench Lite benchmark.
Mainly via existing features for static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming. Without RAG, vector search, LLM tools, web search and other strongly agentic behaviors.
The new gpt-4o-2024-08-06 scores about the same as the original gpt-4o on aider's code editing benchmarks. At half the cost.
77% Sonnet
73% DeepSeek Coder V2 0724
73% GPT-4o
72% GPT-4o (2024-08-06)
Aider is writing more of its own code since the new Sonnet came out.
Release history page now charts how much of each release was written by aider. Generally between 5-30%. Last couple of releases were >40% written by aider.
Aider v0.56.0
- Prompt caching for Sonnet via OpenRouter, by GitHub user fry69.
- 8k output tokens for Sonnet via VertexAI and DeepSeek V2.5.
- Use --chat-language to set spoken language.
- Aider wrote 56% of the code in this release.
Full change log:
Aider v0.57.0
- Support for OpenAI o1 models.
- o1-preview now SOTA with diff edit format, not whole.
- Support for 08-2024 Cohere models.
- Aider wrote 70% of the code in this release.
Full change log:
Gemini 1.5 Pro 0827 performs similarly to Llama 405b on aider's code editing benchmark. Results for the new Pro and Flash, with others for scale:
77% Sonnet
67% Gemini 1.5 Pro 0827
66% Llama 405b
58% GPT 3.5 Turbo 0301
53% Gemini 1.5 Flash 0827
Aider is SOTA on the main SWE Bench, scoring 18.9% vs Devin at 13.9% and AmazonQ at 13.8%. So aider is now SOTA on both SWE Bench & SWE Bench Lite.
Achieved via static code analysis, reliable LLM code editing, auto-fixing lint/test errors; not slow, expensive "agentic" behaviors.
Aider v0.54.0
- New: gemini/gemini-1.5-pro-exp-0827 & gemini/gemini-1.5-flash-exp-0827
- Shell cmds can be interactive, share output with the LLM
- Install latest aider with --upgrade
Aider wrote 64% of the code in this release.
Full change log:
Aider v0.49.0
- Add read-only files with /read, even from outside the git repo.
- Paste images/text into the chat with /clipboard.
- /web shows markdown version of scraped page.
- Aider wrote 61% of the code in this release.
See release notes for more:
Aider v0.48.0
- Perf improvements for large/mono repos, including --subtree-only to focus on current subtree.
- Add images from clipboard with /add-clipboard-image
- Sonnet 8k output.
- Aider wrote 44% of the code in this release.
Sonnet is the opposite of lazy! It's so diligent it writes more code than fits in the 4k output token limit.
Aider now lets Sonnet spread code across multiple API calls. This jumped it from 55% to 64% on the refactoring leaderboard.
New Qwen 2.5 models are on aider's leaderboard:
65% Qwen-2.5-72b-instruct
52% Qwen2.5-coder:7b-instruct-q8_0
The 7b model scored below the 57% reported by the Qwen Coder team, likely due to quantization. Thanks @youknow04 for benchmarking the 7b model.
Aider v0.46.0
- New /ask <question> command to ask about your code, without making any edits.
- New /chat-mode <mode> command to switch chat modes (ask, help, code).
- Aider wrote 45% of the code in this release.
Yi-Coder scored below GPT-3.5 on aider's code editing benchmark. GitHub user cheahjs submitted results for the full 9b model and a q4_0 version. Results, with other models for scale:
77% Sonnet
58% GPT-3.5
54% Yi-Coder-9b-Chat
45% Yi-Coder-9b-Chat-q4_0
Aider v0.47.0
- Now uses "Conventional Commit" messages, customize with --commit-prompt.
- New docker image, includes all extras.
- Better lint flow.
The first release where aider wrote more code than I did (58% vs 42%).
GPT 4o mini scores like the original GPT 3.5 on aider's code editing benchmark (later 3.5s were worse). At first blush, it doesn't seem capable of editing code with diffs, which limits its use to smaller files.
DeepSeek Chat V2.5 scored ~same as Coder V2 on aider's code editing benchmark.
V2.5 did ~same as V2 on the challenging refactor benchmark. Still far behind Opus/Sonnet/GPT-4o there, so probably less useful for real world coding tasks.
Great new aider tutorial from @IndyDevDan showing an excellent workflow for incrementally building non-trivial software. This isn't a snake game or one-shot todo app prototype. Aider works best when you build big things in small steps.
Mistral Large 2 (2407) scored only 60% on aider's code editing benchmark. This puts it just ahead of the best GPT-3.5 model. It doesn't seem able to reliably use search/replace to efficiently edit code.
It's been a busy day of leaderboard updates!
Claude 3.5 Sonnet is now the top ranked model on aider’s code editing leaderboard! DeepSeek Coder V2 took the #1 spot only 4 days ago.
Sonnet ranked #1 with the “whole” editing format. It also scored very well with aider’s “diff” editing format.
Aider's token usage at OpenRouter is up 3X in 3 weeks! Now the #10 app, using 111M tok/day. Aider is their #2 user of Sonnet and #12 of GPT-4o.
Aider users seem to like @OpenRouterAI's high rate limits and good reliability.
Aider v0.55.0 is a major bugfix release.
- Offer to create a GitHub Issue pre-filled with info about any crashes.
- Numerous corner case bug fixes submitted via pre-filled issues.
- Aider wrote 53% of the code in this release.
Full change log:
DeepSeek Chat V2 0628 from @deepseek_ai is #4 on aider's code editing leaderboard! Just ahead of Opus, close behind GPT-4o. It can efficiently edit large files using diffs. Priced lower than GPT-4o mini, but with frontier coding skill!
Aider v0.43.0
Aider now has a built in chatbot that you can ask for help about using aider, customizing settings, troubleshooting, using LLMs, etc. Type "/help <question>" and aider will respond with helpful information.
@deepseek_ai
I can confirm. DeepSeek Coder V2 is now on top of aider's code editing leaderboard, ahead of GPT-4o and Opus. It benchmarked even better for me than the result shown in @deepseek_ai's graph.
Aider v0.44.0
- Default pip install size reduced by 3-12x.
- Added 3 package extras, which aider will offer to install when needed:
- aider-chat[help]
- aider-chat[browser]
- aider-chat[playwright]
- Aider wrote 29% of the code in this release.
The 08-2024 versions of Command-R and Command-R+ perform almost identically on aider's code editing benchmark. They come in above the older version of Command-R+ and just behind our favorite Reflection 70B model.
GPT-4o tops the aider LLM code editing leaderboard at 72.9%, versus 68.4% for Opus. GPT-4o takes second on aider's refactoring leaderboard with 62.9%, versus Opus at 72.3%.
GPT-4o did much better than the 4-turbo models, and seems *much* less lazy.
Aider now has LLM leaderboards that rank popular models according to their ability to edit code. Includes GPT-3.5/4 Turbo, Opus, Sonnet, Gemini 1.5 Pro, Llama 3, Deepseek Coder & Command-R+.
Llama 3.1 405B instruct from @AIatMeta is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.
Aider v0.38.0
- Use --vim for vim keybindings in the chat.
- Add LLM metadata via .aider.models.json file (by caseymcc).
- More detailed error messages on token limit errors.
- Single line commit messages, without the recent chat messages.
Aider v0.41.0
- Allow Claude 3.5 Sonnet to stream back >4k tokens!
- It is the first model capable of writing such large coherent, useful code edits.
- Do large refactors or generate multiple files of new code in one go.
- Aider now uses `claude-3-5-sonnet-20240620` by
Aider v0.40.0
- Improved prompting to discourage Sonnet from wasting tokens emitting unchanging code (#705).
- Improved error info for token limit errors.
- Options to suppress adding "(aider)" to the git author and committer names.
- Use --model-settings-file to customize
Aider v0.39.0
- Use --sonnet for Claude 3.5 Sonnet, which is the top model on aider's LLM code editing leaderboard.
- All AIDER_xxx environment variables can now be set in `.env` (by jpshack-at-palomar).
- Use --llm-history-file to log raw messages sent to the LLM (by
Aider v0.45.0
- Support for GPT 4o mini, using the whole edit format.
- Aider is better at offering to add files to the chat on Windows.
- Aider wrote 42% of the code in this release.
Aider is the #20 agent using OpenRouter this week, with 249M tokens of AI coding goodness! Aider uses @ishaan_jaff's LiteLLM, and is responsible for more than half of LiteLLM's total OpenRouter usage.
Claude 3 beat GPT-4 on aider's code editing benchmark. I’ve been benchmarking the Claude 3 models using Aider’s code editing benchmark suite. Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.
@victormustar
For pragmatic productivity actually I recommend the opposite. Let the AI do the portion it can do. Don't waste time forcing it to do everything. Take over, code past the friction, have the AI start the next chunk of work. This accelerates coders, but doesn't help non-coders.
The code for running aider against SWE Bench Lite is up on GitHub. Here's the ~20 line pseudo-code for the "aider agent" that obtained the SOTA result.
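The repo has the real thing; very roughly, the retry loop looks something like this (helper names and fields here are illustrative pseudo-code, not the actual harness):

def solve_one(problem, models=("gpt-4o", "claude-3-opus"), max_tries=6):
    for attempt in range(max_tries):
        # Alternate models across attempts.
        model = models[attempt % len(models)]
        # Run aider on the repo, using the problem statement as the prompt.
        result = run_aider(problem.repo, problem.statement, model=model)
        # Accept the first attempt that edits files and passes lint + tests.
        if result.edited_files and result.lint_ok and result.tests_ok:
            return result.diff
    return None  # no plausible solution found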
Aider now lints your code after every LLM edit, and offers to automatically fix any errors. It uses tree-sitter to both lint the code and send any errors to the LLM surrounded by code context.
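A minimal sketch of the tree-sitter half of that check, assuming the tree_sitter_languages helper package (aider's real lint flow does more, including running external linters and packaging code context for the LLM):

from tree_sitter_languages import get_parser

def syntax_errors(code, lang="python"):
    parser = get_parser(lang)
    tree = parser.parse(code.encode("utf-8"))
    errors = []
    # Walk the parse tree, collecting ERROR and missing nodes with positions.
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "ERROR" or node.is_missing:
            errors.append((node.start_point, node.end_point))
        stack.extend(node.children)
    return errors

print(syntax_errors("def broken(:\n    pass\n"))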
o1-mini did significantly worse with aider's "diff" edit format, which allows models to efficiently specify changes using search/replace blocks. Most other frontier models do well with the diff format.
Updating here as results complete:
@BorisMPower
made a fair point on the layout of the previous SWE Bench Lite graph. It has been corrected. Also, it now shows apples-to-apples comparisons, using pass@1 from AutoCodeRover and unhinted OpenDevin results.
@AdjectiveAlli
All LLMs have a limit on how many tokens they can output. When they hit the limit, they stop and raise an error. Aider works around this by asking them to continue where they left off in the next message.
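Roughly, that continuation loop can be sketched like this (a simplification; aider's actual resume prompt and edit-format handling are more careful), using LiteLLM for the API call:

import litellm

def chat_unlimited(messages, model="claude-3-5-sonnet-20240620"):
    parts = []
    while True:
        resp = litellm.completion(model=model, messages=messages)
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break  # the model finished on its own
        # Hit the output limit: keep the partial reply, ask the model to resume.
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]
    return "".join(parts)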
OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.
@amirpc
I tuned aider to be more permissive & accept the type of malformed search/replace blocks that o1-preview generates. I do this sort of prompt and editing backend optimization with all the top models. Aider tries to be very permissive about accepting edits from LLMs.
@tom_doerr
Aider tunes the conversation to be cachable. I think it’s a win if you send even just 2 messages. Cache write is 1.25x cost. Hit is 0.1x. So you pay 1.35x instead of 2x cost.
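As rough arithmetic, relative to the normal input cost of the shared prefix:

no_cache   = 1.0 + 1.0    # prefix billed at full price on both requests = 2.0x
with_cache = 1.25 + 0.10  # one cache write, then one cache hit = 1.35x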
@teortaxesTex
The deepseek v2 result was unexpected, yes.
Surprisingly it was able to use the "diff" edit format. Most smaller models are only able to send back "whole" copies of source files with updates included. Being able to send diffs allows edits to large files and saves tokens.
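For illustration, a single aider-style SEARCH/REPLACE edit looks roughly like this (file name and code are made up):

greeting.py
<<<<<<< SEARCH
def greet(name):
    print("Hi " + name)
=======
def greet(name: str) -> None:
    print(f"Hi {name}")
>>>>>>> REPLACE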
@Sanity
You can do this in aider: /git checkout -b branch-name
Many folks have strong feelings about branch strategies, so hard for aider to directly adopt one.
I ran the new `gpt-4-0125-preview` model through aider's laziness benchmark. Even though OpenAI claims it is less lazy, it appears to be lazier than the previous `gpt-4-1106-preview` model.
@anthony_barker
That's a good question!
GPT-3.5 models seem to get worse at code editing with each release. The GPT-4 models did the same, until 4o.
No other model family has enough history to establish a trend, and so many new models capable of code editing have been released recently.
@NickADobos
Maybe. Mini only scored 55% on aider's benchmark, which is underwhelming. Aider supports unlimited output tokens with Sonnet, which has the top benchmark score at 77%. It's my daily driver these days, and is changing how I code with AI.
@NickADobos
Some of those data points use prompt caching. When I added prompt caching to aider, I benchmarked it to make sure it didn't cause a performance regression.
@amanrsanger
Super interesting work! Sounds like you use a strong model (opus/gpt-4o) to generate code changes and a weak model to "apply" them to the file? I've played with this for a ~year, but had concerns:
1. Adding a 2nd inference step adds latency. Your work helps here!
2. Can only
@OfirPress
Copilot is autocomplete. People seem to use "agent" to mean a black box that takes a (hopefully) unambiguous & complete task description; it unilaterally burns time & tokens to (sometimes) solve it. Aider lives in between: a pair programming UX, where you collaborate with AI.
@yeemanchoi
All of aider's results are unhinted pass@1 results. They are presented alongside apples-to-apples comparisons with unhinted pass@1 results from the other agents. I have updated the article to make this more clear.
@ellev3n11
No benchmark is perfect. I use benchmarks to get pragmatic, quantitative guidance on how changes to aider affect coding performance and to roughly assess how well models can edit code.
@abacaj
@altryne
You can have aider write commit messages for the code you write too. I mostly do this now to automagically fix all linting nits and add a commit message:
aider --lint --commit
@ashutoshmehra
The refactoring leaderboard is directly relevant to laziness. See the original article that describes the benchmark and how it was used to provoke, assess and ultimately reduce the laziness of GPT 4 Turbo.
@simonw
The README now has more info on the benchmark report yaml. It contains everything needed to reproduce a run, including the git hash of the repo. If you check out that hash, you can see the prompt (and all code) that was used.
@NickADobos
I actually benchmarked gpt-4-turbo preview models at different times of day to detect any load/peak based laziness effects. Didn't find any evidence of that despite a significant effort. This work was part of this effort:
@eyal_eg
The SOTA result from up to 6 iterations with GPT-4o & Opus cost $908.83 to run all 300 problems. So $3.03 per problem.
Using just GPT-4o for a single iteration tied the prior SOTA. It cost $130.91 for all 300, so $0.44 per problem.