LLMs are bad at returning code in JSON: they write worse code when it has to be wrapped in structured JSON responses, even with gpt-4o-2024-08-06's strict JSON mode.
LLMs make more syntax errors and also seem overwhelmed by the burden of JSON formatting.
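As a rough sketch of the two request styles being compared (the schema name and fields here are illustrative, not the actual benchmark harness):
from openai import OpenAI

client = OpenAI()
task = "Write a Python function that reverses a string."

# Plain request: the model returns code as ordinary markdown text.
plain = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": task}],
)

# Structured request: the same code must be escaped into a JSON string field.
structured = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": task}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "code_answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
                "additionalProperties": False,
            },
        },
    },
)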
Reflection 70B scored 42% on the aider code editing benchmark, well below Llama3 70B at 49%.
I modified aider to ignore the <thinking/reflection> tags. This model won't work properly with the released aider.
Aider v0.52.0 can suggest & run shell commands:
- Launch a browser to view html/css/js.
- Install dependencies.
- Run DB migrations.
- Run new tests.
- Move files & dirs.
- Etc.
Aider wrote 68% of the code in this release.
Full change log:
Aider v0.51.0
- Prompt caching for Anthropic models with --cache-prompts.
- Repo map speedups in large/mono repos.
- Improved Jupyter Notebook .ipynb file editing.
- Aider wrote 56% of the code in this release.
Full change log:
The new chatgpt-4o-latest is a bit worse at code editing than the prior 4o models. This continues the trend that each OpenAI update within a model family tends to be a bit worse than the last.
OpenAI o1-preview is SOTA at 79.7% on the aider code editing leaderboard, with "whole" edit format. With the practical "diff" format, it ranks between GPT-4o & Sonnet.
80% o1-preview whole
77% Sonnet diff
75% o1-preview diff
75% Sonnet whole
71% GPT-4o
Recently there has been a lot of speculation that Sonnet has been dumbed-down, nerfed or is otherwise performing worse. Sonnet seems as good as ever, when performing the aider code editing benchmark via the API.
Aider v0.53.0 can keep your prompt cache from expiring. It can ping the Anthropic API every 5 minutes to keep the cache warm, with --cache-keepalive-pings.
Aider wrote 59% of the code in this release.
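A possible invocation combining both caching flags (the ping count is just an example):
aider --cache-prompts --cache-keepalive-pings 12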
Aider v0.50.0
- Infinite output for DeepSeek Coder, Mistral and Anthropic models.
- DeepSeek Coder: 8k token output, new --deepseek shortcut.
- Aider wrote 66% of the code in this release.
Full change log:
First benchmark run of o1-mini has it ~tied with gpt-4o on aider's code editing benchmark.
This article will be updated as additional benchmark runs complete:
DeepSeek Coder V2 0724 is #2 on aider's leaderboard! It can efficiently edit code with SEARCH/REPLACE, unlike the prior version. This unlocks the ability to edit large files. Coder (73%) is close to Sonnet (77%) but 20-50X cheaper!
Aider scored a SOTA 26.3% on the SWE Bench Lite benchmark.
Mainly via existing features for static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming. Without RAG, vector search, LLM tools, web search and other strongly agentic behaviors.
The new gpt-4o-2024-08-06 scores about the same as the original gpt-4o on aider's code editing benchmarks. At half the cost.
77% Sonnet
73% DeepSeek Coder V2 0724
73% GPT-4o
72% GPT-4o (2024-08-06)
Aider is writing more of its own code since the new Sonnet came out.
Release history page now charts how much of each release was written by aider. Generally between 5-30%. Last couple of releases were >40% written by aider.
Aider v0.56.0
- Prompt caching for Sonnet via OpenRouter, by GitHub user fry69.
- 8k output tokens for Sonnet via VertexAI and DeepSeek V2.5.
- Use --chat-language to set spoken language.
- Aider wrote 56% of the code in this release.
Full change log:
Aider v0.57.0
- Support for OpenAI o1 models.
- o1-preview now SOTA with diff edit format, not whole.
- Support for 08-2024 Cohere models.
- Aider wrote 70% of the code in this release.
Full change log:
Gemini 1.5 Pro 0827 performs similarly to Llama 405b on aider's code editing benchmark. Results for the new Pro and Flash, with others for scale:
77% Sonnet
67% Gemini 1.5 Pro 0827
66% Llama 405b
58% GPT 3.5 Turbo 0301
53% Gemini 1.5 Flash 0827
Aider is SOTA on the main SWE Bench, scoring 18.9% vs Devin at 13.9% and AmazonQ at 13.8%. So aider is now SOTA on both SWE Bench & SWE Bench Lite.
Achieved via static code analysis, reliable LLM code editing, auto-fixing lint/test errors; not slow, expensive "agentic" behaviors.
Aider v0.54.0
- New: gemini/gemini-1.5-pro-exp-0827 & gemini/gemini-1.5-flash-exp-0827
- Shell cmds can be interactive, share output with the LLM
- Install latest aider with --upgrade
Aider wrote 64% of the code in this release.
Full change log:
Aider v0.49.0
- Add read-only files with /read, even from outside the git repo.
- Paste images/text into the chat with /clipboard.
- /web shows markdown version of scraped page.
- Aider wrote 61% of the code in this release.
See release notes for more:
Aider v0.48.0
- Perf improvements for large/mono repos, including --subtree-only to focus on current subtree.
- Add images from clipboard with /add-clipboard-image
- Sonnet 8k output.
- Aider wrote 44% of the code in this release.
Sonnet is the opposite of lazy! It's so diligent it writes more code than fits in the 4k output token limit.
Aider now lets Sonnet spread code across multiple API calls. This jumped it from 55% to 64% on the refactoring leaderboard.
New Qwen 2.5 models are on aider's leaderboard:
65% Qwen-2.5-72b-instruct
52% Qwen2.5-coder:7b-instruct-q8_0
The 7b model scored below the 57% reported by the Qwen Coder team, likely due to quantization. Thanks @youknow04 for benchmarking the 7b model.
Aider v0.46.0
- New /ask <question> command to ask about your code, without making any edits.
- New /chat-mode <mode> command to switch chat modes (ask, help, code).
- Aider wrote 45% of the code in this release.
Yi-Coder scored below GPT-3.5 on aider's code editing benchmark. GitHub user cheahjs submitted results for the full 9b model and a q4_0 version. Results, with other models for scale:
77% Sonnet
58% GPT-3.5
54% Yi-Coder-9b-Chat
45% Yi-Coder-9b-Chat-q4_0
Aider v0.47.0
- Now uses "Conventional Commit" messages, customize with --commit-prompt.
- New docker image, includes all extras.
- Better lint flow.
The first release where aider wrote more code than I did (58% vs 42%).
GPT 4o mini scores like the original GPT 3.5 on aider's code editing benchmark (later 3.5s were worse). At first blush, it doesn't seem capable of editing code with diffs, which limits its use to smaller files.
DeepSeek Chat V2.5 scored ~same as Coder V2 on aider's code editing benchmark.
V2.5 did ~same as V2 on the challenging refactor benchmark. Still far behind Opus/Sonnet/GPT-4o there, so probably less useful for real world coding tasks.
Great new aider tutorial from @IndyDevDan showing an excellent workflow for incrementally building non-trivial software. This isn't a snake game or one-shot todo app prototype. Aider works best when you build big things in small steps.
Mistral Large 2 (2407) scored only 60% on aider's code editing benchmark. This puts it just ahead of the best GPT-3.5 model. It doesn't seem able to reliably use search/replace to efficiently edit code.
It's been a busy day of leaderboard updates!
Claude 3.5 Sonnet is now the top ranked model on aider’s code editing leaderboard! DeepSeek Coder V2 took the #1 spot only 4 days ago.
Sonnet ranked #1 with the “whole” editing format. It also scored very well with aider’s “diff” editing format.
Aider's token usage at OpenRouter is up 3X in 3 weeks! Now the #10 app, using 111M tok/day. Aider is their #2 user of Sonnet and #12 of GPT-4o.
Aider users seem to like @OpenRouterAI's high rate limits and good reliability.
Aider v0.55.0 is a major bugfix release.
- Offer to create a GitHub Issue pre-filled with info about any crashes.
- Numerous corner case bug fixes submitted via pre-filled issues.
- Aider wrote 53% of the code in this release.
Full change log:
DeepSeek Chat V2 0628 from @deepseek_ai is #4 on aider's code editing leaderboard! Just ahead of Opus, close behind GPT-4o. It can efficiently edit large files using diffs. Priced lower than GPT-4o mini, but with frontier coding skill!
Aider v0.43.0
Aider now has a built in chatbot that you can ask for help about using aider, customizing settings, troubleshooting, using LLMs, etc. Type "/help <question>" and aider will respond with helpful information.
@deepseek_ai
I can confirm. DeepSeek Coder V2 is now on top of aider's code editing leaderboard, ahead of GPT-4o and Opus. It benchmarked even better for me than the result shown in @deepseek_ai's graph.
Aider v0.44.0
- Default pip install size reduced by 3-12x.
- Added 3 package extras, which aider will offer to install when needed:
- aider-chat[help]
- aider-chat[browser]
- aider-chat[playwright]
- Aider wrote 29% of the code in this release.
The 08-2024 versions of Command-R and Command-R+ perform almost identically on aider's code editing benchmark. They come in above the older version of Command-R+ and just behind our favorite Reflection 70B model.
GPT-4o tops the aider LLM code editing leaderboard at 72.9%, versus 68.4% for Opus. GPT-4o takes second on aider's refactoring leaderboard with 62.9%, versus Opus at 72.3%.
GPT-4o did much better than the 4-turbo models, and seems *much* less lazy.
Aider now has LLM leaderboards that rank popular models according to their ability to edit code. Includes GPT-3.5/4 Turbo, Opus, Sonnet, Gemini 1.5 Pro, Llama 3, Deepseek Coder & Command-R+.
Llama 3.1 405B instruct from @AIatMeta is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.
Aider v0.38.0
- Use --vim for vim keybindings in the chat.
- Add LLM metadata via .aider.models.json file (by caseymcc).
- More detailed error messages on token limit errors.
- Single line commit messages, without the recent chat messages.
Aider v0.41.0
- Allow Claude 3.5 Sonnet to stream back >4k tokens!
- It is the first model capable of writing such large coherent, useful code edits.
- Do large refactors or generate multiple files of new code in one go.
- Aider now uses `claude-3-5-sonnet-20240620` by
Aider v0.40.0
- Improved prompting to discourage Sonnet from wasting tokens emitting unchanging code (#705).
- Improved error info for token limit errors.
- Options to suppress adding "(aider)" to the git author and committer names.
- Use --model-settings-file to customize
Aider v0.39.0
- Use --sonnet for Claude 3.5 Sonnet, which is the top model on aider's LLM code editing leaderboard.
- All AIDER_xxx environment variables can now be set in `.env` (by jpshack-at-palomar).
- Use --llm-history-file to log raw messages sent to the LLM (by
Aider v0.45.0
- Support for GPT 4o mini, using the whole edit format.
- Aider is better at offering to add files to the chat on Windows.
- Aider wrote 42% of the code in this release.
Aider is the #20 agent using OpenRouter this week, with 249M tokens of AI coding goodness! Aider uses @ishaan_jaff's LiteLLM, and is responsible for more than half of LiteLLM's total OpenRouter usage.
Claude 3 beat GPT-4 on aider's code editing benchmark. I’ve been benchmarking the Claude 3 models using Aider’s code editing benchmark suite. Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.
@victormustar
For pragmatic productivity actually I recommend the opposite. Let the AI do the portion it can do. Don't waste time forcing it to do everything. Take over, code past the friction, have the AI start the next chunk of work. This accelerates coders, but doesn't help non-coders.
The code for running aider against SWE Bench Lite is up on GitHub. Here's the ~20 line pseudo-code for the "aider agent" that obtained the SOTA result.
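The repo has the real thing; very roughly, the retry loop looks something like this (helper names and fields here are illustrative pseudo-code, not the actual harness):

def solve_one(problem, models=("gpt-4o", "claude-3-opus"), max_tries=6):
    for attempt in range(max_tries):
        # Alternate models across attempts.
        model = models[attempt % len(models)]
        # Run aider on the repo, using the problem statement as the prompt.
        result = run_aider(problem.repo, problem.statement, model=model)
        # Accept the first attempt that edits files and passes lint + tests.
        if result.edited_files and result.lint_ok and result.tests_ok:
            return result.diff
    return None  # no plausible solution found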
Aider now lints your code after every LLM edit, and offers to automatically fix any errors. It uses tree-sitter to both lint the code and send any errors to the LLM surrounded by code context.
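A minimal sketch of the tree-sitter half of that check, assuming the tree_sitter_languages helper package (aider's real lint flow does more, including running external linters and packaging code context for the LLM):

from tree_sitter_languages import get_parser

def syntax_errors(code, lang="python"):
    parser = get_parser(lang)
    tree = parser.parse(code.encode("utf-8"))
    errors = []
    # Walk the parse tree, collecting ERROR and missing nodes with positions.
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type == "ERROR" or node.is_missing:
            errors.append((node.start_point, node.end_point))
        stack.extend(node.children)
    return errors

print(syntax_errors("def broken(:\n    pass\n"))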
o1-mini did significantly worse with aider's "diff" edit format, which allows models to efficiently specify changes using search/replace blocks. Most other frontier models do well with the diff format.
Updating here as results complete:
@BorisMPower
made a fair point on the layout of the previous SWE Bench Lite graph. It has been corrected. Also, it now shows apples-to-apples comparisons, using pass@1 from AutoCodeRover and unhinted OpenDevin results.
@AdjectiveAlli
All LLMs have a limit on how many tokens they can output. When they hit the limit, they stop and raise an error. Aider works around this by asking them to continue where they left off in the next message.
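Roughly, that continuation loop can be sketched like this (a simplification; aider's actual resume prompt and edit-format handling are more careful), using LiteLLM for the API call:

import litellm

def chat_unlimited(messages, model="claude-3-5-sonnet-20240620"):
    parts = []
    while True:
        resp = litellm.completion(model=model, messages=messages)
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break  # the model finished on its own
        # Hit the output limit: keep the partial reply, ask the model to resume.
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]
    return "".join(parts)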
OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.
@amirpc
I tuned aider to be more permissive & accept the type of malformed search/replace blocks that o1-preview generates. I do this sort of prompt and editing backend optimization with all the top models. Aider tries to be very permissive about accepting edits from LLMs.
@tom_doerr
Aider tunes the conversation to be cachable. I think it’s a win if you send even just 2 messages. Cache write is 1.25x cost. Hit is 0.1x. So you pay 1.35x instead of 2x cost.
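As rough arithmetic, relative to the normal input cost of the shared prefix:

no_cache   = 1.0 + 1.0    # prefix billed at full price on both requests = 2.0x
with_cache = 1.25 + 0.10  # one cache write, then one cache hit = 1.35x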
@teortaxesTex
The deepseek v2 result was unexpected, yes.
Surprisingly it was able to use the "diff" edit format. Most smaller models are only able to send back "whole" copies of source files with updates included. Being able to send diffs allows edits to large files and saves tokens.
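For illustration, a single aider-style SEARCH/REPLACE edit looks roughly like this (file name and code are made up):

greeting.py
<<<<<<< SEARCH
def greet(name):
    print("Hi " + name)
=======
def greet(name: str) -> None:
    print(f"Hi {name}")
>>>>>>> REPLACE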
@Sanity
You can do this in aider: /git checkout -b branch-name
Many folks have strong feelings about branch strategies, so hard for aider to directly adopt one.
I ran the new `gpt-4-0125-preview` model through aider's laziness benchmark. Even though OpenAI claims it is less lazy, it appears to be lazier than the previous `gpt-4-1106-preview` model.
@anthony_barker
That's a good question!
GPT-3.5 models seem to get worse at code editing with each release. The GPT-4 models did the same, until 4o.
No other model family has enough history to establish a trend, and so many new models capable of code editing have been released recently.
@NickADobos
Maybe. Mini only scored 55% on aider's benchmark, which is underwhelming. Aider supports unlimited output tokens with Sonnet, which has the top benchmark score at 77%. It's my daily driver these days, and is changing how I code with AI.
@NickADobos
Some of those data points use prompt caching. When I added prompt caching to aider, I benchmarked it to make sure it didn't cause a performance regression.
@amanrsanger
Super interesting work! Sounds like you use a strong model (opus/gpt-4o) to generate code changes and a weak model to "apply" them to the file? I've played with this for a ~year, but had concerns:
1. Adding a 2nd inference step adds latency. Your work helps here!
2. Can only
@OfirPress
Copilot is autocomplete. People seem to use "agent" to mean a black box that takes a (hopefully) unambiguous & complete task description; it unilaterally burns time & tokens to (sometimes) solve it. Aider lives in between: a pair programming UX, where you collaborate with AI.
@yeemanchoi
All of aider's results are unhinted pass@1 results. They are presented alongside apples-to-apples comparisons with unhinted pass@1 results from the other agents. I have updated the article to make this more clear.
@ellev3n11
No benchmark is perfect. I use benchmarks to get pragmatic, quantitative guidance on how changes to aider affect coding performance and to roughly assess how well models can edit code.
@abacaj
@altryne
You can have aider write commit messages for the code you write too. I mostly do this now to automagically fix all linting nits and add a commit message:
aider --lint --commit
@ashutoshmehra
The refactoring leaderboard is directly relevant to laziness. See the original article that describes the benchmark and how it was used to provoke, assess and ultimately reduce the laziness of GPT 4 Turbo.
@simonw
The README now has more info on the benchmark report yaml. It contains everything needed to reproduce a run, including the git hash of the repo. If you check out that hash, you can see the prompt (and all code) that was used.
@NickADobos
I actually benchmarked gpt-4-turbo preview models at different times of day to detect any load/peak based laziness effects. Didn't find any evidence of that despite a significant effort. This work was part of this effort:
@eyal_eg
The SOTA result from up to 6 iterations with GPT-4o & Opus cost $908.83 to run all 300 problems. So $3.03 per problem.
Using just GPT-4o for a single iteration tied the prior SOTA. It cost $130.91 for all 300, so $0.44 per problem.