Paul Gauthier Profile
Paul Gauthier

@paulgauthier

2,462 Followers · 85 Following · 40 Media · 179 Statuses

Entrepreneur, investor, advisor

Southern California
Joined April 2009
@paulgauthier
Paul Gauthier
2 months
LLMs are bad at returning code in JSON. They write worse code in structured JSON responses. Even gpt-4o-2024-08-06's strict JSON mode. LLMs make more syntax errors and also seem overwhelmed by the burden of JSON formatting.
64
81
533
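This effect is easy to probe informally. Below is a minimal sketch, not aider's benchmark harness: it asks one model for the same function as plain text and wrapped in strict JSON, then syntax-checks both replies with `ast.parse`. The prompt, model choice, and single-sample design are illustrative assumptions.

```python
# Minimal sketch (assumes the `openai` package and OPENAI_API_KEY are set up;
# prompt, model, and single-sample design are illustrative, not the benchmark).
import ast
import json
from openai import OpenAI

client = OpenAI()
TASK = "Write a Python function fib(n) returning the nth Fibonacci number."

def strip_fences(text: str) -> str:
    # Models often wrap plain replies in ``` fences; drop them if present.
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

# Plain reply: the model returns code directly.
plain = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": TASK + " Reply with only the code."}],
).choices[0].message.content

# JSON reply: the same code must be escaped inside a JSON string.
wrapped = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": TASK + ' Reply as JSON: {"code": "..."}'}],
).choices[0].message.content

def parses(src: str) -> bool:
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

print("plain parses:", parses(strip_fences(plain)))
print("json parses:", parses(json.loads(wrapped)["code"]))
```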
@paulgauthier
Paul Gauthier
4 months
Deepseek Coder V2 from @deepseek_ai is now the top scoring model on aider's code editing leaderboard!
11
33
360
@paulgauthier
Paul Gauthier
26 days
Reflection 70B scored 42% on the aider code editing benchmark, well below Llama3 70B at 49%. I modified aider to ignore the <thinking/reflection> tags. This model won't work properly with the released aider.
14
27
291
@paulgauthier
Paul Gauthier
1 month
Aider v0.52.0 can suggest & run shell commands:
- Launch a browser to view html/css/js.
- Install dependencies.
- Run DB migrations.
- Run new tests.
- Move files & dirs.
- Etc.
Aider wrote 68% of the code in this release. Full change log:
23
44
271
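A minimal sketch of what a suggest-and-confirm flow like this can look like (illustrative, not aider's implementation; the prompt text and example command are assumptions):

```python
# Hedged sketch of a confirm-then-run shell command flow (not aider's code).
import subprocess

def offer_command(cmd: str) -> str | None:
    # Ask before running anything the LLM suggested.
    if input(f"Run shell command? {cmd} [y/n] ").strip().lower() != "y":
        return None
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    # The combined output can then be shared back with the LLM.
    return result.stdout + result.stderr

offer_command("python -m pytest -x")  # e.g., run the new tests
```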
@paulgauthier
Paul Gauthier
1 month
Aider v0.51.0
- Prompt caching for Anthropic models with --cache-prompts.
- Repo map speedups in large/mono repos.
- Improved Jupyter Notebook .ipynb file editing.
- Aider wrote 56% of the code in this release.
Full change log:
16
34
269
@paulgauthier
Paul Gauthier
2 months
The new chatgpt-4o-latest is a bit worse at code editing than the prior 4o models. This continues the trend that each OpenAI update within a model family tends to be a bit worse than the last.
20
37
251
@paulgauthier
Paul Gauthier
19 days
OpenAI o1-preview is SOTA at 79.7% on the aider code editing leaderboard, with "whole" edit format. With the practical "diff" format, it ranks between GPT-4o & Sonnet.
80% o1-preview whole
77% Sonnet diff
75% o1-preview diff
75% Sonnet whole
71% GPT-4o
13
28
246
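For reference: with "whole", the model returns the entire updated file; with "diff", it returns SEARCH/REPLACE blocks naming only the changed lines. A toy applier for a diff-style edit might look like this sketch (illustrative, not aider's actual parser):

```python
# Toy SEARCH/REPLACE applier (illustrative; aider's real parser is more lenient).
def apply_search_replace(file_text: str, search: str, replace: str) -> str:
    # A diff-style edit is only safe if the SEARCH text matches exactly once.
    if file_text.count(search) != 1:
        raise ValueError("SEARCH block must match exactly once")
    return file_text.replace(search, replace)

original = "def add(a, b):\n    return a + b\n"
edited = apply_search_replace(
    original,
    search="    return a + b\n",
    replace="    return a + b  # edited\n",
)
assert "# edited" in edited
```

The diff format saves output tokens on large files, which is why it's the practical choice; models that fumble the exact-match requirement score worse with it.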
@paulgauthier
Paul Gauthier
1 month
Recently there has been a lot of speculation that Sonnet has been dumbed-down, nerfed or is otherwise performing worse. Sonnet seems as good as ever, when performing the aider code editing benchmark via the API.
22
16
243
@paulgauthier
Paul Gauthier
1 month
Aider v0.53.0 can keep your prompt cache from expiring. It can ping Anthropic every 5 minutes to keep the cache warm with --cache-keepalive-pings. Aider wrote 59% of the code in this release.
12
24
211
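A rough sketch of the keepalive idea, under the assumption that Anthropic's prompt cache expires after a few idle minutes: re-send the cached prefix on a timer so it stays warm. The `ping_fn` callable and this helper are hypothetical, not aider's code.

```python
# Hypothetical keepalive helper: call ping_fn (which would re-send the cached
# prefix with a trivial message) every `interval_s` seconds, `pings` times.
import threading
import time

def keep_cache_warm(ping_fn, interval_s=5 * 60, pings=3):
    def loop():
        for _ in range(pings):
            time.sleep(interval_s)
            ping_fn()  # cheap request that hits the cache and resets its TTL
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```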
@paulgauthier
Paul Gauthier
2 months
Aider v0.50.0
- Infinite output for DeepSeek Coder, Mistral and Anthropic models.
- DeepSeek Coder: 8k token output, new --deepseek shortcut.
- Aider wrote 66% of the code in this release.
Full change log:
11
28
196
@paulgauthier
Paul Gauthier
20 days
First benchmark run of o1-mini has it ~tied with gpt-4o on aider's code editing benchmark. This article will be updated as additional benchmark runs complete:
7
21
184
@paulgauthier
Paul Gauthier
2 months
DeepSeek Coder V2 0724 is #2 on aider's leaderboard! It can efficiently edit code with SEARCH/REPLACE, unlike the prior version. This unlocks the ability to edit large files. Coder (73%) is close to Sonnet (77%) but 20-50X cheaper!
5
21
173
@paulgauthier
Paul Gauthier
4 months
Aider scored a SOTA 26.3% on the SWE Bench Lite benchmark. Mainly via existing features for static code analysis, reliable LLM code editing, and pragmatic UX for AI pair programming. Without RAG, vector search, LLM tools, web search and other strongly agentic behaviors.
9
27
158
@paulgauthier
Paul Gauthier
2 months
The new gpt-4o-2024-08-06 scores the same as original gpt-4o on aider's code editing benchmarks. At half the cost.
77% Sonnet
73% DeepSeek Coder V2 0724
73% GPT-4o
72% GPT-4o (2024-08-06)
8
24
157
@paulgauthier
Paul Gauthier
2 months
Aider is writing more of its own code since the new Sonnet came out. The release history page now charts how much of each release was written by aider. Generally between 5% and 30%. The last couple of releases were >40% written by aider.
5
13
158
@paulgauthier
Paul Gauthier
2 months
Code editing skill for the new models, with Sonnet & GPT-3.5 for scale.
77% claude-3.5-sonnet
73% DeepSeek Coder V2 0724
66% llama-3.1-405b-instruct
60% Mistral Large 2 (2407)
59% llama-3.1-70b-instruct
58% gpt-3.5-turbo-0301
38% llama-3.1-8b-instruct
13
20
154
@paulgauthier
Paul Gauthier
22 days
Aider v0.56.0
- Prompt caching for Sonnet via OpenRouter, by GitHub user fry69.
- 8k output tokens for Sonnet via VertexAI and DeepSeek V2.5.
- Use --chat-language to set spoken language.
- Aider wrote 56% of the code in this release.
Full change log:
8
12
152
@paulgauthier
Paul Gauthier
11 days
Aider v0.57.0
- Support for OpenAI o1 models.
- o1-preview now SOTA with diff edit format, not whole.
- Support for 08-2024 Cohere models.
- Aider wrote 70% of the code in this release.
Full change log:
6
20
183
@paulgauthier
Paul Gauthier
1 month
Gemini 1.5 Pro 0827 performs similarly to Llama 405b on aider's code editing benchmark. Results for the new Pro and Flash, with others for scale:
77% Sonnet
67% Gemini 1.5 Pro 0827
66% Llama 405b
58% GPT 3.5 Turbo 0301
53% Gemini 1.5 Flash 0827
11
11
137
@paulgauthier
Paul Gauthier
4 months
Aider is SOTA on the main SWE Bench, scoring 18.9% vs Devin at 13.9% and AmazonQ at 13.8%. So aider is now SOTA on both SWE Bench & SWE Bench Lite. Achieved via static code analysis, reliable LLM code editing, and auto-fixing lint/test errors; not slow, expensive "agentic" behaviors.
9
23
134
@paulgauthier
Paul Gauthier
1 month
Aider v0.54.0
- New: gemini/gemini-1.5-pro-exp-0827 & gemini/gemini-1.5-flash-exp-0827
- Shell cmds can be interactive, share output with the LLM
- Install latest aider with --upgrade
Aider wrote 64% of the code in this release. Full change log:
7
8
128
@paulgauthier
Paul Gauthier
2 months
Aider v0.49.0
- Add read-only files with /read, even from outside the git repo.
- Paste images/text into the chat with /clipboard.
- /web shows markdown version of scraped page.
- Aider wrote 61% of the code in this release.
See release notes for more:
6
12
127
@paulgauthier
Paul Gauthier
2 months
Aider v0.48.0
- Perf improvements for large/mono repos, including --subtree-only to focus on current subtree.
- Add images from clipboard with /add-clipboard-image
- Sonnet 8k output.
- Aider wrote 44% of the code in this release.
5
11
122
@paulgauthier
Paul Gauthier
3 months
Sonnet is the opposite of lazy! It's so diligent it writes more code than the 4k output token limit allows. Aider now lets Sonnet spread code across multiple API calls. This jumped it from 55% to 64% on the refactoring leaderboard.
6
17
117
@paulgauthier
Paul Gauthier
12 days
New Qwen 2.5 models are on aider's leaderboard:
65% Qwen-2.5-72b-instruct
52% Qwen2.5-coder:7b-instruct-q8_0
The 7b model scored below the 57% reported by the Qwen Coder team, likely due to quantization. Thanks @youknow04 for benchmarking the 7b model.
5
17
106
@paulgauthier
Paul Gauthier
2 months
Aider v0.46.0
- New /ask <question> command to ask about your code, without making any edits.
- New /chat-mode <mode> command to switch chat modes (ask, help, code).
- Aider wrote 45% of the code in this release.
4
6
96
@paulgauthier
Paul Gauthier
27 days
Yi-Coder scored below GPT-3.5 on aider's code editing benchmark. GitHub user cheahjs submitted results for the full 9b model and a q4_0 version. Results, with other models for scale:
77% Sonnet
58% GPT-3.5
54% Yi-Coder-9b-Chat
45% Yi-Coder-9b-Chat-q4_0
5
13
94
@paulgauthier
Paul Gauthier
2 months
Aider v0.47.0
- Now uses "Conventional Commit" messages, customize with --commit-prompt.
- New docker image, includes all extras.
- Better lint flow.
The first release where aider wrote more code than I did (58% vs 42%).
4
8
93
@paulgauthier
Paul Gauthier
3 months
GPT 4o mini scores like the original GPT 3.5 on aider's code editing benchmark (later 3.5s were worse). It doesn't seem capable of editing code with diffs at first blush, which limits its use to smaller files.
5
6
89
@paulgauthier
Paul Gauthier
27 days
DeepSeek Chat V2.5 scored ~same as Coder V2 on aider's code editing benchmark. V2.5 did ~same as V2 on the challenging refactor benchmark. Still far behind Opus/Sonnet/GPT-4o there, so probably less useful for real world coding tasks.
7
13
89
@paulgauthier
Paul Gauthier
22 days
Great new aider tutorial from @IndyDevDan showing an excellent workflow for incrementally building non-trivial software. This isn't a snake game or one-shot todo app prototype. Aider works best when you build big things in small steps.
3
7
86
@paulgauthier
Paul Gauthier
2 months
Mistral Large 2 (2407) scored only 60% on aider's code editing benchmark. This puts it just ahead of the best GPT-3.5 model. It doesn't seem able to reliably use search/replace to efficiently edit code. It's been a busy day of leaderboard updates!
4
10
81
@paulgauthier
Paul Gauthier
3 months
Claude 3.5 Sonnet is now the top ranked model on aider’s code editing leaderboard! DeepSeek Coder V2 took the #1 spot only 4 days ago. Sonnet ranked #1 with the “whole” editing format. It also scored very well with aider’s “diff” editing format.
3
16
77
@paulgauthier
Paul Gauthier
2 months
Aider's token usage at OpenRouter is up 3X in 3 weeks! Now the #10 app, using 111M tok/day. Aider is their #2 user of Sonnet and #12 of GPT-4o. Aider users seem to like @OpenRouterAI's high rate limits and good reliability.
3
5
77
@paulgauthier
Paul Gauthier
28 days
Aider v0.55.0 is a major bugfix release.
- Offer to create a GitHub Issue pre-filled with info about any crashes.
- Numerous corner case bug fixes submitted via pre-filled issues.
- Aider wrote 53% of the code in this release.
Full change log:
7
11
76
@paulgauthier
Paul Gauthier
2 months
DeepSeek Chat V2 0628 from @deepseek_ai is #4 on aider's code editing leaderboard! Just ahead of Opus, close behind GPT-4o. It can efficiently edit large files using diffs. Priced lower than GPT-4o mini, but with frontier coding skill!
3
10
68
@paulgauthier
Paul Gauthier
3 months
Aider v0.43.0
Aider now has a built-in chatbot that you can ask for help about using aider, customizing settings, troubleshooting, using LLMs, etc. Type "/help <question>" and aider will respond with helpful information.
4
6
68
@paulgauthier
Paul Gauthier
4 months
@deepseek_ai I can confirm. DeepSeek Coder V2 is now on top of aider's code editing leaderboard, ahead of GPT-4o and Opus. It benchmarked even better for me than the result shown in @deepseek_ai's graph.
1
12
66
@paulgauthier
Paul Gauthier
3 months
Aider v0.44.0
- Default pip install size reduced by 3-12x.
- Added 3 package extras, which aider will offer to install when needed:
  - aider-chat[help]
  - aider-chat[browser]
  - aider-chat[playwright]
- Aider wrote 29% of the code in this release.
6
6
65
@paulgauthier
Paul Gauthier
3 months
Aider is #1 for DeepSeek Coder V2 on OpenRouter, and #8 for Claude 3.5 Sonnet this week. That's a lot of AI pair programming productivity!
5
1
54
@paulgauthier
Paul Gauthier
21 days
The 08-2024 versions of Command-R and Command-R+ perform almost identically on aider's code editing benchmark. They come in above the older version of Command-R+ and just behind our favorite Reflection 70B model.
5
6
53
@paulgauthier
Paul Gauthier
5 months
GPT-4o tops the aider LLM code editing leaderboard at 72.9%, versus 68.4% for Opus. GPT-4o takes second on aider's refactoring leaderboard with 62.9%, versus Opus at 72.3%. GPT-4o did much better than the 4-turbo models, and seems *much* less lazy.
1
9
48
@paulgauthier
Paul Gauthier
3 months
Aider v0.42.0 - Performance release:
- 5X faster launch!
- Faster auto-complete in large git repos (users report ~100X speedup)!
5
2
43
@paulgauthier
Paul Gauthier
25 days
For clarity, the 42% score was without the specific recommended system prompt. With that prompt, it scored 43%.
5
0
41
@paulgauthier
Paul Gauthier
5 months
Aider now has LLM leaderboards that rank popular models according to their ability to edit code. Includes GPT-3.5/4 Turbo, Opus, Sonnet, Gemini 1.5 Pro, Llama 3, Deepseek Coder & Command-R+.
3
7
40
@paulgauthier
Paul Gauthier
2 months
Llama 3.1 405B instruct from @AIatMeta is #7 on aider's leaderboard, well behind Claude 3.5 Sonnet & GPT-4o. When using SEARCH/REPLACE to efficiently edit code, it drops to #11.
3
8
40
@paulgauthier
Paul Gauthier
4 months
Aider v0.38.0
- Use --vim for vim keybindings in the chat.
- Add LLM metadata via .aider.models.json file (by caseymcc).
- More detailed error messages on token limit errors.
- Single line commit messages, without the recent chat messages.
0
4
39
@paulgauthier
Paul Gauthier
1 month
@itsPaulAi This is a great video, thanks for making and sharing it. I've added it to aider's tutorial page.
1
6
36
@paulgauthier
Paul Gauthier
3 months
Aider v0.41.0
- Allow Claude 3.5 Sonnet to stream back >4k tokens!
- It is the first model capable of writing such large coherent, useful code edits.
- Do large refactors or generate multiple files of new code in one go.
- Aider now uses `claude-3-5-sonnet-20240620` by
2
4
32
@paulgauthier
Paul Gauthier
1 month
@karpathy @_devalias Aider has had this for a while. You can have it fix any linting nits and make a commit message for you with: aider --lint --commit
2
1
31
@paulgauthier
Paul Gauthier
3 months
Aider v0.40.0
- Improved prompting to discourage Sonnet from wasting tokens emitting unchanging code (#705).
- Improved error info for token limit errors.
- Options to suppress adding "(aider)" to the git author and committer names.
- Use --model-settings-file to customize
1
0
31
@paulgauthier
Paul Gauthier
3 months
Aider v0.39.0
- Use --sonnet for Claude 3.5 Sonnet, which is the top model on aider's LLM code editing leaderboard.
- All AIDER_xxx environment variables can now be set in `.env` (by jpshack-at-palomar).
- Use --llm-history-file to log raw messages sent to the LLM (by
3
5
29
@paulgauthier
Paul Gauthier
3 months
Aider v0.45.0
- Support for GPT 4o mini, using the whole edit format.
- Aider is better at offering to add files to the chat on Windows.
- Aider wrote 42% of the code in this release.
0
4
29
@paulgauthier
Paul Gauthier
28 days
Aider wrote >800 lines of code in this release. A new record.
2
1
26
@paulgauthier
Paul Gauthier
3 months
Aider is the #20 agent using OpenRouter this week, with 249M tokens of AI coding goodness! Aider uses @ishaan_jaff's LiteLLM, and is responsible for more than half of LiteLLM's total OpenRouter usage.
1
1
21
@paulgauthier
Paul Gauthier
2 months
@aaron__vi The problem isn’t the json. That’s fine. The code inside is worse and has syntax errors.
2
0
19
@paulgauthier
Paul Gauthier
7 months
Claude 3 beat GPT-4 on aider's code editing benchmark. I’ve been benchmarking the Claude 3 models using Aider’s code editing benchmark suite. Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.
2
5
17
@paulgauthier
Paul Gauthier
1 month
@victormustar For pragmatic productivity actually I recommend the opposite. Let the AI do the portion it can do. Don't waste time forcing it to do everything. Take over, code past the friction, have the AI start the next chunk of work. This accelerates coders, but doesn't help non-coders.
2
0
17
@paulgauthier
Paul Gauthier
4 months
Aider has written 7% of its own code, via 600+ commits that inserted 4.8K and deleted 1.5K lines of code.
2
1
17
@paulgauthier
Paul Gauthier
4 months
The code for running aider against SWE Bench Lite is up on GitHub. Here's the ~20 line pseudo-code for the "aider agent" that obtained the SOTA result.
1
0
16
@paulgauthier
Paul Gauthier
4 months
Aider now lints your code after every LLM edit, and offers to automatically fix any errors. It uses tree-sitter to both lint the code and send any errors to the LLM surrounded by code context.
3
0
15
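A minimal sketch of tree-sitter-based error detection, assuming recent py-tree-sitter and tree-sitter-python packages (API details vary by version; this is not aider's linter):

```python
# Hedged sketch: parse edited source with tree-sitter and flag syntax errors.
# Assumes py-tree-sitter >= 0.22 and the tree-sitter-python grammar package.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser()
parser.language = Language(tspython.language())

def has_syntax_errors(source: str) -> bool:
    tree = parser.parse(source.encode("utf-8"))
    return tree.root_node.has_error  # True if the tree contains ERROR nodes

print(has_syntax_errors("def ok():\n    return 1\n"))     # False
print(has_syntax_errors("def broken(:\n    return 1\n"))  # True
```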
@paulgauthier
Paul Gauthier
2 months
@Shawnryan96 No doubt. The price and speed improvements over the last year have been astounding.
1
0
13
@paulgauthier
Paul Gauthier
19 days
o1-mini did significantly worse with aider's "diff" edit format, which allows models to efficiently specify changes using search/replace blocks. Most other frontier models do well with the diff format. Updating here as results complete:
2
1
13
@paulgauthier
Paul Gauthier
4 months
@BorisMPower made a fair point on the layout of the previous SWE Bench Lite graph. It has been corrected. Also, it now shows apples-to-apples comparisons, using pass@1 from AutoCodeRover and unhinted OpenDevin results.
1
2
13
@paulgauthier
Paul Gauthier
2 months
@AdjectiveAlli All LLMs have a limit on how many tokens they can output. When they hit the limit, they stop and raise an error. Aider works around this by asking them to continue where they left off in the next message.
1
0
12
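A minimal sketch of that workaround, assuming the OpenAI-style chat API (model name and prompt plumbing are illustrative, not aider's code): when a completion stops with finish_reason == "length", append the partial answer and ask the model to continue, then stitch the pieces together.

```python
# Hedged sketch of the continue-on-length workaround (not aider's code).
from openai import OpenAI

client = OpenAI()

def complete_unbounded(messages, model="gpt-4o", max_rounds=10):
    parts = []
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model=model, messages=messages)
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":
            break  # the model finished on its own
        # Feed the partial answer back and ask for the rest.
        messages = messages + [
            {"role": "assistant", "content": choice.message.content},
            {"role": "user", "content": "Continue exactly where you left off."},
        ]
    return "".join(parts)
```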
@paulgauthier
Paul Gauthier
6 months
OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.
3
4
12
@paulgauthier
Paul Gauthier
2 months
@AdjectiveAlli The models can output as much code as they want, without regard for their output token limit.
1
1
11
@paulgauthier
Paul Gauthier
5 months
Use aider's new browser UI to collaborate with LLMs to edit code in your local git repo.
2
1
11
@paulgauthier
Paul Gauthier
1 month
@drivelinekyle I just updated the tips section of the docs. Let me know if this is helpful.
0
0
11
@paulgauthier
Paul Gauthier
11 days
@amirpc I tuned aider to be more permissive & accept the type of malformed search/replace blocks that o1-preview generates. I do this sort of prompt and editing backend optimization with all the top models. Aider tries to be very permissive about accepting edits from LLMs.
2
0
12
@paulgauthier
Paul Gauthier
1 month
@tom_doerr Aider tunes the conversation to be cacheable. I think it's a win if you send even just 2 messages. Cache write is 1.25x cost. Hit is 0.1x. So you pay 1.35x instead of 2x cost.
1
1
11
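The arithmetic in that reply, as a worked example (the multipliers are Anthropic's published prompt-caching prices relative to base input tokens; the message counts are illustrative):

```python
# Worked example of the break-even math above: cache writes cost 1.25x the
# base input price, cache hits 0.1x. Sending the same prefix twice:
base = 1.0                           # cost of the prompt prefix, uncached
uncached = 2 * base                  # two plain sends: 2.00x
cached = 1.25 * base + 0.10 * base   # one cache write + one cache hit: 1.35x
print(uncached, cached)

# More generally, n messages reusing the prefix cost 1.25 + 0.1*(n - 1)
# instead of n, so caching wins for any n >= 2.
n = 5
print(n * base, 1.25 * base + 0.10 * base * (n - 1))
```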
@paulgauthier
Paul Gauthier
5 months
@teortaxesTex The deepseek v2 result was unexpected, yes. Surprisingly it was able to use the "diff" edit format. Most smaller models are only able to send back "whole" copies of source files with updates included. Being able to send diffs allows edits to large files and saves tokens.
1
0
9
@paulgauthier
Paul Gauthier
1 month
@Sanity You can do this in aider: /git checkout -b branch-name Many folks have strong feelings about branch strategies, so hard for aider to directly adopt one.
1
0
9
@paulgauthier
Paul Gauthier
28 days
@RockzMRockz @deepseek_ai @AnthropicAI Yes aider supports prompt caching for deepseek too.
1
0
8
@paulgauthier
Paul Gauthier
8 months
I ran the new `gpt-4-0125-preview` model through aider's laziness benchmark. Even though OpenAI claims it is less lazy, it appears to be lazier than the previous `gpt-4-1106-preview` model.
1
1
7
@paulgauthier
Paul Gauthier
2 months
@eyal_eg Sonnet.
0
0
8
@paulgauthier
Paul Gauthier
25 days
@ruben_kostard I re-ran it with the specified system prompt and got 43%. So it didn't seem to make a difference.
1
0
8
@paulgauthier
Paul Gauthier
5 months
Just now added @deepseek_ai 's newest DeepSeek-V2 model, which scores better than everything except GPT-4 variants and Opus.
1
0
6
@paulgauthier
Paul Gauthier
5 months
@anthony_barker That's a good question! GPT-3.5 models seem to get worse at code editing with each release. The GPT-4 models did the same, until 4o. No other model family has enough history to establish a trend. So many new models capable of code editing have been released recently.
0
3
7
@paulgauthier
Paul Gauthier
3 months
@NickADobos Maybe. Mini only scored 55% on aider's benchmark, which is underwhelming. Aider supports unlimited output tokens with Sonnet, which has the top benchmark score at 77%. It's my daily driver these days, and is changing how I code with AI.
0
0
7
@paulgauthier
Paul Gauthier
1 month
@NickADobos Some of those data points use prompt caching. When I added prompt caching to aider, I benchmarked it to make sure it didn't cause a performance regression.
0
0
7
@paulgauthier
Paul Gauthier
14 years
hello, twitter
1
1
6
@paulgauthier
Paul Gauthier
5 months
@amanrsanger Super interesting work! Sounds like you use a strong model (opus/gpt-4o) to generate code changes and a weak model to "apply" them to the file? I've played with this for a ~year, but had concerns:
1. Adding a 2nd inference step adds latency. Your work helps here!
2. Can only
1
0
3
@paulgauthier
Paul Gauthier
1 month
@tom_doerr When streaming, Anthropic doesn't return the underlying cache hit data needed for costs.
1
0
6
@paulgauthier
Paul Gauthier
4 months
@OfirPress Copilot is autocomplete. People seem to use "agent" to mean a black box that takes a (hopefully) unambiguous & complete task description; it unilaterally burns time & tokens to (sometimes) solve it. Aider lives in between: a pair programming UX, where you collaborate with AI.
0
0
5
@paulgauthier
Paul Gauthier
2 months
@meekaale Yes, exactly. After you are done /ask-ing questions you can tell aider "ok, do it" and it will code based on that discussion.
0
0
6
@paulgauthier
Paul Gauthier
4 months
@yeemanchoi All of aider's results are unhinted pass@1 results. They are presented alongside apples-to-apples comparisons with unhinted pass@1 results from the other agents. I have updated the article to make this more clear.
0
1
6
@paulgauthier
Paul Gauthier
27 days
@USEnglish215753 The DeepSeek V2 Chat and DeepSeek Coder V2 models have been merged and upgraded into the new model, DeepSeek V2.5.
1
0
6
@paulgauthier
Paul Gauthier
2 months
@redlinetheturk Yes, the rate limits are preventing me from benchmarking it.
0
0
5
@paulgauthier
Paul Gauthier
1 month
@ellev3n11 No benchmark is perfect. I use benchmarks to get pragmatic, quantitative guidance on how changes to aider affect coding performance and to roughly assess how well models can edit code.
0
0
5
@paulgauthier
Paul Gauthier
1 month
@abacaj @altryne You can have aider write commit messages for the code you write too. I mostly do this now to automagically fix all linting nits and add a commit message: aider --lint --commit
0
0
5
@paulgauthier
Paul Gauthier
5 months
@ashutoshmehra The refactoring leaderboard is directly relevant to laziness. See the original article that describes the benchmark and how it was used to provoke, assess and ultimately reduce the laziness of GPT 4 Turbo.
1
0
5
@paulgauthier
Paul Gauthier
1 month
@simonw The README now has more info on the benchmark report yaml. It contains everything needed to reproduce a run, including the git hash of the repo. If you check out that hash, you can see the prompt (and all code) that was used.
0
0
4
@paulgauthier
Paul Gauthier
5 months
@NickADobos I actually benchmarked gpt-4-turbo preview models at different times of day to detect any load/peak based laziness effects. Didn't find any evidence of that despite a significant effort. This work was part of this effort:
0
0
4
@paulgauthier
Paul Gauthier
3 months
@dean_rie I've tried this and quantitatively assessed it against SWE Bench Lite. It didn't help, and actually made things worse.
0
0
4
@paulgauthier
Paul Gauthier
4 months
@eyal_eg The SOTA result from up to 6 iterations with GPT-4o & Opus cost $908.83 to run all 300 problems. So $3.03 per problem. Using just GPT-4o for a single iteration tied the prior SOTA. It cost $130.91 for all 300, so $0.44 per problem.
1
0
2
@paulgauthier
Paul Gauthier
2 months
@bitdeep_ Sonnet mostly.
1
0
4