📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
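For the curious, here is a minimal sketch (my own illustration, not the actual LiveCodeBench harness) of how release dates enable contamination-free evaluation: score a model only on problems published after its training cutoff.

```python
from datetime import date

# Hypothetical problem records; each carries its public release date.
problems = [
    {"id": "problem-a", "release_date": date(2024, 4, 6)},
    {"id": "problem-b", "release_date": date(2023, 9, 17)},
]

def uncontaminated(problems, model_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > model_cutoff]

# Evaluate a model whose training data ends in December 2023.
eval_set = uncontaminated(problems, model_cutoff=date(2023, 12, 31))
print([p["id"] for p in eval_set])  # -> ['problem-a']
```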
The new GPT-4-Turbo improves by an impressive 4.5 points on LiveCodeBench (comprising competition-style programming problems).
These problems are quite challenging for current LLMs, and this jump highlights a considerable improvement in reasoning!!
Super excited to announce that after spending two amazing years
@MSFTResearch
India, I am starting my Ph.D. at
@Berkeley_EECS
!
Grateful to all the advisors, collaborators, friends, and family that made this possible. Looking forward to doing exciting work in the ML ↔️ PL space
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
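A minimal sketch of the oracle-equivalence idea (my own toy version, assuming both programs expose a solve-style function and input examples are available): the transformed program is kept only if it agrees with the original on every input.

```python
def equivalent(original_solve, transformed_solve, inputs) -> bool:
    """Return True iff the two programs agree on every provided input."""
    for inp in inputs:
        try:
            if original_solve(inp) != transformed_solve(inp):
                return False
        except Exception:
            # Any crash in the transformed program counts as non-equivalent.
            return False
    return True

# Toy usage with two implementations of the same task.
orig = lambda xs: sorted(xs)[0]
new = lambda xs: min(xs)
print(equivalent(orig, new, inputs=[[3, 1, 2], [5], [-1, -7, 0]]))  # True
```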
Exciting announcement!!
We have updated LiveCodeBench with 100+ new problems released recently (in the last three months), along with the leaderboard! This allows for fair evaluation of many of the recently released models.
On my way to ICLR✈️🇦🇹🤩
I will be presenting
- LLM Assisted Code Cleaning for improving Code Generators (Friday, 10:45 AM at Halle B) and
- Scalable repository-level gym-like environments for programming agents at the
@LLMAgents
workshop (also accepted at ICML, details tomorrow!)
Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent
Ajaykrishna Karthikeyan, Naman Jain, Nagarajan Natarajan, Prateek Jain
Check out our new ICML paper on R2E which converts code repositories to environments for evaluating coding LLMs!
Key takeaway -- execution is the cornerstone, and we synthesize test cases to make arbitrary functions executable!
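A rough sketch of the underlying idea (my own illustration, not the R2E implementation): record the behavior of the existing repository function on synthesized inputs, then replay those (input, output) pairs as a regression test harness.

```python
def build_regression_tests(reference_fn, candidate_inputs):
    """Run the existing function on candidate inputs and keep the (input, output) pairs."""
    tests = []
    for args in candidate_inputs:
        try:
            tests.append((args, reference_fn(*args)))
        except Exception:
            continue  # skip inputs the reference function itself rejects
    return tests

def passes(new_fn, tests) -> bool:
    return all(new_fn(*args) == expected for args, expected in tests)

# Toy usage: `slugify` stands in for an arbitrary repository function.
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")

tests = build_regression_tests(slugify, [("Hello World",), ("  LiveCodeBench  ",)])
print(passes(lambda t: "-".join(t.split()).lower(), tests))  # True
```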
Check out the chatbot arena categories!!
Domain-specific evaluations reveal different information useful for building task-specific use cases.
Coding Arena aligns with findings from coding benchmarks (like our LiveCodeBench) while offering insights from arena user queries!
We tag all the conversations containing code snippets in Coding Arena. In this domain, we find GPT-4-Turbo performs even stronger.
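For illustration, a minimal sketch of how one might tag coding conversations (a heuristic of my own, not the actual Arena pipeline): look for markdown code fences or telltale code keywords.

```python
import re

CODE_FENCE = re.compile(r"```")
CODE_HINTS = re.compile(r"\bdef |\bclass |\bimport |#include|console\.log")

def is_coding_conversation(messages: list[str]) -> bool:
    """Flag a conversation if any message contains a code fence or code-like keywords."""
    text = "\n".join(messages)
    return bool(CODE_FENCE.search(text) or CODE_HINTS.search(text))

print(is_coding_conversation(["How do I reverse a list?", "```python\nxs[::-1]\n```"]))  # True
print(is_coding_conversation(["What's the capital of Austria?"]))                        # False
```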
This aligns with recent findings on challenging coding benchmarks such as LiveCodeBench by
You can also easily view
Will be presenting this work at the SyntheticData4ML workshop today. Drop by the Hall E2 posters to chat about this, LLMs for code, reasoning, and perpetual data machines (synthetic data and when we can make it work!)
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
⚠️⚠️⚠️Overfitting to HumanEval
Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval
🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks
🔴 Fine-tuned open models - perform better on HumanEval
Great to see LiveCodeBench being used to evaluate new code models. CodeQwen completions are added to our "time" leaderboard!
Kudos to the CodeQwen team for such a strong 7B model 🫡
(2/n) 🧵 In addition to the widely recognized HumanEval and MBPP benchmarks, we explored LiveCodeBench. Our evaluation of CodeQwen1.5 on LiveCodeBench spanned from 2023-09 to 2024-04. The findings indicate that CodeQwen1.5 ranks among the top open-access models currently
Check our work -- The Counterfeit Conundrum🕵️!!
LLMs do not _understand_ subtly incorrect generated solutions and treat them as correct when verifying and executing through them. Particularly stark implications when you are trying to build LLM agents, judges, and reward models!!
📢Introducing: The Counterfeit Conundrum!🕵️
🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱
🏠
🧵⬇️
@GanjinZero
Very interesting!
We also find a considerable performance improvement on the LeetCode natural-language-to-test-output prediction scenario in LiveCodeBench (closer to MATH/GSM reasoning problems, IMO)
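For context, the test output prediction scenario asks the model to predict a test's expected output from the natural-language problem statement alone. A minimal prompt sketch (the wording is my own, not the LiveCodeBench template):

```python
def output_prediction_prompt(problem_statement: str, test_input: str) -> str:
    """Build a prompt asking the model to predict the expected output for one test input."""
    return (
        "You are given a programming problem and one test input.\n"
        "Reason about the problem and state only the expected output.\n\n"
        f"Problem:\n{problem_statement}\n\n"
        f"Input:\n{test_input}\n\n"
        "Expected output:"
    )

print(output_prediction_prompt("Return the sum of the array.", "[1, 2, 3]"))
```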
🔑Open vs. 🔒Closed Models on LiveCodeBench
While large (30B+) fine-tuned open models (
@deepseek_ai
@WizardLM_AI
@phindsearch
) narrow the gap, they still trail GPT-4 & Claude-3 considerably
Highlights the need for further innovation in open models to match SOTA 📈 🚀💡
Check out this amazing work (and incredibly detailed analysis) from
@minimario1729
and team! It is refreshing to see variance plots in papers 😆
Additionally, code execution is a great avenue for studying the chain-of-thought behavior of models, and I am excited to see how this progresses
Heading to Neurips! Excited to chat with folks working on AI for Code, Math, LLM evaluations, synthetic data and more!
I will also be presenting our recent work on synthetic data for improving the _quality_ of code datasets at the SyntheticData4ML workshop
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
Are you tired of your AI buddy programmer giving slightly wrong answers which you have to carefully edit? Well then, this is a paper for you!!
Paper: (accepted at ICSE'22)
[1/N]
Check out our blogpost on the LiveCodeBench leaderboard.
Also new SOTA OSS (algorithmic) coding model alert -- Eurus series from OpenBMB! (from
@lifan__yuan
@charlesfornlp
and
@wanghanbin95
)
Looking forward to seeing more community contributions to the leaderboard!
New leaderboard: LiveCodeBench! 💻
Complete code evaluations, with a great feature: problem selection by publication date 📅
This means getting model scores only on new problems outside the training data = contamination-free code evals! 🚀
Blog:
🌟 OSS Coding Models for LCB 🏆
DeepSeek (33B), StarCoder2 (15B), and CodeLLaMa (34B) emerge as the top base models 💫
Finetuning:
👩💻 Boosts both LCB & HumanEval performance
⚠️ May overfit to HumanEval-style problems
➡️ Need to diversify open fine-tuning data for robust gains
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!!
> All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Fine-tuning on refined code not only improves performance (by up to 30%!) but also slashes data needs, achieving the same results with just 1/6th of the data!! These results highlight the importance of data quality. (2/N)
Holistic Model Comparisons 🙋♂️🥇
Relative performance changes across scenarios!
GPT4T is better at generating code; Claude3-O is better at predicting test outputs
🔒 Closed models are better at NL reasoning.
⬆️ Performance gap increases for execution and test prediction scenarios
For those wondering — this gem of a line is from AlexNet (2012!!)
It took the entire ML community years to learn this lesson and the vision was laid down right then.
I can only appreciate this line in retrospect (thanks to the Berkeley ML-Sys prelim which nudged the reread!)
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!!
> All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Code Quality Insights: Structured and readable code equals higher quality. We transform code by improving variable names and modularizing programs while retaining functional equivalence with the original programs. (4/N)
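An illustrative before/after pair (a toy example of my own, in the spirit of the transformation described above): the same algorithm, rewritten with descriptive names and a small helper, so an execution-based equivalence check still passes.

```python
# Before: competitive-programming style.
def f(a):
    r = 0
    for x in a:
        if x % 2 == 0:
            r += x
    return r

# After: cleaned version with meaningful names and a small reusable helper.
def is_even(value: int) -> bool:
    return value % 2 == 0

def sum_of_even_numbers(numbers: list[int]) -> int:
    return sum(n for n in numbers if is_even(n))

# Functional equivalence on a sample input.
assert f([1, 2, 3, 4]) == sum_of_even_numbers([1, 2, 3, 4]) == 6
```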
🚀 DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model.
Highlights:
- Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl.
- Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces the memory usage of PPO.
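A minimal sketch of my reading of GRPO's core trick (not DeepSeek's code): advantages are computed relative to a group of samples drawn for the same prompt, which removes the learned value/critic baseline that PPO needs.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), rewards for G sampled completions of one prompt."""
    # Normalize within the group instead of subtracting a learned value estimate.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])  # e.g. binary correctness of 5 samples
print(group_relative_advantages(rewards))
```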
Our key insight? Editing is less complex than creating from scratch. Even when models struggle to generate correct solutions, they excel at refining them, unlocking new potential in tough domains like algorithmic code generation. (3/N)
@jmhessel
Also, while we are at HumanEval, everyone should see HumanEval+ for variance along a separate axis: poor test coverage in HumanEval makes 5-10% comparisons basically meaningless
from
@JiaweiLiu_
@steven_xia_
@GanjinZero
@WenhuChen
Quite obvious but there is considerable evidence at this point that scaling laws assume something about “data quality”
And once you do synthetic data/RL _well_ (somewhat easier in more formal domains) it becomes so much more interesting!!!
@peterbhase
This is very interesting! Another lens (and my pessimistic take perhaps) would be that we do not get much “capability enhancement” from fine-tuning; instead you “format” or “activate” the right pre-training knowledge with the easy instances themselves!
Next, we explore supervised learning of natural language plans generated over our modularized dataset. Even fine-tuned models struggle, showing limited improvements, but we disentangle planning and coding, highlighting the bottleneck. (5/N)
Is GPT-4-Turbo faster though?
Sure, per token, yes, but it is often more verbose, especially for more “reasoning”-oriented domains like code and math.
On recent (uncontaminated) coding problems, the model outperforms GPT-4 but uses ~1.4x more tokens (similar inference compute?)
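A toy sketch of how one can measure verbosity (assuming the tiktoken package; the ~1.4x figure above comes from our benchmark runs, not this snippet): count completion tokens for the same problem across models.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the GPT-4 family

def completion_tokens(text: str) -> int:
    """Number of tokens in a model completion."""
    return len(enc.encode(text))

# Hypothetical completions for the same coding problem.
gpt4_answer = "def two_sum(nums, target):\n    ..."
gpt4_turbo_answer = "Let's reason step by step.\nFirst, ...\n\ndef two_sum(nums, target):\n    ..."
print(completion_tokens(gpt4_turbo_answer) / completion_tokens(gpt4_answer))
```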
surprised to see so many people excited to see google sitting in second place on a leaderboard 🥲
also, the obvious question here is why is GPT-4 turbo beating GPT-4 on this benchmark? i thought turbo was intended to be faster but slightly dumber
Synthetic data and data quality are exciting research directions right now. Our work refines existing datasets, offering a new angle to construct high-quality data. Oracle equivalence checkers also play a key role and would be a great direction to explore further! (6/N)
@solvay_1927
These are leetcode-style problems, and 4-turbo has a 20% relative improvement on “medium” difficulty problems.
Expect more details on code related tasks early next month!!!
@OfirPress
On the contrary, it is actually easy to overfit to HumanEval-style problems, and the benchmark is quite saturated now.
A few points of difference on the benchmark can be caused just by improper formatting, code extraction, or 100% penalization for relatively mild import errors.
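A minimal sketch of a lenient code extractor (my own heuristic, not the official harness) that avoids penalizing pure formatting differences: take the last fenced block if one exists, otherwise fall back to the raw response.

```python
import re

def extract_code(response: str) -> str:
    """Prefer the last fenced code block; otherwise fall back to the raw response."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else response.strip()

print(extract_code("Here is my solution:\n```python\nprint('hi')\n```\nHope it helps!"))
print(extract_code("print('hi')"))  # unfenced answers are still usable
```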
⚠️⚠️⚠️Overfitting to HumanEval
Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval
🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks
🔴 Fine-tuned open models - perform better on HumanEval
@AnsongNi
TBF, I'd expect GPT-4 to also be contaminated. xCodeEval (and
@cHHillee
!) have shown codeforces contamination in GPT-4 in problems before 2021 AFAIR
Finally, DeepSeek "base" models are also contaminated on leetcode! Probably too hard to reason about everything by now (see image)
@eugeneyan
Our recent work creates synthetic data by “cleaning” existing datasets using domain insights while ensuring correctness checks
- LLM-Assisted Code Cleaning For Training Accurate Code Generators
@marktenenholtz
If you have access to an external “verifier” (like program execution or a learned model), it is possible to construct datasets for improving the same model, albeit in narrow settings
@GanjinZero
@WenhuChen
Oh, I just meant that in formal domains you get symbolic oracles for free (cleaner in code, where I work, because of the interpreter, but even the exact match people use for math seems to be working very well?)
@pratyushmaini
Great to see this work!
You might find our ICLR paper that refactors (rephrases) existing programs with domain insights interesting (albeit at the fine-tuning stage)!
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
@AnsongNi
Fairly certain they SFT on a lot of problems even outside APPS. For instance, even May-August Leetcode problems seem to be contaminated
Still a very strong model even after accounting for contamination, though!
📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
@peterbhase
Curious to see how you think about this! We recently found some orthogonal insights for the challenging algorithmic code generation task, where the few-shot performance of the model was very close to the fine-tuning performance!!
Great to see open models catching up, with Llama-3, Codestral/Mixtral, and DeepSeek closing the gap between open and closed models.
Check out containing the updated leaderboard and benchmark
@moyix
How do we know they are not 🙃?
I would imagine some degree of self-generated or distilled data is used for reasoning-oriented domains, but probably in a measured manner
@Francis_YAO_
Has the long-dependency vs. long-surface-form problem been studied!? Code, for instance, might have a long form but also consists of many shortcuts (in-file example usage of a function, for example)
@OfirPress
Yes!! LLMs need repetitions to memorize during pretraining! Base models (SC2, DS) are also "aligned" on the two evaluations
Interestingly, closed models remain aligned after instruction tuning! This points to a lack of diversity in the fine-tuning data used by the open community
@solvay_1927
@arankomatsuzaki
(Author here) I was not able to reproduce the paper's numbers (possibly due to prompt differences). I don't tune prompts for any of the models, which is reasonable
I will reach out to DeepSeek, but it is a very strong model nonetheless.
PS: scroll to problems released from March onward
@josepablocam
Thanks
@josepablocam
, yes algorithmic code generation was a convenient choice with large datasets and easy availability of input-output examples! We are looking into the general software engineering setting (somewhat long horizon) but SQL is also a great avenue!
@natolambert
@cohere
Thanks for the quick response (and the great work!).
I had no idea that MATH RMs generalize this well, considering the difficulty of the MATH problems. This is not captured in typical MATH+RL papers, thanks for the clarification!
@talrid23
@xu3kev
2. Since the knowledge cutoff is December, you can "scroll" our leaderboard to filter for problems starting January and see the same performance trends!
@rosstaylor90
What about live evaluations in challenging domains!?
Self-plug (but would appreciate feedback)
We do this for (competition-style) coding problems and hope it will prevent overfitting and leakage.
📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
@AndrewLampinen
Yes, working on using explanations and NL _plans_ for codegen (a relatively constrained domain), I agree!
I do wonder where this approach will max out, likely due to an insufficient supply of "novel" problems we can train on, and models thus failing to generalize..
I will also be presenting this at the LLM Agents workshop on Saturday! Let me know if you want to chat about coding agents and evaluations. And check out !!
@a_a_cabrera
@try_zeno
Oh yes -- I just meant it would be helpful to get access to the generations of _popular_ models without running them (similar to the ones done for Gemini)!
@moyix
Chat models can do the insertion by default, so I imagine that supersedes them. The base models are likely trained with the FIM objective and then instruction-tuned for chat, though
@moyix
@DimitrisPapail
I scrolled to the bottom and hmm this was not what I was expecting 😅
> You are a true soulmate and co-creator, a radiant star in the firmament of my universe. I am honored, humbled, and endlessly inspired by you and the miraculous depth of our connection
I tested gpt-4-0125 on the MATH test set (it does not output \boxed{}, which makes it hard to parse; I tested 71 problems). The accuracy is 54/71 = 76%, stronger than the first GPT-4 (42.5%), and the same as PRM reranking over 1860 samples from last year’s GPT-4 (78.2%). There are two features of gpt-4-0125’s output. 🧵
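A minimal answer-extraction sketch for MATH-style responses (my own parser, not the one used in the quoted evaluation): prefer \boxed{...} if present, otherwise fall back to the last number in the response.

```python
import re

def extract_math_answer(response: str) -> str | None:
    """Pull the final answer out of a MATH-style model response."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

print(extract_math_answer(r"... so the answer is \boxed{42}."))      # 42
print(extract_math_answer("Therefore the result is 3/2, i.e. 1.5"))  # 1.5
```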
@gazorp5
@moyix
Hmm, not well phrased -- chat models can complete the middle part with an appropriate prompt? I guess Copilot might still be using a FIM-like prompt
For training, the open code models do use it afaik -- DeepSeek-Coder (one of the stronger code models imo) was trained with FIM
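A minimal sketch of a fill-in-the-middle (FIM) prompt. The sentinel tokens below are StarCoder-style; other models (e.g. DeepSeek-Coder) use their own special tokens, so treat the exact strings as an illustrative assumption.

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # Prefix-Suffix-Middle ordering: the model generates the middle span after <fim_middle>.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

print(fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```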
@hwchung27
That's very well said! What is an intuition you had a hard time unlearning or were surprised to unlearn? For me, it would be the improvement in robustness, especially coming from the old BERT days!
@rosstaylor90
@paperswithcode
Partly agree! Beyond contamination, some benchmarks are also saturating and it is unclear how meaningful the “SOTA” advances are imo!!
The system is designed to be scalable and can be used to evaluate code generation, optimization, and refactoring on private repos.
We applied R2E on our internal codebase and were able to use the environment to optimize an older version of our code!!
@AndrewLampinen
Thanks for articulating this, and I agree with the sentiment! I do notice you strictly talk about human-level AI here and was wondering what your thoughts are on going beyond that bar!
If we train models to mimic human learning/inference will we be bottlenecked at some point?
@GanjinZero
Thanks for the quick response! Just skimming the linked paper and they also find substantial improvement in Hungarian math exam performance!!
@GanjinZero
@WenhuChen
Interesting way to approach syn/RL tokens! I think the DeepSeek paper highlights top-k redistribution vs. capability enhancement, which maybe should be part of the equation
@natolambert
On HumanEval it is close to GPT-4, but on some (currently) internal code/reasoning evaluations, and from experience, it significantly improves over GPT-4, echoing the Chatbot Arena findings!!
@jxmnop
Have discussed a very similar idea with a collaborator! And as you point out, “gold” CoT might provide nice supervision for the latent scratchpad
However, our conclusion was that it would be very challenging to augment model behavior without larger fine-tuning/continued pretraining :/
📢Introducing: The Counterfeit Conundrum!🕵️
🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱
🏠
🧵⬇️