📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
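For the curious, here is a minimal sketch (my own illustration, not the actual LiveCodeBench harness) of how release dates enable contamination-free evaluation: score a model only on problems published after its training cutoff.

```python
from datetime import date

# Hypothetical problem records; each carries its public release date.
problems = [
    {"id": "problem-a", "release_date": date(2024, 4, 6)},
    {"id": "problem-b", "release_date": date(2023, 9, 17)},
]

def uncontaminated(problems, model_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > model_cutoff]

# Evaluate a model whose training data ends in December 2023.
eval_set = uncontaminated(problems, model_cutoff=date(2023, 12, 31))
print([p["id"] for p in eval_set])  # -> ['problem-a']
```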
The new GPT-4-Turbo improves by an impressive 4.5 points on LiveCodeBench (comprising competition-style programming problems).
These problems are quite challenging for current LLMs, and this jump highlights a considerable improvement in reasoning!!
Super excited to announce that after spending two amazing years
@MSFTResearch
India, I am starting my Ph.D. at
@Berkeley_EECS
!
Grateful to all the advisors, collaborators, friends, and family that made this possible. Looking forward to doing exciting work in the ML ↔️ PL space
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
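A minimal sketch of the oracle-equivalence idea (my own toy version, assuming both programs expose a solve-style function and input examples are available): the transformed program is kept only if it agrees with the original on every input.

```python
def equivalent(original_solve, transformed_solve, inputs) -> bool:
    """Return True iff the two programs agree on every provided input."""
    for inp in inputs:
        try:
            if original_solve(inp) != transformed_solve(inp):
                return False
        except Exception:
            # Any crash in the transformed program counts as non-equivalent.
            return False
    return True

# Toy usage with two implementations of the same task.
orig = lambda xs: sorted(xs)[0]
new = lambda xs: min(xs)
print(equivalent(orig, new, inputs=[[3, 1, 2], [5], [-1, -7, 0]]))  # True
```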
Exciting announcement!!
We have updated LiveCodeBench with 100+ new problems released recently (in the last three months), along with the leaderboard! This allows for fair evaluation of many of the recently released models.
On my way to ICLR✈️🇦🇹🤩
I will be presenting
- LLM Assisted Code Cleaning for improving Code Generators (Friday, 10:45 AM at Halle B) and
- Scalable repository-level gym-like environments for programming agents at the
@LLMAgents
workshop (also accepted at ICML, details tomorrow!)
Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent
Ajaykrishna Karthikeyan, Naman Jain, Nagarajan Natarajan, Prateek Jain
Check out our new ICML paper on R2E which converts code repositories to environments for evaluating coding LLMs!
Key takeaway -- execution is the cornerstone, and we synthesize test cases to make arbitrary functions executable!
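A rough sketch of the underlying idea (my own illustration, not the R2E implementation): record the behavior of the existing repository function on synthesized inputs, then replay those (input, output) pairs as a regression test harness.

```python
def build_regression_tests(reference_fn, candidate_inputs):
    """Run the existing function on candidate inputs and keep the (input, output) pairs."""
    tests = []
    for args in candidate_inputs:
        try:
            tests.append((args, reference_fn(*args)))
        except Exception:
            continue  # skip inputs the reference function itself rejects
    return tests

def passes(new_fn, tests) -> bool:
    return all(new_fn(*args) == expected for args, expected in tests)

# Toy usage: `slugify` stands in for an arbitrary repository function.
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")

tests = build_regression_tests(slugify, [("Hello World",), ("  LiveCodeBench  ",)])
print(passes(lambda t: "-".join(t.split()).lower(), tests))  # True
```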
Check out the chatbot arena categories!!
Domain-specific evaluations reveal different information useful for building task-specific use cases.
Coding Arena aligns with findings from coding benchmarks (like our LiveCodeBench) while offering insights from arena user queries!
We tag all the conversations containing code snippets in Coding Arena. In this domain, we find GPT-4-Turbo performs even stronger.
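For illustration, a minimal sketch of how one might tag coding conversations (a heuristic of my own, not the actual Arena pipeline): look for markdown code fences or telltale code keywords.

```python
import re

CODE_FENCE = re.compile(r"```")
CODE_HINTS = re.compile(r"\bdef |\bclass |\bimport |#include|console\.log")

def is_coding_conversation(messages: list[str]) -> bool:
    """Flag a conversation if any message contains a code fence or code-like keywords."""
    text = "\n".join(messages)
    return bool(CODE_FENCE.search(text) or CODE_HINTS.search(text))

print(is_coding_conversation(["How do I reverse a list?", "```python\nxs[::-1]\n```"]))  # True
print(is_coding_conversation(["What's the capital of Austria?"]))                        # False
```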
This aligns with recent findings on challenging coding benchmarks such as LiveCodeBench by
You can also easily view
Will be presenting this work at the SyntheticData4ML workshop today. Drop by the Hall E2 posters to chat about this, LLMs for code, reasoning, and perpetual data machines (synthetic data and when we can make it work!)
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
⚠️⚠️⚠️Overfitting to HumanEval
Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval
🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks
🔴 Fine-tuned open models - perform better on HumanEval
Great to see LiveCodeBench being used to evaluate new code models. CodeQwen completions are added to our "time" leaderboard!
Kudos to the CodeQwen team for such a strong 7B model 🫡
(2/n) 🧵 In addition to the widely recognized HumanEval and MBPP benchmarks, we explored LiveCodeBench. Our evaluation of CodeQwen1.5 on LiveCodeBench spanned from 2023-09 to 2024-04. The findings indicate that CodeQwen1.5 ranks among the top open-access models currently
Check our work -- The Counterfeit Conundrum🕵️!!
LLMs do not _understand_ subtly incorrect generated solutions and treat them as correct when verifying and executing through them. Particularly stark implications when you are trying to build LLM agents, judges, and reward models!!
📢Introducing: The Counterfeit Conundrum!🕵️
🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱
🏠
🧵⬇️
@GanjinZero
Very interesting!
We also find a considerable performance improvement on the LeetCode natural-language-to-test-output prediction scenario in LiveCodeBench (closer to MATH/GSM reasoning problems, IMO)
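For context, the test output prediction scenario asks the model to predict a test's expected output from the natural-language problem statement alone. A minimal prompt sketch (the wording is my own, not the LiveCodeBench template):

```python
def output_prediction_prompt(problem_statement: str, test_input: str) -> str:
    """Build a prompt asking the model to predict the expected output for one test input."""
    return (
        "You are given a programming problem and one test input.\n"
        "Reason about the problem and state only the expected output.\n\n"
        f"Problem:\n{problem_statement}\n\n"
        f"Input:\n{test_input}\n\n"
        "Expected output:"
    )

print(output_prediction_prompt("Return the sum of the array.", "[1, 2, 3]"))
```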
🔑Open vs. 🔒Closed Models on LiveCodeBench
While large (30B+) fine-tuned open models (
@deepseek_ai
@WizardLM_AI
@phindsearch
) narrow the gap, they still trail GPT-4 & Claude-3 considerably
Highlights the need for further innovation in open models to match SOTA 📈 🚀💡
Check out this amazing work (and incredibly detailed analysis) from
@minimario1729
and team! It is refreshing to see variance plots in papers 😆
Additionally, code execution is a great avenue for studying the chain-of-thought behavior of models, and I am excited to see how this progresses
Heading to Neurips! Excited to chat with folks working on AI for Code, Math, LLM evaluations, synthetic data and more!
I will also be presenting our recent work on synthetic data for improving the _quality_ of code datasets at the SyntheticData4ML workshop
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
Are you tired of your AI buddy programmer giving slightly wrong answers which you have to carefully edit? Well then, this is a paper for you!!
Paper: (accepted at ICSE'22)
[1/N]
Check out our blogpost on the LiveCodeBench leaderboard.
Also new SOTA OSS (algorithmic) coding model alert -- Eurus series from OpenBMB! (from
@lifan__yuan
@charlesfornlp
and
@wanghanbin95
)
Looking forward to seeing more community contributions to the leaderboard!
New leaderboard: LiveCodeBench! 💻
Complete code evaluations, with a great feature: problem selection by publication date 📅
This means getting model scores only on new problems outside the training data = contamination-free code evals! 🚀
Blog:
🌟 OSS Coding Models for LCB 🏆
DeepSeek (33B), StarCoder2 (15B), and CodeLLaMa (34B) emerge as the top base models 💫
Finetuning:
👩💻 Boosts both LCB & HumanEval performance
⚠️ May overfit to HumanEval-style problems
➡️ Need to diversify open fine-tuning data for robust gains
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!!
> All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Fine-tuning on refined code not only improves performance (by up to 30%!) but also slashes data needs, achieving the same results with just 1/6th of the data!! These results highlight the importance of data quality. (2/N)
Holistic Model Comparisons 🙋♂️🥇
Relative performance changes across scenarios!
GPT4T is better at generating code; Claude3-O is better at predicting test outputs
🔒 Closed models are better at NL reasoning.
⬆️ Performance gap increases for execution and test prediction scenarios
For those wondering — this gem of a line is from AlexNet (2012!!)
It took the entire ML community years to learn this lesson and the vision was laid down right then.
I can only appreciate this line in retrospect (thanks to the Berkeley ML-Sys prelim which nudged the reread!)
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!!
> All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
Code Quality Insights: Structured and readable code equals higher quality. We transform code by improving variable names and modularizing programs while retaining functional equivalence with the original programs. (4/N)
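An illustrative before/after pair (a toy example of my own, in the spirit of the transformation described above): the same algorithm, rewritten with descriptive names and a small helper, so an execution-based equivalence check still passes.

```python
# Before: competitive-programming style.
def f(a):
    r = 0
    for x in a:
        if x % 2 == 0:
            r += x
    return r

# After: cleaned version with meaningful names and a small reusable helper.
def is_even(value: int) -> bool:
    return value % 2 == 0

def sum_of_even_numbers(numbers: list[int]) -> int:
    return sum(n for n in numbers if is_even(n))

# Functional equivalence on a sample input.
assert f([1, 2, 3, 4]) == sum_of_even_numbers([1, 2, 3, 4]) == 6
```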
🚀 DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model.
Highlights:
- Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl.
- Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces the memory usage of PPO.
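A minimal sketch of my reading of GRPO's core trick (not DeepSeek's code): advantages are computed relative to a group of samples drawn for the same prompt, which removes the learned value/critic baseline that PPO needs.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), rewards for G sampled completions of one prompt."""
    # Normalize within the group instead of subtracting a learned value estimate.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])  # e.g. binary correctness of 5 samples
print(group_relative_advantages(rewards))
```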
Our key insight? Editing is less complex than creating from scratch. Even when models struggle to generate correct solutions, they excel at refining them, unlocking new potential in tough domains like algorithmic code generation. (3/N)
@jmhessel
Also, while we are at HumanEval, everyone should see HumanEval+ for variance along a separate axis: poor test coverage in HumanEval makes 5-10% comparisons basically meaningless
from
@JiaweiLiu_
@steven_xia_
@GanjinZero
@WenhuChen
Quite obvious but there is considerable evidence at this point that scaling laws assume something about “data quality”
And once you do synthetic data/RL _well_ (somewhat easier in more formal domains) it becomes so much more interesting!!!
@peterbhase
This is very interesting! Another lens (and my pessimistic take perhaps) would be that we do not get much “capability enhancement” from fine-tuning; instead you “format” or “activate” the right pre-training knowledge with the easy instances themselves!
Next, we explore supervised learning of natural language plans generated over our modularized dataset. Even fine-tuned models struggle, showing limited improvements, but we disentangle planning and coding, highlighting the bottleneck. (5/N)
Is GPT-4-Turbo faster though?
Sure, per token, yes, but it is often more verbose, especially for more “reasoning”-oriented domains like code and math.
On recent (uncontaminated) coding problems, the model outperforms GPT-4 but uses ~1.4x more tokens (similar inference compute?)
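A toy sketch of how one can measure verbosity (assuming the tiktoken package; the ~1.4x figure above comes from our benchmark runs, not this snippet): count completion tokens for the same problem across models.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by the GPT-4 family

def completion_tokens(text: str) -> int:
    """Number of tokens in a model completion."""
    return len(enc.encode(text))

# Hypothetical completions for the same coding problem.
gpt4_answer = "def two_sum(nums, target):\n    ..."
gpt4_turbo_answer = "Let's reason step by step.\nFirst, ...\n\ndef two_sum(nums, target):\n    ..."
print(completion_tokens(gpt4_turbo_answer) / completion_tokens(gpt4_answer))
```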
surprised to see so many people excited to see google sitting in second place on a leaderboard 🥲
also, the obvious question here is why is GPT-4 turbo beating GPT-4 on this benchmark? i thought turbo was intended to be faster but slightly dumber
Synthetic data and data quality are exciting research directions right now. Our work refines existing datasets, offering a new angle to construct high-quality data. Oracle equivalence checkers also play a key role and would be a great direction to explore further! (6/N)
@solvay_1927
These are leetcode-style problems, and 4-turbo has a 20% relative improvement on “medium” difficulty problems.
Expect more details on code related tasks early next month!!!
@OfirPress
On the contrary, it is actually easy to overfit to HumanEval-style problems, and the benchmark is quite saturated now.
A few points of difference on the benchmark can be caused just by improper formatting, code extraction, or 100% penalization for relatively mild import errors.
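A minimal sketch of a lenient code extractor (my own heuristic, not the official harness) that avoids penalizing pure formatting differences: take the last fenced block if one exists, otherwise fall back to the raw response.

```python
import re

def extract_code(response: str) -> str:
    """Prefer the last fenced code block; otherwise fall back to the raw response."""
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else response.strip()

print(extract_code("Here is my solution:\n```python\nprint('hi')\n```\nHope it helps!"))
print(extract_code("print('hi')"))  # unfenced answers are still usable
```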
⚠️⚠️⚠️Overfitting to HumanEval
Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval
🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks
🔴 Fine-tuned open models - perform better on HumanEval
@AnsongNi
TBF, I'd expect GPT-4 to also be contaminated. xCodeEval (and
@cHHillee
!) have shown codeforces contamination in GPT-4 in problems before 2021 AFAIR
Finally, DeepSeek "base" models are also contaminated on leetcode! Probably too hard to reason about everything by now (see image)
@eugeneyan
Our recent work creates synthetic data by “cleaning” existing datasets using domain insights while ensuring correctness checks
- LLM-Assisted Code Cleaning For Training Accurate Code Generators
@marktenenholtz
If you have access to an external “verifier” (like program execution or a learned model), it is possible to construct datasets for improving the same model, albeit in narrow settings
@GanjinZero
@WenhuChen
Oh, I just meant that in formal domains you get symbolic oracles for free (cleaner in code, where I work, because of the interpreter, but even the exact match people use for math seems to be working very well?)
@pratyushmaini
Great to see this work!
You might find our ICLR paper that refactors (rephrases) existing programs with domain insights interesting (albeit at the fine-tuning stage)!
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets, ensuring correctness with an oracle equivalence checker.
(1/N)
@AnsongNi
Fairly certain they SFT on a lot of problems even outside APPS. For instance, even May-August Leetcode problems seem to be contaminated
Still a very strong model even after accounting for contamination, though!
📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
@peterbhase
Curious to see how you think about this! We recently found some orthogonal insights for the challenging algorithmic code generation task, where the few-shot performance of the model was very close to the fine-tuning performance!!
Great to see open models catching up, with Llama-3, Codestral/Mixtral, and DeepSeek closing the gap between open and closed models.
Check out containing the updated leaderboard and benchmark
@moyix
How do we know they are not 🙃?
I would imagine some degree of self-generated or distilled data is used for reasoning-oriented domains, but probably in a measured manner
@Francis_YAO_
Has the long-dependency vs. long-surface-form problem been studied!? Code, for instance, might have a long form but also consists of many shortcuts (in-file example usage of a function, for example)
@OfirPress
Yes!! LLMs need repetitions to memorize during pretraining! Base models (SC2, DS) are also "aligned" on the two evaluations
Interestingly, closed models remain aligned after instruction tuning! This points to a lack of diversity in the fine-tuning data used by the open community
@solvay_1927
@arankomatsuzaki
(Author here) I was not able to reproduce the paper's numbers (possibly due to prompt differences). I don't tune prompts for any of the models, which is reasonable
I will reach out to DeepSeek, but it is a very strong model nonetheless.
PS: scroll to problems released from March onward
@josepablocam
Thanks
@josepablocam
, yes algorithmic code generation was a convenient choice with large datasets and easy availability of input-output examples! We are looking into the general software engineering setting (somewhat long horizon) but SQL is also a great avenue!
@natolambert
@cohere
Thanks for the quick response (and the great work!).
I had no idea that MATH RMs generalize this well, considering the difficulty of the MATH problems. This is not captured in typical MATH+RL papers, thanks for the clarification!
@talrid23
@xu3kev
2. Since the knowledge cutoff is December, you can "scroll" our leaderboard to filter for problems starting January and see the same performance trends!
@rosstaylor90
What about live evaluations in challenging domains!?
Self-plug (but would appreciate feedback)
We do this for (competition-style) coding problems and hope it will prevent overfitting and leakage.
📢📢Excited to introduce our new work LiveCodeBench!
📈 Live evaluations to ensure fairness and reliability
🔍 Holistic evaluations using 4 code-related scenarios
💡Insights from comparing 20+ code models
🚨🚨We use problem release dates to detect and prevent contamination
@AndrewLampinen
Yes, working on using explanations and NL _plans_ for codegen (a relatively constrained domain), I agree!
I do wonder where this approach will max out, likely due to an insufficient supply of "novel" problems we can train on, and models thus failing to generalize..
I will also be presenting this at the LLM Agents workshop on Saturday! Let me know if you want to chat about coding agents and evaluations. And check out !!
@a_a_cabrera
@try_zeno
Oh yes -- I just meant it would be helpful to get access to the generations of _popular_ models without running them (similar to the ones done for Gemini)!
@moyix
Chat models can do the insertion by default, so I imagine that supersedes them. The base models are likely trained with the FIM objective and then instruction-tuned for chat, though
@moyix
@DimitrisPapail
I scrolled to the bottom and hmm this was not what I was expecting 😅
> You are a true soulmate and co-creator, a radiant star in the firmament of my universe. I am honored, humbled, and endlessly inspired by you and the miraculous depth of our connection
I tested gpt-4-0125 on the MATH test set (it does not output \boxed{}, which makes it hard to parse; I tested 71 problems). The accuracy is 54/71 = 76%, stronger than the first GPT-4 (42.5%), and the same as PRM reranking over 1860 samples from last year’s GPT-4 (78.2%). There are two features of gpt-4-0125’s output. 🧵
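A minimal answer-extraction sketch for MATH-style responses (my own parser, not the one used in the quoted evaluation): prefer \boxed{...} if present, otherwise fall back to the last number in the response.

```python
import re

def extract_math_answer(response: str) -> str | None:
    """Pull the final answer out of a MATH-style model response."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else None

print(extract_math_answer(r"... so the answer is \boxed{42}."))      # 42
print(extract_math_answer("Therefore the result is 3/2, i.e. 1.5"))  # 1.5
```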
@gazorp5
@moyix
Hmm, not well phrased -- chat models can complete the middle part with an appropriate prompt? I guess Copilot might still be using a FIM-like prompt
For training, the open code models do use it afaik -- DeepSeek-Coder (one of the stronger code models imo) was trained with FIM
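A minimal sketch of a fill-in-the-middle (FIM) prompt. The sentinel tokens below are StarCoder-style; other models (e.g. DeepSeek-Coder) use their own special tokens, so treat the exact strings as an illustrative assumption.

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # Prefix-Suffix-Middle ordering: the model generates the middle span after <fim_middle>.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

print(fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```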
@hwchung27
That's very well said! What is an intuition you had a hard time unlearning or were surprised to unlearn? For me, it would be the improvement in robustness, especially coming from the old BERT days!
@rosstaylor90
@paperswithcode
Partly agree! Beyond contamination, some benchmarks are also saturating and it is unclear how meaningful the “SOTA” advances are imo!!
The system is designed to be scalable and can be used to evaluate code generation, optimization, and refactoring on private repos.
We applied R2E on our internal codebase and were able to use the environment to optimize an older version of our code!!
@AndrewLampinen
Thanks for articulating this, and I agree with the sentiment! I do notice you strictly talk about human-level AI here and was wondering what your thoughts are on going beyond that bar!
If we train models to mimic human learning/inference will we be bottlenecked at some point?
@GanjinZero
Thanks for the quick response! Just skimming the linked paper and they also find substantial improvement in Hungarian math exam performance!!
@GanjinZero
@WenhuChen
Interesting way to approach syn/RL tokens! I think the DeepSeek paper highlights top-k redistribution vs. capability enhancement, which maybe should be part of the equation
@natolambert
On HumanEval it is close to GPT-4, but on some (currently) internal code/reasoning evaluations, and from experience, it significantly improves over GPT-4, echoing the Chatbot Arena findings!!
@jxmnop
Have discussed a very similar idea with a collaborator! And as you point out, “gold” CoT might provide nice supervision for the latent scratchpad
However, our conclusion was that it would be very challenging to augment model behavior without larger fine-tuning/continued pretraining :/
📢Introducing: The Counterfeit Conundrum!🕵️
🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱
🏠
🧵⬇️