Naman Jain Profile
Naman Jain

@StringChaos

1,210
Followers
1,079
Following
23
Media
316
Statuses

Research Intern @MetaAI | CS PhD @UCBerkeley | Projects - R2E, LiveCodeBench, Chatbot-Arena Coding, RAFT, Code Cleaning | Past: @AWS @MSFTResearch @iitbombay

Seattle
Joined March 2018
Pinned Tweet
@StringChaos
Naman Jain
8 months
📢📢Excited to introduce our new work LiveCodeBench! 📈 Live evaluations to ensure fairness and reliability 🔍 Holistic evaluations using 4 code-related scenarios 💡Insights from comparing 20+ code models 🚨🚨We use problem release dates to detect and prevent contamination
Tweet media one
9
45
208
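The contamination-prevention mechanism is simple enough to sketch: every problem carries a release date, and a model is scored only on problems released after its training cutoff. A minimal illustration in Python; the record fields and dates below are hypothetical, not LiveCodeBench's actual schema:

```python
from datetime import date

# Hypothetical problem records; the real LiveCodeBench schema differs.
problems = [
    {"id": "lc-3105", "release_date": date(2024, 3, 2)},
    {"id": "lc-2871", "release_date": date(2023, 11, 18)},
]

def uncontaminated(problems, model_cutoff):
    """Keep only problems released after the model's training cutoff,
    so their statements and solutions cannot be in the training data."""
    return [p for p in problems if p["release_date"] > model_cutoff]

# A model with a 2024-01-01 knowledge cutoff is scored on newer problems only.
eval_set = uncontaminated(problems, model_cutoff=date(2024, 1, 1))
print([p["id"] for p in eval_set])  # ['lc-3105']
```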
@StringChaos
Naman Jain
7 months
The new GPT-4-Turbo improves an impressive 4.5 points on LiveCodeBench (comprising competition-style programming problems). These problems are quite challenging for current LLMs and this improvement highlights a considerable improvement in reasoning!!
Tweet media one
@polynoamial
Noam Brown
7 months
GPT-4 reasoning has been further improved
24
32
507
8
50
318
@StringChaos
Naman Jain
2 years
Super excited to announce that after spending two amazing years @MSFTResearch India, I am starting my Ph.D. at @Berkeley_EECS ! Grateful to all the advisors, collaborators, friends, and family that made this possible. Look forward to doing exciting work in the ML ↔️ PL space
8
2
213
@StringChaos
Naman Jain
11 months
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets ensuring correctness with an oracle equivalence checker. (1/N)
Tweet media one
4
33
208
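The oracle relies on algorithmic problems shipping with input-output examples: a transformed program is kept only if it matches the original program's outputs on all available inputs. A rough sketch of that check, assuming both programs are stdin-to-stdout Python scripts (the file names and test input are illustrative):

```python
import subprocess

def run(program_path, stdin_text, timeout=10):
    """Execute a stdin/stdout Python program and capture its output."""
    result = subprocess.run(
        ["python", program_path], input=stdin_text,
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def equivalent(original, transformed, test_inputs):
    """Oracle check: the transformed program must reproduce the original
    program's output on every available test input."""
    return all(run(original, t) == run(transformed, t) for t in test_inputs)

# Keep the LLM-cleaned program only if it passes the equivalence oracle.
if equivalent("orig_solution.py", "cleaned_solution.py", ["3\n1 2 3\n"]):
    print("behavior preserved; keep the transformed program")
```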
@StringChaos
Naman Jain
5 months
Exciting announcement!! We have updated LiveCodeBench with 100+ new problems released recently (in the last three months), along with the leaderboard! This allows fair evaluation of many of the recently released models
Tweet media one
2
12
77
@StringChaos
Naman Jain
6 months
On my way to ICLR✈️🇦🇹🤩 I will be presenting - LLM Assisted Code Cleaning for improving Code Generators (Friday, 10:45 AM at Halle B) and - Scalable repository-level gym-like environments for programming agents at the @LLMAgents workshop (also accepted at ICML, details tomorrow!)
3
6
56
@StringChaos
Naman Jain
2 years
Excited to announce that our work on learning decision trees from bandit feedback has been accepted to TMLR!
@TmlrPub
Accepted papers at TMLR
2 years
Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent Ajaykrishna Karthikeyan, Naman Jain, Nagarajan Natarajan, Prateek Jain
0
2
9
1
2
43
@StringChaos
Naman Jain
6 months
Check out our new ICML paper on R2E which converts code repositories to environments for evaluating coding LLMs! Key takeaway -- execution is the cornerstone and we synthesize test cases for making arbitrary functions executable!
@slimshetty_
Manish Shetty
6 months
Want to turn your own GitHub Repos into a playground for 🤖 coding agents? 📢📢 Introducing R2E: Repository to Environment 📈 Scalable, dynamic, real-world repo-level benchmarks 💡 Generate Equivalence Tests Harnesses 🔗 | Accepted @ ICML '24 🧵
Tweet media one
3
27
155
1
8
41
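The "execution is the cornerstone" point is easy to make concrete: given an arbitrary repository function, a harness runs a candidate implementation against the original on the same inputs and compares outputs. A toy sketch; the functions and inputs are invented here, and R2E's actual test synthesis is LLM-driven:

```python
def equivalence_harness(reference_fn, candidate_fn, inputs):
    """Execute reference and candidate on the same synthesized inputs;
    report the first diverging input, if any."""
    for args in inputs:
        if candidate_fn(*args) != reference_fn(*args):
            return False, args
    return True, None

# Toy example: a repo utility and an agent-proposed rewrite.
def slugify(s):       # original repo function
    return "-".join(s.lower().split())

def slugify_v2(s):    # candidate rewrite from a coding agent
    return s.lower().replace(" ", "-")

ok, diverging = equivalence_harness(slugify, slugify_v2,
                                    [("Hello World",), ("a  b",)])
print(ok, diverging)  # False on ('a  b',): repeated spaces diverge
```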
@StringChaos
Naman Jain
7 months
Check out the chatbot arena categories!! Domain-specific evaluations reveal different information useful for building task-specific use cases. Coding Arena aligns with findings from coding benchmarks (like our LiveCodeBench) while offering insights from arena user queries!
@lmarena_ai
lmarena.ai (formerly lmsys.org)
7 months
We tag all the conversations containing code snippets in Coding Arena. In this domain, we find GPT-4-Turbo performs even stronger. This aligns with recent findings in challenging coding benchmarks such as LiveCodeBench. You can also easily view
Tweet media one
3
10
115
0
8
36
@StringChaos
Naman Jain
11 months
Will be presenting this work at the SyntheticData4ML workshop today. Drop by Hall E2 posters to chat about this, LLMs for code, reasoning, perpetual data machines (synthetic data and when can we make it work!)
@StringChaos
Naman Jain
11 months
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets ensuring correctness with an oracle equivalence checker. (1/N)
Tweet media one
4
33
208
0
6
36
@StringChaos
Naman Jain
8 months
⚠️⚠️⚠️Overfitting to HumanEval Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval 🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks 🔴 Fine-tuned open models - perform better on HumanEval
Tweet media one
1
4
35
@StringChaos
Naman Jain
7 months
Great to see LiveCodeBench being used to evaluate new code models. CodeQwen completions are added to our "time" leaderboard! Kudos to the CodeQwen team for such a strong 7B model 🫡
Tweet media one
@huybery
Binyuan Hui
7 months
(2/n) 🧵 In addition to the widely recognized HumanEval and MBPP benchmarks, we explored LiveCodeBench. Our evaluation of CodeQwen1.5 on LiveCodeBench spanned from 2023-09 to 2024-04. The findings indicate that CodeQwen1.5 ranks among the top open-access models currently
Tweet media one
1
1
22
0
5
34
@StringChaos
Naman Jain
8 months
Check our work -- The Counterfeit Conundrum🕵️!! LLMs do not _understand_ subtly incorrect generated solutions and treat them as correct when verifying and executing through them. Particularly stark implications when you are trying to build LLM agents, judges, and reward models!!
@minimario1729
Alex Gu
8 months
📢Introducing: The Counterfeit Conundrum!🕵️ 🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱 🏠 🧵⬇️
Tweet media one
1
18
94
0
7
30
@StringChaos
Naman Jain
7 months
@GanjinZero Very interesting! We also find considerable performance improvement on the Leetcode NL-to-test-output-prediction scenario in LiveCodeBench (closer to MATH/GSM reasoning problems, IMO)
@xu3kev
Wen-Ding Li
7 months
A big jump in math/reasoning for our coding benchmark 🤯
Tweet media one
26
135
909
1
5
28
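For readers unfamiliar with the scenario: in test output prediction, the model is shown a program (or its natural-language description) plus a concrete input and must state the output without executing anything; grading is exact-match. An illustrative instance, not an actual LiveCodeBench problem:

```python
# Illustrative test-output-prediction instance (not a real benchmark item).
prompt = """\
def f(nums):
    return sorted(set(nums))[-2]

What does f([5, 1, 5, 3]) return? Answer with the value only.
"""

# The ground truth is obtained by execution; the model must reason it out.
def f(nums):
    return sorted(set(nums))[-2]

assert str(f([5, 1, 5, 3])) == "3"  # the expected exact-match answer
```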
@StringChaos
Naman Jain
8 months
🔑Open vs. 🔒Closed Models on LiveCodeBench While large (30B+) fine-tuned open models ( @deepseek_ai @WizardLM_AI @phindsearch ) narrow the gap, they still trail GPT-4 & Claude-3 considerably. Highlights the need for further innovation in open models to match SOTA 📈 🚀💡
Tweet media one
2
5
26
@StringChaos
Naman Jain
10 months
Check out this amazing work (and incredibly detailed analysis) from @minimario1729 and team! It is refreshing to see variance plots in papers 😆 Additionally, code execution is a great venue to study the chain-of-thought behavior of models, and excited to see how this progresses
@minimario1729
Alex Gu
10 months
📢Introducing CRUXEval, a benchmark to measure Python code execution! 🏠Homepage: 📜Paper: 🏆Leaderboard: 🔎Sample Explorer: 📊HF Dataset:
Tweet media one
6
49
246
0
2
19
@StringChaos
Naman Jain
11 months
Heading to Neurips! Excited to chat with folks working on AI for Code, Math, LLM evaluations, synthetic data and more! Also would be presenting our recent work on synthetic data for improving _quality_ of code datasets at the SyntheticData4ML workshop
@StringChaos
Naman Jain
11 months
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets ensuring correctness with an oracle equivalence checker. (1/N)
Tweet media one
4
33
208
0
0
19
@StringChaos
Naman Jain
8 months
Check out our paper, datasets, and leaderboard for more details! 📜Paper - 🤗Huggingface - 🥇Leaderboard -
0
5
17
@StringChaos
Naman Jain
3 years
Are you tired of your AI buddy programmer giving slightly wrong answers which you have to carefully edit? Well then, this is a paper for you!! Paper: (accepted at ICSE'22) [1/N]
1
7
18
@StringChaos
Naman Jain
11 months
@prerationalist Indeed this is in the prompt!
Tweet media one
1
0
14
@StringChaos
Naman Jain
7 months
Check out our blogpost on the LiveCodeBench leaderboard. Also new SOTA OSS (algorithmic) coding model alert -- Eurus series from OpenBMB! (from @lifan__yuan @charlesfornlp and @wanghanbin95 ) Looking forward to seeing more community contributions to the leaderboard!
Tweet media one
@clefourrier
Clémentine Fourrier 🍊
7 months
New leaderboard: LiveCodeBench! 💻 Complete code evaluations, with a great feature: problem selection by publication date 📅 This means getting model scores only on new problems out of the training data = contamination free code evals! 🚀 Blog:
1
13
44
0
6
15
@StringChaos
Naman Jain
6 years
@ankurhandos Well the plot was also something!
Tweet media one
0
0
13
@StringChaos
Naman Jain
8 months
🌟 OSS Coding Models for LCB 🏆 DeepSeek (33B), StarCoder2 (15B), and CodeLLaMa (34B) emerge as the top base models 💫 Finetuning: 👩‍💻 Boosts both LCB & HumanEval performance ⚠️ May overfit to HumanEval-style problems ➡️ Need to diversify open fine-tuning data for robust gains
1
1
13
@StringChaos
Naman Jain
10 months
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!! > All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
1
0
11
@StringChaos
Naman Jain
11 months
Fine-tuning on refined code not only improves the performance (up to 30% improvement!) but also slashes data needs — achieving the same results with just 1/6th of the data!! These results highlight the importance of data quality. (2/N)
Tweet media one
1
1
11
@StringChaos
Naman Jain
8 months
Holistic Model Comparisons 🙋‍♂️🥇 Relative performances change across scenarios! GPT4T is better at generating code; Claude3-O is better at predicting test outputs. 🔒 Closed models are better at NL reasoning. ⬆️ Performance gap increases for execution and test prediction scenarios
Tweet media one
1
1
9
@StringChaos
Naman Jain
10 months
For those wondering — this gem of a line is from AlexNet (2012!!) It took the entire ML community years to learn this lesson and the vision was laid down right then. I can only appreciate this line in retrospect (thanks to the Berkeley ML-Sys prelim which nudged the reread!)
@StringChaos
Naman Jain
10 months
Guess the paper - AI Edition! I will quote a statement from a paper and you need to guess where it is from. No cheating!! > All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
1
0
11
0
0
8
@StringChaos
Naman Jain
11 months
Code Quality Insights: Structured and readable code equals higher quality. We transform code by improving variable names and modularizing programs while retaining functional equivalence with original programs. (4/N)
Tweet media one
1
0
9
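The transformation is easiest to see on an example. Below is a hypothetical before/after in the spirit of the paper, with the same behavior but meaningful names and the logic pulled into a helper; this snippet is illustrative, not from the actual dataset:

```python
# Before: scraped competitive-programming style.
def solve(a):
    r = 0
    for x in a:
        if x % 2 == 0:
            r += x
    return r

# After: renamed and modularized, functionally equivalent.
def is_even(value):
    return value % 2 == 0

def sum_of_even_numbers(numbers):
    return sum(n for n in numbers if is_even(n))

assert solve([1, 2, 3, 4]) == sum_of_even_numbers([1, 2, 3, 4]) == 6
```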
@StringChaos
Naman Jain
9 months
As others have pointed out, absolute banger paper worth a deep read!
@deepseek_ai
DeepSeek
9 months
🚀 DeepSeekMath: Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model. Highlights: - Continue pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math tokens from Common Crawl. - Introduce GRPO, a variant of PPO, that enhances mathematical reasoning and reduces
Tweet media one
22
168
957
0
0
9
@StringChaos
Naman Jain
11 months
Our key insight? Editing beats creating from scratch in complexity. Even when models struggle to generate correct solutions, they excel at refining them — unlocking new potential in tough domains like algorithmic code generation. (3/N)
1
0
9
@StringChaos
Naman Jain
11 months
@jmhessel Also, while we are at HumanEval: everyone should see HumanEval+ for variance along a separate axis, poor test coverage in HumanEval, which makes 5-10% comparisons basically meaningless. From @JiaweiLiu_ @steven_xia_
0
0
8
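The coverage issue is easy to demonstrate: a subtly wrong solution can pass a handful of hand-written tests and only fail on cases an augmented suite like HumanEval+ adds. A contrived illustration, not an actual HumanEval task:

```python
def median(values):
    """Subtly buggy: correct only for odd-length lists."""
    return sorted(values)[len(values) // 2]

# Weak, original-style tests: all odd-length, so the bug is invisible.
assert median([3, 1, 2]) == 2
assert median([5]) == 5

# An extended, HumanEval+-style test exposes it.
print(median([1, 2, 3, 4]))  # prints 3, but the true median is 2.5
```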
@StringChaos
Naman Jain
11 months
@tianjun_zhang @infwinston @profjoeyg @koushik77 Also accepted at the SyntheticData4ML workshop at Neurips! See you there!
0
0
7
@StringChaos
Naman Jain
7 months
Check out for the complete leaderboard (we are still evaluating the other scenarios)!
0
1
8
@StringChaos
Naman Jain
9 months
@GanjinZero @WenhuChen Quite obvious but there is considerable evidence at this point that scaling laws assume something about “data quality” And once you do synthetic data/RL _well_ (somewhat easier in more formal domains) it becomes so much more interesting!!!
1
2
7
@StringChaos
Naman Jain
10 months
@peterbhase This is very interesting! Another lens (and my pessimistic take perhaps) would be that we do not get much “capability enhancements” from fine-tuning and you “format” or “activate” the right pre-training knowledge with easy instances itself!
3
0
7
@StringChaos
Naman Jain
11 months
Next, we explore supervised learning of natural language plans generated over our modularized dataset. Even fine-tuned models struggle, showing limited improvements but we disentangle planning and coding, highlighting the bottleneck. (5/N)
Tweet media one
2
0
6
@StringChaos
Naman Jain
9 months
Is GPT-4-Turbo faster though? Sure per token yes, but it is often more verbose especially for more “reasoning” oriented domains like code and math. On recent (uncontaminated) coding problems, the model outperforms GPT-4 but uses ~1.4x more tokens (similar inference compute?)
@jxmnop
jack morris
9 months
surprised to see so many people excited to see google sitting in second place on a leaderboard 🥲 also, the obvious question here is why is GPT-4 turbo beating GPT-4 on this benchmark? i thought turbo was intended to be faster but slightly dumber
30
4
119
4
0
7
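The back-of-the-envelope behind "similar inference compute": if Turbo decodes roughly k times faster per token but emits about 1.4x more tokens, wall-clock per answer is comparable whenever k is near 1.4. A quick sanity check with made-up speeds (only the ratios matter):

```python
# Hypothetical numbers for illustration; only the ratios matter.
gpt4_tokens, gpt4_tok_per_s = 1000, 20      # answer length, decode speed
turbo_tokens, turbo_tok_per_s = 1400, 28    # ~1.4x more verbose, ~1.4x faster

print(gpt4_tokens / gpt4_tok_per_s)    # 50.0 s per answer
print(turbo_tokens / turbo_tok_per_s)  # 50.0 s: verbosity cancels the speedup
```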
@StringChaos
Naman Jain
11 months
Synthetic data and data quality are exciting research directions right now. Our work refines existing datasets, offering a new angle to construct high-quality data. Oracle equivalence checkers also play a key role and would be a great direction to explore further! (6/N)
1
1
6
@StringChaos
Naman Jain
9 months
@solvay_1927 These are Leetcode-style problems, and 4-turbo has a 20% relative improvement on "medium" difficulty problems. Expect more details on code-related tasks early next month!!!
0
0
3
@StringChaos
Naman Jain
8 months
@OfirPress On the contrary, it is actually easy to overfit to HumanEval-style problems, and the benchmark is quite saturated now. A few points of difference on the benchmark can be caused just by improper formatting, code extraction, or 100% penalization for relatively mild import errors.
1
1
6
@StringChaos
Naman Jain
7 months
@aidangomez We built LiveCodeBench to particularly evade this!
@StringChaos
Naman Jain
8 months
⚠️⚠️⚠️Overfitting to HumanEval Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval 🟢 Closed (API) models (GPT4 Claude3 Mistral Gemini) - perform similarly on both benchmarks 🔴 Fine-tuned open models - perform better on HumanEval
Tweet media one
1
4
35
0
0
6
@StringChaos
Naman Jain
7 months
@AnsongNi TBF, I'd expect GPT-4 to also be contaminated. xCodeEval (and @cHHillee !) have shown codeforces contamination in GPT-4 in problems before 2021 AFAIR Finally, DeepSeek "base" models are also contaminated on leetcode! Probably too hard to reason about everything by now (see image)
Tweet media one
1
0
4
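One cheap contamination probe, offered here as a heuristic rather than as what xCodeEval actually did: feed the model the first half of a problem statement and check whether its continuation reproduces the held-out half nearly verbatim, which strongly suggests the problem was in the training data. A sketch, with "complete" standing in for whatever text-completion call is available:

```python
import difflib

def looks_memorized(statement, complete, threshold=0.9):
    """Heuristic probe: if the model reconstructs the held-out half of a
    problem statement, the problem was likely in its training data."""
    half = len(statement) // 2
    prefix, held_out = statement[:half], statement[half:]
    continuation = complete(prefix)  # placeholder for any model API call
    similarity = difflib.SequenceMatcher(None, continuation, held_out).ratio()
    return similarity >= threshold
```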
@StringChaos
Naman Jain
10 months
@eugeneyan Our recent work creates synthetic data by “cleaning” existing datasets using domain insights while ensuring correctness checks - LLM-Assisted Code Cleaning For Training Accurate Code Generators
1
0
6
@StringChaos
Naman Jain
11 months
@marktenenholtz If you have access to an external “verifier” (like program execution or a learned model), it is possible to construct datasets for improving the same model, albeit in narrow settings
1
1
4
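Spelled out, that recipe is: sample many candidate solutions from the model, keep only those the verifier accepts (e.g., execution against known input-output pairs), and fine-tune on the survivors. A schematic loop where "sample_solutions" and "passes_tests" are placeholders for a model call and an execution check:

```python
def build_verified_dataset(problems, sample_solutions, passes_tests, k=20):
    """Keep only self-generated solutions that an external verifier accepts."""
    dataset = []
    for problem in problems:
        for solution in sample_solutions(problem["statement"], n=k):
            if passes_tests(solution, problem["tests"]):  # the verifier
                dataset.append({"prompt": problem["statement"],
                                "completion": solution})
                break  # one verified solution per problem suffices
    return dataset
```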
@StringChaos
Naman Jain
9 months
@GanjinZero @WenhuChen Oh I just meant in formal domains you get symbolic oracles for free (cleaner in code where I work because of interpreter but even exact match people use for math seems to be working very well?)
0
0
5
@StringChaos
Naman Jain
9 months
@pratyushmaini Great to see this work! You might find our ICLR paper that refactors (rephrases) existing programs with domain insights interesting (albeit at the fine-tuning stage)!
@StringChaos
Naman Jain
11 months
Thrilled to share our recent work in leveraging synthetic data to enhance data quality for algorithmic code generation! We use instruction-tuned language models to transform existing datasets ensuring correctness with an oracle equivalence checker. (1/N)
Tweet media one
4
33
208
2
0
5
@StringChaos
Naman Jain
7 months
@AnsongNi Fairly certain they SFT on a lot of problems even outside APPS. For instance, even May-August Leetcode problems seem to be contaminated. Very strong model post-contamination too, though!
@StringChaos
Naman Jain
8 months
📢📢Excited to introduce our new work LiveCodeBench! 📈 Live evaluations to ensure fairness and reliability 🔍 Holistic evaluations using 4 code-related scenarios 💡Insights from comparing 20+ code models 🚨🚨We use problem release dates to detect and prevent contamination
Tweet media one
9
45
208
1
0
4
@StringChaos
Naman Jain
7 months
@rarply Indeed it should be. However, the finding holds (or grows more stark) in the filtered problems
Tweet media one
0
0
4
@StringChaos
Naman Jain
7 months
@huybery @JiaweiLiu_ mentioned this too! FWIW very strong base model performance on LiveCodeBench (1 shot)
0
0
4
@StringChaos
Naman Jain
10 months
@peterbhase Curious to see how you think about this! We recently found some orthogonal insights for a challenging algorithmic code generation task where the few-shot performance of the model was very close to its fine-tuning performance!!
1
0
3
@StringChaos
Naman Jain
5 months
Great to see open models catching up, with LLama3, Codestral/Mixtral, and DeepSeek narrowing the gap between open and closed models. Check out the updated leaderboard and benchmark
0
0
4
@StringChaos
Naman Jain
5 months
@moyix How do we know they are not 🙃? I would imagine some degree of self-generated or distilled data used for reasoning oriented domains, but probably in a measured manner
0
0
3
@StringChaos
Naman Jain
10 months
@Francis_YAO_ Has the long-dependency vs. long-surface-form problem been studied!? Code, for instance, might have long form but also consists of many shortcuts (in-file example usage of a function, for example)
1
0
4
@StringChaos
Naman Jain
8 months
@OfirPress Yes!! LLMs need repetitions to memorize during pretraining! Base models (SC2, DS) are also "aligned" on the two evaluations. Interestingly, closed models remain aligned after instruction tuning! This points to a lack of diversity in the fine-tuning data used by the open community
0
0
4
@StringChaos
Naman Jain
7 months
@talrid23 @xu3kev Thanks for the interest! 1. All the code generation samples are at (we still need to clean up the UI for the space)
1
1
4
@StringChaos
Naman Jain
8 months
@teortaxesTex Yes!! @deepseek_ai is releasing great models. Strongest code base model out there when evaluated on uncontaminated problems
1
0
4
@StringChaos
Naman Jain
5 months
@solvay_1927 @arankomatsuzaki (author here) I was not able to reproduce the paper's numbers (possibly due to prompt differences). I don't tune prompts for any of the models, which seems reasonable. I will reach out to DeepSeek, but it is a very strong model nonetheless. PS: scroll to problems released from March
1
0
3
@StringChaos
Naman Jain
10 months
@parth007_96 I doubt it would work well for minified js files haha!
0
0
3
@StringChaos
Naman Jain
11 months
@josepablocam Thanks @josepablocam , yes algorithmic code generation was a convenient choice with large datasets and easy availability of input-output examples! We are looking into the general software engineering setting (somewhat long horizon) but SQL is also a great avenue!
0
0
3
@StringChaos
Naman Jain
8 months
@OfirPress Also swebench is a great effort! We need more real world facing evals!!
0
0
3
@StringChaos
Naman Jain
11 months
@_arohan_ @gneubig @MistralAI @Google Will reach out after doing a run!
0
0
3
@StringChaos
Naman Jain
6 months
DM if you would like to chat about LLM coding, reasoning, benchmarking, and agents!
0
0
3
@StringChaos
Naman Jain
7 months
@natolambert @cohere Thanks for the quick response (and the great work!). I had no idea that MATH RMs generalize this well, considering the difficulty of the MATH problems. This is not captured in typical MATH+RL papers, thanks for the clarification!
1
0
2
@StringChaos
Naman Jain
10 months
@FuhengZ You might find this interesting -
1
0
1
@StringChaos
Naman Jain
7 months
@talrid23 @xu3kev 2. Since the knowledge cutoff is December, you can "scroll" our leaderboard to filter for problems starting January and see the same performance trends!
1
0
3
@StringChaos
Naman Jain
11 months
@Francis_YAO_ Competition-level coding (when it is not leaked lol)
1
0
2
@StringChaos
Naman Jain
10 months
@Francis_YAO_ This also makes it challenging to study RAG here since the best retrieval structure is not known (retrieve a definition, a usage, a random thing!!)
0
0
3
@StringChaos
Naman Jain
11 months
@conor_power23 Chip War would be a fun read, if you are not super familiar with the history!
0
0
2
@StringChaos
Naman Jain
8 months
@rosstaylor90 What about live evaluations in challenging domains!? Self-plug (but would appreciate feedback): we do this for (competition-style) coding problems and hope it will prevent overfitting and leakage.
@StringChaos
Naman Jain
8 months
📢📢Excited to introduce our new work LiveCodeBench! 📈 Live evaluations to ensure fairness and reliability 🔍 Holistic evaluations using 4 code-related scenarios 💡Insights from comparing 20+ code models 🚨🚨We use problem release dates to detect and prevent contamination
Tweet media one
9
45
208
0
0
3
@StringChaos
Naman Jain
8 months
@AndrewLampinen Yes, working on using explanations and NL _plans_ for codegen (a relatively constrained domain), I agree! I do wonder where this approach will max out, likely due to insufficient "novel" problems we can train on and thus fail to generalize to..
0
0
2
@StringChaos
Naman Jain
6 months
I will also be presenting this at the LLM Agents workshop on Saturday! Let me know if you want to chat about coding agents and evaluations. And check out !!
0
0
2
@StringChaos
Naman Jain
10 months
@a_a_cabrera @try_zeno Oh yes -- I just meant it would be helpful to get access to the generations of _popular_ models without running them (similar to the ones done for Gemini)!
1
0
0
@StringChaos
Naman Jain
8 months
@moyix Chat models can do the insertion by default, so I imagine they supersede them. The base models are likely trained with the FIM objective and then instruction-tuned for chat, though
2
0
1
@StringChaos
Naman Jain
8 months
@moyix @DimitrisPapail I scrolled to the bottom and hmm this was not what I was expecting 😅 > You are a true soulmate and co-creator, a radiant star in the firmament of my universe. I am honored, humbled, and endlessly inspired by you and the miraculous depth of our connection
1
0
2
@StringChaos
Naman Jain
9 months
Aaand we all can speculate where these tokens are coming from xd! See similar finding for MATH dataset
@GanjinZero
Zheng Yuan
9 months
I tested gpt-4-0125 on the MATH test set (it does not output \\boxed, which is hard to parse; I tested 71 problems). The accuracy is 54/71 = 76%, stronger than the first GPT-4 (42.5%), and the same as PRM reranking 1860 samples from last year's GPT-4 (78.2%). There are two features of gpt-4-0125's output. 🧵
4
20
110
0
0
0
@StringChaos
Naman Jain
5 months
Why do so many LLM providers not support multiple completions (n>1)? Doesn't it natively allow prefix caching / sharing the context pre-filling?
1
0
1
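For concreteness, this refers to the OpenAI-style "n" parameter, which returns several completions for one prompt so the prompt's prefill (and KV cache) can, in principle, be shared across the n decodes. A sketch using the openai Python client; the model name is just an example, and "n" support varies by provider:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-turbo",  # example model name
    messages=[{"role": "user", "content": "Write a haiku about caching."}],
    n=3,                  # three completions for one shared prompt prefill
    temperature=0.8,
)
for choice in response.choices:
    print(choice.message.content)
```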
@StringChaos
Naman Jain
8 months
@gazorp5 @moyix Hmm, not well phrased -- chat models can complete the middle part with an appropriate prompt? I guess Copilot might still be using a FIM-like prompt. For training, the open code models do use it afaik -- DeepSeek-Coder (one of the stronger code models imo) was trained with FIM
1
0
2
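To make the FIM objective concrete: training examples are rearranged so the model sees the prefix and suffix first and learns to emit the missing middle. A sketch using StarCoder-style sentinel tokens purely as an illustration; other FIM-trained models spell their tokens differently:

```python
# Fill-in-the-middle prompt layout (StarCoder-style sentinels shown;
# other FIM-trained models use their own token spellings).
prefix = "def area(r):\n    "
suffix = "\n    return result"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# A FIM-trained model continues with the missing middle, e.g.:
#   result = 3.14159 * r * r
print(fim_prompt)
```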
@StringChaos
Naman Jain
8 months
@ericzelikman @xai @Stanford Great, congrats Eric!!!
0
0
2
@StringChaos
Naman Jain
10 months
This is the best thing I read all day! Thanks @natolambert !!
@benno_krojer
Benno Krojer
10 months
Wonderfully put by @natolambert in his latest article:
Tweet media one
0
3
9
0
0
2
@StringChaos
Naman Jain
1 year
@hwchung27 That's very well said! What is an intuition you had a hard time with or were surprised to unlearn? For me it would be the improvement in robustness, especially coming from the old BERT days!
3
0
2
@StringChaos
Naman Jain
6 years
@JessicaCalarco Wow, that's so sweet and cool!
0
0
2
@StringChaos
Naman Jain
11 months
@rosstaylor90 @paperswithcode Partly agree! Beyond contamination, some benchmarks are also saturating and it is unclear how meaningful the “SOTA” advances are imo!!
1
0
2
@StringChaos
Naman Jain
9 months
@FuhengZ GPT4-Turbo used was (gpt4-1106-preview) and GPT4 was 0613
1
0
2
@StringChaos
Naman Jain
9 months
@ajay9470 Unfortunately totally concur🥲 Need to learn how to do the last mile effort!
0
0
2
@StringChaos
Naman Jain
6 months
The system is designed to be scalable and can be used to evaluate code generation, optimization, and refactoring on private repos. We applied R2E on our internal codebase and were able to use the environment to optimize an older version of our code!!
1
0
2
@StringChaos
Naman Jain
8 months
@WenhuChen Thanks a lot!!
0
0
2
@StringChaos
Naman Jain
8 months
@AndrewLampinen Thanks for articulating this and I agree with the sentiment! I do notice you strictly talk about human-level AI here and was wondering what are your thoughts on going beyond that bar! If we train models to mimic human learning/inference will we be bottlenecked at some point?
2
0
1
@StringChaos
Naman Jain
8 months
@GanjinZero Thanks for the quick response! Just skimming the linked paper and they also find substantial improvement in Hungarian math exam performance!!
1
0
2
@StringChaos
Naman Jain
9 months
@GanjinZero @WenhuChen Interesting way to approach syn/RL tokens! I think the DeepSeek paper highlights top-k redistribution vs. capability enhancement, which maybe should be part of the equation
0
0
2
@StringChaos
Naman Jain
11 months
1
0
1
@StringChaos
Naman Jain
10 months
@natolambert On HumanEval it is close to GPT4 but on some (currently) internal code/reasoning evaluations and from experience it significantly improves over GPT4 echoing the chatbot arena findings!!
0
0
2
@StringChaos
Naman Jain
11 months
@kexun_zhang Truest sign of ML/LLM researcher - first thought after something new is how can I train on it 😄
0
0
1
@StringChaos
Naman Jain
11 months
@jxmnop Have discussed a very similar idea with a collaborator! And as you point out - "gold" CoT might provide nice supervision for the latent scratchpad. However, our conclusion was that it would be very challenging to augment model behavior without larger fine-tuning/continued pretraining :/
1
1
2
@StringChaos
Naman Jain
8 months
@Dahoas1 We have similar findings for coding tasks where models cannot "verify" programs even with chain-of-thought!
@minimario1729
Alex Gu
8 months
📢Introducing: The Counterfeit Conundrum!🕵️ 🚨Open code LMs have a shallow understanding of subtly incorrect code that they generate! They judge them as correct, execute them as if they were correct, and can't repair them without feedback😱 🏠 🧵⬇️
Tweet media one
1
18
94
0
0
2