[1/n]
🚀 Excited to share our latest work on OpenCodeInterpreter! With a blend of execution results and human feedback, we've achieved significant advancements in code generation. Here are the key points:
✨ Introducing OpenCodeInterpreter - a leap in iterative code refinement.
I'm extremely excited to announce "the big bomb": Neo and Matrix, which we're working on with colleagues and friends from the open-source community, , Wuhan AI, and . Neo is the first fully-transparent bilingual large language model, with
[1/n]
🎉🎉🎉 Excited to share our latest work: "The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis"! We delve into the dynamics of LLMs across different scales and domains.
💡Highlights include:
🗺️ Comprehensive Model Evaluation:
What happens when decision networks encounter multimodal instruction?
We explore enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions, thereby facilitating a “read-to-play” capability.
Paper:
[1/n]
🚀🚀🚀 Excited to share our latest work: "CodeEditorBench: Evaluating Code Editing Capability of Large Language Models"!
### 🧐 Highlights of the CodeEditorBench:
> 8K meticulously collected code editing questions from five sources, namely
[1/n]
Happy to unveil the pioneering work behind "Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model". Our study presents CT-LLM, a model engineered with a deep focus on the Chinese language, marking a significant shift from traditional models that often
[1/n]
Happy to share our new work "MuPT: A Generative Symbolic Music Pretrained Transformer", encompassing a series of music generation models ranging from 190 million to 4.2 billion parameters, all based on the ABC Notation. According to human preference evaluations, our models
Personally speaking, I believe that the paradigm of cleanly splitting pretraining and alignment won't last long. There are tons of valuable instruction corpora, or plain extracted text with humanly understandable implicit motivations and structures, existing in today's
MAmmoTH2: Scaling Instructions from the Web
- Proposes a paradigm to efficiently harvest 10M instruction data from web corpus to enhance LLM reasoning
- 11% -> 34% on MATH and 36% -> 67% on GSM8K
proj:
abs:
Thanks for sharing our work!
Congrats to Tianle and other team members!
In-Context Learning capacity is a strong indicator of whether long-context LLMs can really achieve compositional reasoning the way they can when the context is shorter. Needle-in-a-haystack focuses more on
Long-context LLMs Struggle with Long In-context Learning
Suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences.
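To make the distinction concrete, here is a toy sketch of a long in-context learning probe (helper name and data are hypothetical, not the LongICLBench code):

```python
# Toy long in-context learning probe (illustrative only, not the
# LongICLBench implementation): pack many labeled demonstrations into
# one prompt and ask the model to label a held-out input, so the whole
# context must be used compositionally, not just searched.
def build_long_icl_prompt(demos, query):
    """demos: list of (text, label) pairs; query: unlabeled text."""
    lines = [f"Input: {t}\nLabel: {l}" for t, l in demos]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

demos = [("the movie was great", "positive"), ("boring plot", "negative")]  # toy data
prompt = build_long_icl_prompt(demos * 2000, "a delightful surprise")
# A needle-in-a-haystack test would instead plant one fact inside long
# filler text and only ask the model to retrieve it verbatim.
```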
We are delighted to announce CMMMU, A Chinese Massive Multi-discipline Multimodal Understanding Benchmark. CMMMU is inspired by and strictly follows the pattern of MMMU. CMMMU starts the race of Bilingual LMMs on complex reasoning!
MAP-Neo is another good example!()
We release all resources needed to be on par with Mistral v0.2 and surpass LLaMA-2 (code / pretrain corpus / data cleaning pipeline / etc.) at a similar size.
We are also glad to announce that we open-source the phase-2 SFT
We should call models like Llama 3, Mixtral, etc. “open-weight models”, not “open-source models”. For a model to be open-source, the code and training data need to be public (good examples: GPT-J, OLMo, RedPajama, StarCoder, K2, etc.). Weights are like an exe file, which would be
🚀 Excited to announce that the tech report of MAP-Neo (): a fully open-source and transparent bilingual LLM suite with superior performance to bridge the gap with closed-source models, is now available:
🔧MAP-Neo's workflow
Thanks for tweeting COIG-CQIA! lol!
Yea, indeed, we devoted heavy human effort to manually collecting and cleaning SFT corpora from Ruozhiba, a Chinese sarcastic-joke forum, and it indeed makes Yi-34B better!
COIG-CQIA is a scaled-up Chinese version of LIMA to support open-source SFT for
[1/n]
New Paper Alert! Funny II-Bench Coming!
Excited to introduce II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models. Check it out:
In this study, we aim to evaluate MLLMs' higher-order perception of images.
I'm heading to Vienna after a nice trip and break.
I will present MAmmoTH (MAmmoTH-2 coming soon as well) and have discussions about 4 of my recent works (MERT, MALMEN, STABLE-Alignment and MAmmoTH) with my amazing friends, mentors, and colleagues!
I also bring a big
Glad to announce that StructLM and CT-LLM are accepted to
#COLM2024
.
CT-LLM verifies that Chinese, like any language besides English, can generalize to emergent abilities in other languages.
StructLM is the SoTA foundation model for tackling different
Delighted to have two papers accepted to
#COLM2024
.
The first paper is StructLM. It's the SoTA foundation model for tackling all different types of structured knowledge grounding tasks like tableQA, KBQA, etc.
This work was led by an undergrad
@alexzhuang_
To be honest, a reviewer is qualified to review a pretraining-related paper only if they are a core member of at least one open large language model effort. Otherwise, it is really disappointing for an LLM conference like COLM.
@SpokespersonCHN
Winged words indeed! P.P. 10043 not only destroys young scholars' academic dreams but also destroys people's confidence in academic integrity and in the racial climate in the US.
Check out our recent work on MMMU (Massive Multi-discipline Multimodal Understanding)!
A multimodal foundation model should be able to understand images and reason over them the way LLMs do on MMLU. It's time to take the next step towards better multimodal models!
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.
🧐 Highlights of the MMMU benchmark:
> 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks
>
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
- Presents CT-LLM, a 2B LLM
- Open-sourcing the full process of training, including a detailed data processing procedure
hf:
abs:
MuPT: A Generative Symbolic Music Pretrained Transformer
Presents a series of pre-trained models for symbolic music generation based on Llama architecture
proj:
abs:
OpenCodeInterpreter
Integrating Code Generation with Execution and Refinement
The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems
Glad to see that Claude-3.5-Sonnet performs amazingly on our recently released II-Bench () and significantly surpasses the existing MLLMs!
It really understands human humor and visual implications!
M-A-P/Neo-7B-Instruct is the 1st 💎fully-open💎 LLM on the WildBench leaderboard and its performance is awesome. "Fully open-source" here means that all data for pre-training & post-training are open and the code is open-source, in addition to the public model weights! As
@percyliang
Amazing!
A real step towards a General World Model that:
1. Simulates world states by generating videos across any domains
2. Allows any-time control with free-text actions
🔥Introducing Pandora 🌏 🪐
a World Model that generates videos of world states with real-time language control 🎥🕹️
Simulate the world across domains in an _interactive_ way!
check out more
@agihippo
Maybe I'm ignorant. The data retrieval pipeline from DeepSeek, the WSD LRS from MiniCPM, the detailed tech reports of Qwen/Yi/DeepSeek/Baichuan/Skywork/... There are a lot of solid contributions from the Chinese open-source LLM community with relatively "limited" computational resources.
DataComp-LM: In search of the next generation of training sets for language models
- Provides a corpus of 240T tokens from Common Crawl
- Trains an LM using their filtered dataset, which performs similarly on NLU tasks w/ 6.6x less compute than Llama 3 8B
proj:
Kudos to the Qwen Team! A SOTA-level code LLM should indeed be capable of editing code to satisfy programmers' requests!
Try out the amazing CodeQwen and our CodeEditorBench!
(4/n) 🧵 CodeQwen is a Debugger
In assessing CodeQwen1.5's proficiency in code modification tasks, we concentrated our evaluation on the CodeEditorBench suite, encompassing four distinct dimensions: Debugging, Translating, Requirement Switching, and Code Polishing. The results
My personal favorite idea of 2024 on letting more people benefit from LLMs:
Domain-specific LLMs are a real thing! There are tons of traditional companies that feel uneasy about sharing their private data and are on the way to building their own LLMs. But they have very limited
Excited to share our latest research paper! 📄📷 In this study, we explore Scaling Law in Domain-specific Continual Pre-training scenarios. Our findings reveal the relationship between model performance and mixture ratios. Check it out here:
#ScalingLaw
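For readers curious what fitting such a law looks like mechanically, here is a minimal sketch with a generic power-law form (the functional form, coefficients, and data below are hypothetical, not our paper's exact parameterization):

```python
# Fit a generic power law relating domain validation loss to the
# domain-mixture ratio r in continual pretraining. Hypothetical form
# L(r) = a * r^(-alpha) + b, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_ratio(r, a, b, alpha):
    return a * np.power(r, -alpha) + b

ratios = np.array([0.1, 0.2, 0.4, 0.6, 0.8])        # hypothetical mixture ratios
losses = np.array([2.90, 2.60, 2.40, 2.30, 2.25])   # hypothetical val losses
params, _ = curve_fit(loss_vs_ratio, ratios, losses, p0=(1.0, 2.0, 0.5))
print(dict(zip(["a", "b", "alpha"], params)))
```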
A preliminary release for researchers in the Chinese community to play with. Will keep updating it and welcome all collaborators to contribute. The general translated corpus doesn't use the OpenAI API, so it's suitable for commercial use as well.
Yinghao
@nicolaus625
will present MERT() at Hall B
#282
from 10:45 am to 12:45 pm today.
I will present MAmmoTH() and discuss MAmmoTH2() at Hall B
#122
from 4:30 pm to 6:30 pm today.
Come and Chat with
The experimental results of MAmmoTH reveal that different math metrics do not necessarily improve simultaneously but can all benefit from transfer learning on a well-designed mixed instruction-tuning set. It's very exciting to be part of the work!
Excited to introduce our latest math generalist model MAmmoTH 🦣, built through instruction tuning. We proposed hybrid "chain-of-thought" & "program-of-thought" training to supercharge LLMs' math reasoning capabilities. 🦣 beats the open SoTA by 20+% on many datasets like MATH.
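For anyone unfamiliar with the two rationale styles, here is a toy contrast (an illustrative sample, not an actual item from the training set):

```python
# Chain-of-thought (CoT) rationales are natural language, e.g.:
#   "Each box holds 12 eggs; 7 boxes hold 7 * 12 = 84 eggs."
# Program-of-thought (PoT) rationales are executable code whose output
# is the answer, offloading arithmetic to an interpreter:
eggs_per_box = 12
boxes = 7
answer = eggs_per_box * boxes
print(answer)  # 84
```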
Thanks a lot! The LLM360 team has done amazing exploration work for the whole fully open LLM community by developing Amber and Crystal, and further scaling to K2! All fully open LLM teams share the same motivation, pursuing LLM Democratization and Real Open Science!
🎉 Congratulations to an awesome fully open source model, by the m-a-p team!
Paper: 📎
Includes great info on:
-Data Curation
-Infra details
-Intermediate checkpoints
-Scaling law
LLM360 is happy to work with this thriving community on open source AI.
StructLM
Towards Building Generalist Models for Structured Knowledge Grounding
Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their
Please welcome K2-65B🏔️, the most performant fully-open LLM released to date.
As a blueprint for open-source AGI, we release all model checkpoints, code, logs, and data.
About K2:
🧠65 billion parameters
🪟Fully transparent & reproducible
🔓Apache 2.0
📈Outperforms Llama 2 70B
🔥Thrilled to announce 📽️VideoScore, the first-ever fine-grained and reliable evaluator/reward model for text-to-video generation tasks, which is trained on 🎞️VideoFeedback, a large-scale and fine-grained human-feedback dataset for text-to-video (T2V) generations.
🤔Why
[3/n]
MMLU, CMMLU, and CEVAL appear to assess overlapping capabilities of the models, leading to similar performance trends. Maybe bilingual LLMs only need to track one of them during the pretraining process.
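One way to quantify that overlap is to correlate per-checkpoint scores across the benchmarks; a minimal sketch (the numbers below are hypothetical placeholders):

```python
# Correlate per-checkpoint accuracies of two benchmarks; a Pearson r
# near 1.0 suggests they track the same underlying capability.
from scipy.stats import pearsonr

mmlu  = [25.1, 27.3, 31.0, 36.8, 41.2]  # hypothetical checkpoint scores
cmmlu = [24.8, 26.9, 30.2, 35.9, 40.5]
r, p = pearsonr(mmlu, cmmlu)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```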
Accelerating the Development of Large Multimodal Models with LMMs-Eval
Repo:
Blog:
We are offering a one-command evaluation API for fast and thorough evaluation of LMMs over 39 datasets (and growing).
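A run looks roughly like this (flags paraphrased from the repo's README, and the model/task names are examples; check the current docs for the exact interface):

```python
# Launch an lmms-eval run from Python; equivalent to invoking the CLI
# directly. Flags paraphrased from the README and may have changed.
import subprocess

subprocess.run([
    "python", "-m", "lmms_eval",
    "--model", "llava",                                    # model family
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",
    "--tasks", "mme,mmbench_en",                           # comma-separated tasks
    "--batch_size", "1",
    "--output_path", "./logs/",
], check=True)
```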
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Presents an any-to-any multimodal LM that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music
proj:
abs:
[1/n]
New Benchmark Alert!
LongIns () is a little "brother" of LongICLBench (), but it offers a more dynamic way of verifying LLMs' long-context reasoning performance.
Each sample in LongIns is composed of multiple
🔟Kudos to my co-leads: Scott Qu,
@liujiaheng2
,
our advisors: Jiajun Zhang, Wanli Ouyang,
@HuangRubio
, and
@WenhuChen
, and solid contributions from the whole team!
1/ Excited to announce the release of our new paper "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training"! We propose a self-supervised music understanding model, attaining overall SOTA performance on 14 MIR tasks.
[5/n]
For the Amber-7B model, there is a noticeable decline in capability in the 200B-300B token range, likely due to the pretrain corpus. Our hypothesis is that the Amber pretrain corpus is not deduplicated well.
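For context on what "deduplicated well" means for a pretrain corpus, a common recipe is near-duplicate detection with MinHash + LSH; a minimal sketch (illustrative, not Amber's actual pipeline):

```python
# Near-duplicate detection with MinHash + LSH (pip install datasketch).
# Keep only the first document of each near-duplicate cluster.
from datasketch import MinHash, MinHashLSH

def minhash(doc, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in doc.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {"d1": "the quick brown fox", "d2": "the quick brown fox jumps", "d3": "unrelated text"}
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for key, text in docs.items():
    sig = minhash(text)
    if not lsh.query(sig):        # no indexed near-duplicate found
        lsh.insert(key, sig)
        kept.append(key)          # first copy of its cluster survives
```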
@SpoxCHNinUS
A lot of comments here complain about not taking foreign students back to China, but without citing a specific discriminatory policy. Instead, PP 10043 is a clearly systematic, racist, discriminatory policy. If the USA is a country with better democracy, why not just set an example?
We've released two new music understanding models: and , which are trained on up to 160K hours of 24 kHz audio. Our models give strong performance on various (≥ 8) music information retrieval tasks. Paper coming soon! 🧵🔛
[2/n]
We collect 20,150 raw images from various renowned illustration websites. After a carefully designed three-stage data filtration procedure (image deduplication, text-to-image ratio control, and human review), we get 1,222 images and 1,434 questions.
II-Bench comprises
[6/n]
It is observable that all models improve their abilities in tasks involving math, physical interaction understanding and commonsense reasoning in a relatively synchronized manner.
[6/n]
Introducing CHC-Bench, an MT-Bench-like benchmark for evaluating models' understanding of Chinese culture, history, traditions, humanities, geography, and STEM across eight main categories.
Thrilled to work with
@alexzhuang_
@WenhuChen
@bigaidream
on the amazing StructLM!
Try our generalized language model for structured knowledge grounding tasks!
[1/n]
Excited to share StructLM🏗️, a series of models fine-tuned to generalize over structured knowledge grounding tasks.
paper:
- We achieve SoTA on 7/18 SKG tasks
- On held out tasks, our 7B model 0-shot is 30% better than 1-shot ChatGPT-3.5.
[7/n]
Upon analyzing the graphs, it is evident that while the trend of increasing performance with larger datasets is present, the actual scores for each model at various training checkpoints do not precisely align with the expected trajectory of the scaling law.
Kudos to the Team! Glad to see that it achieves 37.9 on our CMMMU and 36.6 on MMMU, which is amazing!
Try out our CMMMU on and .
Let's begin the Chinese Multimodal Competition!
[1/5] 🚀 Announcing DeepSeek-VL, sota 1.3B and 7B visual-language models!
Paper:
GitHub:
📚 Diverse training corpus
👯 Hybrid Vision Encoder
🧠 3-stage training strategy
🆓 Totally free for commercial use and fully open-source
[7/n]
We also reproduce the pipeline introduced by DeepSeekMath to retrieve high-quality data from a massive pretrain corpus. The pipeline is available here:
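The core of that recipe is a cheap quality classifier used to score and re-harvest web pages; a minimal fastText-based sketch (file names and thresholds are hypothetical):

```python
# Classifier-based retrieval loop in the spirit of DeepSeekMath:
# train a cheap fastText classifier on seed positives (e.g. known
# high-quality math pages) vs random negatives, score the web corpus,
# keep high-confidence pages, then fold survivors back into the seed
# set and repeat. Requires: pip install fasttext
import fasttext

# train.txt lines: "__label__pos <page text>" or "__label__neg <page text>"
model = fasttext.train_supervised(input="train.txt", epoch=3, wordNgrams=2)

def keep(page_text, threshold=0.9):
    labels, probs = model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__pos" and probs[0] >= threshold
```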
@sivil_taram
I believe it's positive if the query distribution is diverse enough. Several papers, like the Yi tech report, mention the importance of unifying the tone.
[3/n]
Our findings are as follows:
1. A significant difference exists in performance between humans and MLLMs: the highest accuracy achieved by the model is 74.8%, whereas the average accuracy for humans is 90%, with the highest reaching 98%.
2. Closed-source models often outperform
@XueFz
Yes. It might be a general problem that all LLM researchers meet when they share their papers, not only at COLM. Some traditional ML or NLP researchers have difficulty getting the point. Only if you have trained one do you know what is hard and valuable.
The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis
undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints
Our newest paper is released: . It's essential to probe and mitigate discrimination based on human beliefs. 🐶 explores how manually annotated data serves as a benchmark for gender bias probing and mitigation.
@yizhilll
@bigaidream
@chenghua_lin
Emerged or not emerged, that may not be the right question to ask for Theory of Mind (ToM) in
#LLM
. In our theme track paper in the Findings of
#EMNLP2023
@emnlpmeeting
, we asked ourselves (1) what constitutes a machine ToM? (2) How to better evaluate ToM in LLMs?...🧵[1/n]
[4/n]
Overview of CodeEditorBench. CodeEditorBench spans multiple programming languages, selecting initial data from five sources and filtering it based on code length. It enriches the dataset with Large Language Model-generated test cases, which, along with all code, are verified by
@WenhuChen
@RylanSchaeffer
The token count of each global batch is ~8M: we use a global batch size of 1024 with an 8192 context length (1024 × 8192 ≈ 8.39M tokens).
In the decay stage, we use a batch size of 640 with an 8192 context length, so each global batch contains 640 × 8192 = 5.24M tokens.
[6/n]
## Benchmark results:
Evaluating LLMs on CodeEditorBench. All results of models are generated by greedy decoding. Code Debug, Code Translate and Code Requirement Switch are evaluated with pass@1, while Code Polish is evaluated with Mean OptScore. Values outside parentheses
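For reference, with greedy decoding pass@1 reduces to the fraction of problems whose single generated solution passes all tests; a minimal sketch (`generate` and `run_tests` are hypothetical helpers):

```python
# pass@1 under greedy decoding: one sample per problem, so the metric
# is the fraction of problems whose generated program passes all tests.
def pass_at_1(problems, generate, run_tests):
    passed = sum(run_tests(p, generate(p)) for p in problems)
    return passed / len(problems)
```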
[11/n]
We want to remind Open Source researchers that we missed the officially released Amber intermediate ckpts (). We'll also include the results of Amber/OLMo/Pythia in the next version of our paper. Thanks for the reminder from
@BlancheMinerva
.
@billyuchenlin
Thanks for including our MAP-Neo-7B-Instruct-v0.1() in the amazing WildBench()!
Glad to see that MAP-Neo performs well on it given its 7B size!
True Open-Source Power!
[5/n]
CodeEditorBench delineates the spectrum of code editing tasks, including Code Debugging, Code Translating, Code Polishing, and Code Requirement Switching. Each dataset entry shares similar attributes such as title, difficulty, public and private test inputs and outputs, as