🚀NExT-Chat🚀: An LMM for Chat, Detection and Segmentation
All of the demo code, training code, evaluation code, and model weights are released at .
This is a large multimodal model for chat, detection, and segmentation, as shown in the demo video:
So sad to hear the news ()😰. The conclusion of our investigation:
1. Llama3-V can be run using MiniCPM-Llama3-V 2.5's code and config.json after changing parameter names (see the sketch after this list)
2. It behaves similarly to MiniCPM-Llama3-V 2.5 in unrevealed experimental features
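As a concrete illustration of point 1, the renaming step could look like the minimal sketch below; the prefixes in NAME_MAP are hypothetical placeholders made up for illustration, not the actual parameter names involved.

```python
# Minimal sketch: load the Llama3-V checkpoint and rename its parameters so
# they match the names MiniCPM-Llama3-V 2.5's code expects.
# NOTE: the prefixes below are hypothetical placeholders, not the real names.
import torch

NAME_MAP = {
    "model.vision_tower.": "vpm.",        # hypothetical vision-encoder prefix
    "model.mm_projector.": "resampler.",  # hypothetical projector prefix
}

def rename_params(state_dict):
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in NAME_MAP.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
        renamed[new_key] = value
    return renamed

ckpt = torch.load("llama3-v.bin", map_location="cpu")
torch.save(rename_params(ckpt), "llama3-v-renamed.bin")
# The renamed checkpoint can then be loaded with MiniCPM-Llama3-V 2.5's code
# and config.json for inference.
```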
One of the experimental features of MiniCPM-Llama3-V 2.5 is recognizing Tsinghua Bamboo Characters (清华简), a very special and rare type of ancient Chinese character written on bamboo during China's Warring States Period (475 BC-221 BC). These training images are recently
🌟In this work, we propose VPGTrans for building vision-language LLMs (VL-LLMs) at a lower cost (e.g., 10%)🚀🚀🚀.
With VPGTrans, we built VL-LLaMA and VL-Vicuna.
Welcome to try out our VL-Vicuna:
Codes:
Paper:
After receiving the issue from @yangzhizheng1 on GitHub, we launched a serious investigation. We can correctly obtain inference results using the Llama3-V checkpoint with MiniCPM-Llama3-V 2.5's code and config file, following @yangzhizheng1's instruction on GitHub. Even more, we also
For quantitative results, we also tested several Llama3-based VLMs on 1K Bamboo Character images and compared the exact-match predictions for each pair of models.
The overlap between every other pair of models is zero, whereas the overlap between Llama3-V and MiniCPM-Llama3-V 2.5 achieves a
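For readers who want to reproduce this kind of check, here is a minimal sketch of the pairwise comparison described above; the `predictions` dictionary is a placeholder, not our actual evaluation harness.

```python
# Sketch: given per-model predictions on the same 1K Bamboo Character images,
# compute how often each pair of models produces exactly the same output.
from itertools import combinations

def exact_match_overlap(preds_a, preds_b):
    """Fraction of images on which two models give identical predictions."""
    assert len(preds_a) == len(preds_b)
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

def pairwise_overlaps(predictions):
    """predictions: {model_name: [prediction for each image]}"""
    return {
        (m1, m2): exact_match_overlap(predictions[m1], predictions[m2])
        for m1, m2 in combinations(predictions, 2)
    }
```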
The same thing also happens with WebAgent, another unrevealed feature trained on in-house data. They even make identical errors on a WebAgent schema newly defined within our team...
🚀 Excited to introduce MiniCPM-Llama3-V 2.5! With 8B parameters, it’s our latest breakthrough, outperforming top models like GPT-4V. 📈
💪 Superior OCR capabilities
🔑 Supports 30+ languages
HuggingFace:
GitHub:
Big congrats to my friend Kai on his excellent paper💯!
It is inspiring to empower the retrieval model with knowledge from website data and LLMs. Wondering whether this will be used in the Google search engine later😁?
Proud to present 🔍MagicLens: image retrieval models following open-ended instructions.
🌟Highlights of 🔍MagicLens:
>🧐Novel Insights: Naturally occurring image pairs on the same web page contain diverse image relations (e.g., inside and outside views)
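As a rough illustration of what "retrieval following open-ended instructions" means operationally, here is a toy dual-encoder scoring sketch; `query_encoder` and `image_encoder` are hypothetical stand-ins, and this simplification is my reading rather than the released MagicLens code.

```python
# Toy sketch: encode the (query image, instruction) pair into one embedding
# and rank candidate images by dot-product similarity.
import torch

def retrieve(query_image, instruction, candidate_images,
             query_encoder, image_encoder, top_k=5):
    """Return indices of the top-k candidates for the multimodal query."""
    q = query_encoder(query_image, instruction)                 # (dim,)
    cands = torch.stack([image_encoder(im) for im in candidate_images])  # (N, dim)
    scores = cands @ q                                          # (N,)
    return scores.topk(min(top_k, len(candidate_images))).indices
```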
@mervenoyann @Microsoft Great share! Let me also recommend our recent work 🔥NExT-Chat🔥, which can output both boxes and segmentation masks! All code was released recently at .
Real-Time Video Generation: Achieved 🥳
Sharing our latest work with @JxlDragon, @VictorKaiWang1, and @YangYou1991: "Real-Time Video Generation with Pyramid Attention Broadcast."
3 features: real-time, lossless quality, and training-free!
Blog: (🧵1/6)
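Based on the title and the "training-free" claim, here is a toy sketch of the general attention-broadcast idea as I read it (not the paper's exact algorithm): cache an attention output and reuse it for a few consecutive diffusion steps instead of recomputing it every step.

```python
# Toy sketch (my reading, not the paper's algorithm): attention outputs change
# slowly between adjacent diffusion steps, so recompute them only every
# `broadcast_range` steps and reuse ("broadcast") the cached result otherwise.
import torch

class BroadcastAttention(torch.nn.Module):
    def __init__(self, attention, broadcast_range=2):
        super().__init__()
        self.attention = attention            # the wrapped attention module
        self.broadcast_range = broadcast_range
        self.cached_output = None
        self.steps_since_compute = 0

    def forward(self, x):
        if (self.cached_output is None
                or self.steps_since_compute >= self.broadcast_range):
            self.cached_output = self.attention(x)  # recompute and cache
            self.steps_since_compute = 0
        self.steps_since_compute += 1
        return self.cached_output
```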
Will the trend of scaling up AI systems end quickly🤔?
I do not think so. Imagine that there is a model & algorithm that can achieve human-like AI with only one A100 in one day.
What will the big companies do? I think they will just scale up the data and model to seek super AI😂
Since the HuggingFace page of Llama3-V has been removed, we have uploaded both the Llama3-V and MiniCPM-V checkpoints () for comparison. Since this model received several thousand downloads on HuggingFace, there should also be independent copies that can reproduce this.
[p1] 🐕Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward🐕
Paper link: Project page:
How can we effectively train video large multimodal model (LMM) alignment with preference modeling?
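For context, preference-based alignment methods like this one build on the standard DPO objective; the sketch below shows that generic loss over per-sequence log-probabilities, not the paper's exact recipe for deriving rewards from a language model.

```python
# Standard DPO loss (generic form, not the paper's specific method):
# maximize the margin between chosen and rejected responses, measured as
# log-prob ratios of the policy against a frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are per-sequence log-probability tensors of equal shape."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```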
😭😭😭A lesson from 2023 in research: do remember to give a good title to your paper!
In May, I released a paper with a similar idea to MiniGPT-4, where I built 🚀VL-Vicuna🚀 for MM conversation.
However, I had a much worse title (VPGTrans) and weaker promotion. Finally:
@siddkaramcheti @SurajNair_1 @ashwinb96 @percyliang @tkollar @DorsaSadigh
However, when I recently trained an LMM+detection+segmentation model using both VQA and detection data, I also noticed that more epochs were required. So, I am curious whether multi-epoch training will lead to some qualitative difference.
1. In this work, we explore the transferability of the visual prompt generator (VPG) across LLMs, such that one can easily create a novel high-performance vision-language LLM (VL-LLM) without training from scratch at prohibitively expensive cost.
2. We propose VPGTrans, a VPG transfer framework that is simple yet highly effective (sketched below).
3. This work also contributes two novel VL-LLMs: VL-LLaMA and VL-Vicuna.
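As a hedged sketch of the transfer setup (my simplification, not the released VPGTrans code): reuse a VPG trained with one LLM, attach it to a new target LLM, and train only the small projector that maps visual features into the new LLM's embedding space, which is far cheaper than training the VPG from scratch.

```python
# Simplified sketch: freeze the pretrained VPG and the new target LLM, and
# train only the linear projector bridging them. Shapes assumed:
# vpg(images) -> (B, T, vpg_dim); input_embeds -> (B, L, llm_dim);
# llm is an HF-style model accepting inputs_embeds.
import torch
import torch.nn as nn

class TransferredVLLLM(nn.Module):
    def __init__(self, vpg, target_llm, vpg_dim, llm_dim):
        super().__init__()
        self.vpg = vpg                                 # pretrained VPG, frozen
        self.projector = nn.Linear(vpg_dim, llm_dim)   # the only trained part
        self.llm = target_llm                          # new target LLM, frozen

        for p in self.vpg.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, images, input_embeds):
        visual_tokens = self.projector(self.vpg(images))
        return self.llm(
            inputs_embeds=torch.cat([visual_tokens, input_embeds], dim=1)
        )
```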