Ao Zhang@CVPR Profile
Ao Zhang@CVPR

@zhanga6

659 Followers · 136 Following · 10 Media · 45 Statuses

3rd-year Ph.D. student at the @NUSingapore NExT Research Center. Previously at @TsinghuaNLP. Working on MLLMs.

Central Region, Singapore
Joined January 2016
Pinned Tweet
@zhanga6
Ao Zhang@CVPR
6 months
🚀NExT-Chat🚀: An LMM for Chat, Detection, and Segmentation. All of the demo code, training code, evaluation code, and model weights are released at . This is a large multimodal model for chat, detection, and segmentation, as shown in the demo video:
@zhanga6
Ao Zhang@CVPR
1 month
So sad to hear the news ()😰. The conclusions of our investigation: 1. Llama3-V can be run using MiniCPM-Llama3-V 2.5's code and config.json after changing parameter names. 2. It behaves similarly to MiniCPM-Llama3-V 2.5 on unrevealed experimental features.
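The renaming check in point 1 can be sketched as follows. This is a hypothetical illustration: the key prefixes and file paths below are invented, and only the procedure itself (remap the checkpoint's parameter names, then load it with MiniCPM-Llama3-V 2.5's code and config.json) comes from the tweet.

```python
# Hypothetical sketch: remap a checkpoint's state-dict keys so it loads
# with another model's code and config. Prefixes and paths are invented.
import torch

KEY_MAPPING = {
    "model.vision_tower.": "vpm.",        # hypothetical prefix pair
    "model.mm_projector.": "resampler.",  # hypothetical prefix pair
}

def rename_keys(state_dict):
    renamed = {}
    for key, tensor in state_dict.items():
        for old, new in KEY_MAPPING.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        renamed[key] = tensor
    return renamed

state = torch.load("llama3-v/pytorch_model.bin", map_location="cpu")
torch.save(rename_keys(state), "llama3-v-renamed/pytorch_model.bin")
# The renamed checkpoint can then be loaded with MiniCPM-Llama3-V 2.5's
# modeling code and config.json.
```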
@zhanga6
Ao Zhang@CVPR
1 month
One of the experimental features of MiniCPM-Llama3-V 2.5 is recognizing Tsinghua Bamboo Characters (清华简), a very special and rare type of ancient Chinese character written on bamboo during China's Warring States Period (475 BC–221 BC). These training images are recently…
Tweet media one
@zhanga6
Ao Zhang@CVPR
1 year
🌟In this work, we propose VPGTrans for building vision-language LLMs (VL-LLMs) at a lower cost (e.g., 10%)🚀🚀🚀. With VPGTrans, we built VL-LLaMA and VL-Vicuna. Welcome to try out our VL-Vicuna: Codes: Paper:
Tweet media one
Tweet media two
Tweet media three
Tweet media four
@zhanga6
Ao Zhang@CVPR
1 month
@chrmanning @TsinghuaNLP @Stanford Thanks for recognizing our team’s work😊
@zhanga6
Ao Zhang@CVPR
1 month
After receiving the issue from @yangzhizheng1 on GitHub, we launched a serious investigation. We can correctly obtain inference results using the Llama3-V checkpoint with MiniCPM-Llama3-V 2.5's code and config file, following @yangzhizheng1's instructions on GitHub. Even more, we also…
@zhanga6
Ao Zhang@CVPR
1 month
For quantitative results, we also tested several Llama3-based VLMs on 1K Bamboo Character images and compared the exact-match predictions for each pair of models. The overlap between every other pair of models is zero, whereas the overlap between Llama3-V and MiniCPM-Llama3-V 2.5 reaches a…
Tweet media one
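The pairwise comparison can be sketched as below; the model names and toy predictions are placeholders, and the real test used 1K Bamboo Character images.

```python
# Count, for each pair of models, how many images receive identical
# predictions. Per the tweet, this overlap was zero for every pair
# except Llama3-V vs. MiniCPM-Llama3-V 2.5.
from itertools import combinations

predictions = {  # predictions[model][i] = transcription of image i (toy data)
    "Llama3-V": ["甲", "乙", "丙", "丁"],
    "MiniCPM-Llama3-V-2.5": ["甲", "乙", "丙", "戊"],
    "another-Llama3-VLM": ["口", "木", "水", "火"],
}

for a, b in combinations(predictions, 2):
    n = sum(pa == pb for pa, pb in zip(predictions[a], predictions[b]))
    print(f"{a} vs {b}: {n}/{len(predictions[a])} identical predictions")
```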
@zhanga6
Ao Zhang@CVPR
1 month
The same thing also happens with WebAgent, another unrevealed feature trained on in-house data. They even make identical errors in a WebAgent schema newly defined within our team...
Tweet media one
@zhanga6
Ao Zhang@CVPR
22 days
I will be at #CVPR2024 from 16 Jun. to 22 Jun. Happy to meet old friends and make new ones😃. If you are interested in MLLMs, let's discuss!
@zhanga6
Ao Zhang@CVPR
2 months
Comparable to GPT-4V with only 8B params😱. Welcome to check out our new MiniCPM-Llama3-V 2.5.
@OpenBMB
OpenBMB
2 months
🚀 Excited to introduce MiniCPM-Llama3-V 2.5! With 8B parameters, it’s our latest breakthrough, outperforming top models like GPT-4V. 📈 💪 Superior OCR capabilities 🔑 Supports 30+ languages HuggingFace: GitHub:
Tweet media one
Tweet media two
@zhanga6
Ao Zhang@CVPR
3 months
Big congrats to my friend Kai on his excellent paper💯! It is inspiring to empower the retrieval model with knowledge from web data and LLMs. Wondering whether this will be used in the Google search engine later😁?
@DrogoKhal4
Kai Zhang
3 months
Proud to present 🔍MagicLens: image retrieval models following open-ended instructions. 🌟Highlights of 🔍MagicLens: >🧐Novel Insights: Naturally occurring image pairs on the same web page contain diverse image relations (e.g., inside and outside views
@zhanga6
Ao Zhang@CVPR
7 months
@mervenoyann @Microsoft Great share! Let me also recommend our recent work 🔥NExT-Chat🔥, which can output both boxes and segmentation masks! All code was recently released at .
Tweet media one
@zhanga6
Ao Zhang@CVPR
13 days
Real-time video generation🤩🤩🤩. Congrats to Xuanlei and Kai.
@oahzxl
Xuanlei Zhao
13 days
Real-Time Video Generation: Achieved 🥳 Share our latest work with @JxlDragon, @VictorKaiWang1, and @YangYou1991: "Real-Time Video Generation with Pyramid Attention Broadcast." 3 features: real-time, lossless quality, and training-free! Blog: (🧵1/6)
@zhanga6
Ao Zhang@CVPR
7 months
@REVOLVO_OCELOTS @_akhaliq I think it cannot work with LLaVA. Open-sourced models are not powerful enough to support Set-of-Mark prompting.
@zhanga6
Ao Zhang@CVPR
4 months
Will the trend of scaling up AI systems end soon🤔? I do not think so. Imagine there were a model & algorithm that could produce human-like AI with only one A100 in one day. What would the big companies do? I think they would just scale up the data and model to seek super-AI😂
@zhanga6
Ao Zhang@CVPR
1 month
Since the HuggingFace page of Llama3-V has now been removed, we have uploaded both the Llama3-V and MiniCPM-V checkpoints () for comparison. Since this model received several thousand downloads on HuggingFace, there should be independent copies that can reproduce this.
@zhanga6
Ao Zhang@CVPR
6 months
@mervenoyann @Microsoft Great! I am truly excited and grateful for the opportunity!
@zhanga6
Ao Zhang@CVPR
1 month
@giffmana @OpenBMB thanks for sharing😃
@zhanga6
Ao Zhang@CVPR
3 months
Shocked by the performance💥! It also resolves my long-standing confusion about which version of ChatGPT to use for eval.
@RuohongZhang
Ruohong Zhang
3 months
[p1] 🐕Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward🐕 Paper link: Page: How to effectively train video large multimodal model (LMM) alignment with preference modeling?
Tweet media one
@zhanga6
Ao Zhang@CVPR
6 months
😭😭😭A lesson from 2023 in research: do remember to give your paper a good title! In May, I released a paper with a similar idea to MiniGPT-4, where I built 🚀VL-Vicuna🚀 for MM conversation. However, I had a much worse title (VPGTrans) and promotion. Finally:
Tweet media one
@zhanga6
Ao Zhang@CVPR
3 months
@huybery congrats! Just contributed a new fan
@zhanga6
Ao Zhang@CVPR
1 month
@RylanSchaeffer @cfpark1997 thanks for the efforts!
@zhanga6
Ao Zhang@CVPR
1 month
@billyuchenlin thanks❤️!
@zhanga6
Ao Zhang@CVPR
4 months
@siddkaramcheti @SurajNair_1 @ashwinb96 @percyliang @tkollar @DorsaSadigh Great work!!! I have a small question regarding multi-epoch training. Will multi-epoch training cause some difference in qualitative performance?
@zhanga6
Ao Zhang@CVPR
4 months
@siddkaramcheti @SurajNair_1 @ashwinb96 @percyliang @tkollar @DorsaSadigh However, when I recently trained an LMM + detection + segmentation model using both VQA and detection data, I also noticed that more epochs were required. So, I am curious whether multi-epoch training will lead to some qualitative difference.
@zhanga6
Ao Zhang@CVPR
1 year
3. Models built with our VPGTrans can sometimes even outperform the original ones.
Tweet media one
Tweet media two
@zhanga6
Ao Zhang@CVPR
1 year
1. In this work, we explore the transferability of the visual prompt generator (VPG) across LLMs, so that one can easily create a novel high-performance vision-language LLM (VL-LLM) without training from scratch at prohibitively expensive cost.
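A minimal sketch of this idea is below. All module names, dimensions, and the forward interface are assumptions for illustration, not the paper's actual code: the pretrained VPG is reused, and a small projector is newly trained to bridge it to the target LLM.

```python
# Illustrative composition of a VL-LLM from a reused VPG: the VPG
# (e.g., a pretrained visual encoder + Q-Former) is kept, and a linear
# projector maps its outputs into the target LLM's embedding space.
import torch
import torch.nn as nn

class VLLLM(nn.Module):
    def __init__(self, vpg, llm, vpg_dim=768, llm_dim=4096):
        super().__init__()
        self.vpg = vpg                                # reused, pretrained
        self.projector = nn.Linear(vpg_dim, llm_dim)  # the new, cheap part
        self.llm = llm                                # target LLM (e.g., Vicuna)

    def forward(self, images, text_embeds):
        # (B, N, vpg_dim) -> (B, N, llm_dim) visual prompts
        visual_prompts = self.projector(self.vpg(images))
        # prepend the visual prompts to the text embeddings, then run the LLM
        return self.llm(inputs_embeds=torch.cat([visual_prompts, text_embeds], dim=1))
```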
@zhanga6
Ao Zhang@CVPR
6 months
Project page:
@zhanga6
Ao Zhang@CVPR
6 months
Method:
Tweet media one
@zhanga6
Ao Zhang@CVPR
7 months
🤯Every LMM (including my own work) uses downstream tasks to finetune the model. I do not know how to really evaluate an LMM except via human evaluation.
@zhanga6
Ao Zhang@CVPR
4 months
@XiaohuaZhai @giffmana check your mail🙏🙏🙏
@zhanga6
Ao Zhang@CVPR
1 year
2. We propose VPGTrans, a VPG transfer framework that is simple yet highly effective. This work also contributes two novel VL-LLMs: VL-LLaMA and VL-Vicuna.