We are presenting **SAD** (Segment Any RGBD):
SAD is able to perform 3D segmentation (segment out any 3D object) with RGBD inputs (or rendered depth images only).
- Code:
- Demo @huggingface :
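Since SAD builds on SAM, the underlying idea can be conveyed in a minimal sketch: run the Segment Anything Model's automatic mask generator on both the RGB image and a rendered depth map, so segments can be proposed from appearance or geometry cues. This uses the official `segment-anything` package; the file names and checkpoint path are placeholders, and this is not the SAD pipeline itself.

```python
# A minimal sketch (not the SAD release itself): run SAM's automatic mask
# generator on an RGB image and on a rendered depth map, so segments can be
# proposed from appearance or geometry cues. Paths/checkpoint are placeholders.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

rgb = cv2.cvtColor(cv2.imread("scene_rgb.png"), cv2.COLOR_BGR2RGB)
# Depth rendered to a 3-channel color image so SAM can consume it directly.
depth_vis = cv2.cvtColor(cv2.imread("scene_depth_rendered.png"), cv2.COLOR_BGR2RGB)

rgb_masks = mask_generator.generate(rgb)          # appearance-driven masks
depth_masks = mask_generator.generate(depth_vis)  # geometry-driven masks
print(f"{len(rgb_masks)} RGB masks, {len(depth_masks)} depth masks")
```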
🔥MoCap Anybody🔥
#NeurIPS2023
We propose *SMPLer-X*, the first generalist foundation model for 3D/4D human motion capture from monocular inputs.
- Project:
- Paper:
- Code:
- Demo:
Check out **RAM** (Relate Anything Model)!
- We empower the Segment Anything Model (SAM) with the capability to recognize various visual relations between different visual concepts.
- Code:
- Demo @huggingface :
Thrilled to announce **Otter**, a multi-modal in-context learning model with instruction tuning:
1) Chatbot w/ image, video, 3D
2) Needs only 4x 3090 GPUs
3) Better than OpenFlamingo
- Code:
- Demo:
- Video:
🔥Open-Source Video Diffusion Transformer🔥
Similar to #Sora, our ☕️Latte☕️ model also adopts a *video diffusion transformer* architecture, with a thorough study of the *LDM + DiT* design space for video generation
- Project:
- Code:
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.
Prompt: “Beautiful, snowy
We are releasing #OpenSelfSup, an open-source library for self-supervised learning:
- *All methods in one repository*, supporting PIRL, MoCo, SimCLR, etc.
- *All benchmarks in one repository*.
- *Efficiency*, supporting multi-GPU distributed training.
We have released DeepFashion-MultiModal, a large-scale, high-quality human dataset with rich multi-modal annotations:
- 40K high-resolution human images
- Human parsing, keypoints and DensePose annotations
- Attribute and textual description annotations
🔥🔥 We propose #DreamGaussian, a Generative Gaussian Splatting framework that produces high-quality textured meshes in just 2 minutes from a single-view image.
- Project:
- Paper:
- Code:
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
paper page:
Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been
🔥Text-to-3D Foundation Model🔥
We are excited to announce #3DTopia, a generalist 🧊text-to-3D🧊 foundation model, which produces **high-quality 3D assets within 5 minutes**
- Code:
- Video:
🔥Lumina-Next🔥 is a stronger and faster high-res text-to-image generation model. It also supports 1D (music) and 3D (point cloud) generation
- T2I Demo:
http://106.14.2.150:10020/
- Code:
- Report:
- Video:
📢Motion Capture from Any Video📢
The @Gradio demo for 🚀SMPLer-X🚀 (foundation model for monocular 3D human motion capture) is now online, thanks to @_akhaliq !
- Project:
- Code:
- Online Demo @huggingface :
#MotionDiffuse (Diffusion Model for Motion) now has both Colab and @huggingface demos. Feel free to play and generate your favorite animation clip using text :)
- Code:
- Colab Demo:
- @Gradio Demo:
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
abs:
project page:
propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework
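For context, diffusion-based motion generation like MotionDiffuse rests on the standard DDPM machinery. The generic forward (noising) and reverse (denoising) processes, with a text condition c, are sketched below in illustrative notation, not the paper's exact equations:

```latex
% Generic DDPM forward (noising) and reverse (denoising) processes with a
% text condition c; illustrative notation, not MotionDiffuse's exact equations.
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, c),\ \Sigma_\theta(x_t, t)\big)
```

Here $x_t$ is the noised motion sequence at step $t$; sampling the reverse chain conditioned on $c$ gives the probabilistic text-to-motion mapping the tweet mentions.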
📢LLaVA-NeXT(-Video) Announced📢
* LLaVA-NeXT is one of the most competitive open-source VLMs today, approaching GPT-4V
* LLaVA-NeXT-Video extends this capability to long videos, outperforming all existing video LLMs
- Blog:
- Code:
LLaVA-NeXT is Expanded: Support Larger Models and Video Tasks
-🖼️Stronger LLMs Supercharge Multimodal Capabilities in the Wild
🗞️Blog 1:
-📽️A Strong Zero-shot Video Understanding Model
🗞️Blog 2:
-🌠Code:
🔥Unbounded 3D City Generation🔥
#CVPR2024
We propose 🏙️CityDreamer🏙️, a compositional generative model for synthesizing unbounded 3D cities @CVPR
- Project:
- Code:
- Demo @Gradio : , thanks to @_akhaliq !
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
paper page:
In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the
"Segment Any Point Cloud Sequences by Distilling Vision Foundation Models"
We introduce **Seal**, a novel framework that harnesses vision foundation models for segmenting diverse automotive point cloud sequences.
- Paper:
- Code:
We just launched a general video interaction platform based on LLMs, namely **Dolphin**.
Dolphin is a chatbot that can interact with videos, spanning from video understanding to generation/editing.
- Code:
- Demo @huggingface :
🔥Large-Vocabulary 3D Diffusion Model with Transformer🔥
#DiffTF generates massive categories of real-world 3D objects with a single feed-forward diffusion model
- Paper:
- Project:
- Code:
🤩Music to 3D Duet Dance Generation🤩
#ICLR2024
We propose 🕺Duolando💃, a GPT-based model that autoregressively predicts 3D motion for both the leader and the follower dancer @iclr_conf
- Project:
- Paper:
- Code:
"Learning without Forgetting for Vision-Language Models"
* We propose PROjectiOn Fusion (PROOF), which enables VLMs to learn without forgetting.
* Task-specific projections based on the frozen image/text encoders and multi-modal fusion are the key
- Paper:
The code of *SceneDreamer* has been open-sourced. Come and create your own consistent 3D world with only 2D image collections :)
- Project:
- Code:
- Demo @huggingface :
#ICLR2023
"Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction"
Our Voxurf is accepted to @iclr_conf as a **spotlight** presentation, achieving higher reconstruction quality with a 20x speedup.
- Paper:
- Code:
We propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which enables 1) probabilistic mapping, 2) realistic synthesis and 3) multi-level manipulation.
- Code will be available at:
We have released an @OpenMMLab human pose and shape estimation toolbox, "MMHuman3D":
- Popular methods with a modular framework
- Various datasets with a unified data convention
- Versatile visualization toolbox
Welcome to use and contribute to #MMHuman3D
#CVPR2023
Our "Prompting in Vision" Tutorial was a huge success. Thanks so much to our amazing speakers and all the participants!
- The tutorial slides and recordings will be uploaded to our tutorial website:
Our #CVPR2020 **oral** paper "Self-Supervised Scene De-occlusion":
Paper:
Code:
Demo:
- A self-supervised framework that can recover occluded objects & their spatial orders from a single RGB image.
🚀🚀We present #HumanGaussian, an efficient **Text-to-3D Human** framework that generates high-quality 3D humans (geometry and texture) from text only
- Project:
- Paper:
- Code:
- Video:
Our work has been accepted to #CVPR2022:
- ViT generally has better OOD generalization ability than CNN under various distribution shifts.
- Incorporating DA techniques (e.g. adversarial learning, minimax entropy and SSL) into ViT further boosts its generalization ability.
"Delving Deep into the Generalization of Vision Transformers under Distribution Shifts":
Paper:
Code:
- We investigate the OOD generalization of vision transformers.
- We integrate domain adaptation techniques into transformers.
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
paper page:
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks
🎞️Reenact Any Character in Movie🎞️
#NeurIPS2023
🔥SMPLer-X🔥 is the first foundation model for monocular 4D motion capture. Combine #SMPLerX and #Propainter to make your own *La La Land*!
- Code (SMPLer-X):
- Code (Propainter):
🔥🔥We propose #SEINE, a video diffusion model that focuses on generative transition and prediction.
#SEINE supports *video transition generation* and *image-to-video animation*
- Project:
- Paper:
- Code:
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
paper page:
Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips
🔥Mamba with Longer Context🔥
We present 🐍LongMamba🐍, an early exploration of Mamba's **longer-context extrapolation ability**. Our #LongMamba retrieves *nearly perfectly* over a context window of 16384.
- Code:
- Model:
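A toy version of the passkey-retrieval probe behind the 16384-token claim can be sketched with Hugging Face `transformers` (Mamba support landed in v4.39). The checkpoint below is a placeholder base Mamba model, not the LongMamba weights:

```python
# Toy passkey-retrieval probe in the spirit of the 16384-token test above.
# The checkpoint is a placeholder base Mamba model, NOT the LongMamba weights;
# requires transformers>=4.39 for Mamba support.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-1.4b-hf"  # placeholder; swap in the released LongMamba model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

passkey = str(random.randint(10000, 99999))
filler = "The grass is green. The sky is blue. The sun is bright. " * 500  # pad toward ~16K tokens
prompt = (filler + f"The pass key is {passkey}. Remember it. " + filler
          + "What is the pass key? The pass key is")

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0][inputs.input_ids.shape[1]:]))  # should contain the passkey
```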
🤩3D Generation Arena🤩
Lots of 3D generation models have come out recently. But which ones do people actually prefer?
Welcome to play with ~20 #3DGen models in our arena, with both text-to-3D and image-to-3D @_akhaliq
- 3DGen-Arena @huggingface :
📢📢Excited to release 3DGen-Arena, an open 3D benchmarking platform.
⚔️Two tracks: Text-to-3D & Image-to-3D.
🎯Nineteen models: 9 for Text & 13 for Image.
🏆The Leaderboard is waiting for your votes!
Let's play with 3D models and vote at !
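Arena-style leaderboards are typically ranked from pairwise votes; a standard Elo update, sketched below, illustrates the idea (3DGen-Arena's actual scoring scheme is an assumption here, not documented in this thread).

```python
# Hypothetical illustration: arena leaderboards are commonly ranked from
# pairwise votes with an Elo update. 3DGen-Arena's exact scoring scheme is an
# assumption here, not documented in this thread.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one Elo update after a single 'winner beats loser' vote."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))  # P(winner wins)
    return r_winner + k * (1.0 - expected), r_loser - k * (1.0 - expected)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
print(ratings)  # model_a gains exactly what model_b loses
```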
We have released a series of 3D/4D human datasets (including #GTAHuman and #HuMMan ) in *OpenXDLab*. Please feel free to check them out:
- GTA-Human:
- HuMMan:
- MMHuman3D @OpenMMLab :
DreamGaussian4D: Generative 4D Gaussian Splatting
paper page:
Remarkable progress has been made in 4D content generation recently. However, existing methods suffer from long optimization time, lack of motion controllability, and a low level of detail. In
We propose 🚀HyperHuman🚀, a hyper-realistic human image generation foundation model with better quality than Stable Diffusion XL.
- Project:
- Paper:
- Code:
- Demo:
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
paper page:
Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing
🔥Embodied vision-language agent that can program itself to play GTA🔥
🐙Octopus🐙 is an embodied VLM that plans intricate action sequences and generates executable code in complex environments.
- Project:
- Paper:
- Code:
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
paper page:
Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. Furthermore, when seamlessly integrated into an embodied
Our #CVPR2021 paper "Deep Animation Video Interpolation in the Wild":
Paper:
Code:
- An effective animation video interpolation framework as well as a large-scale animation triplet dataset (ATD-12K).
📢Text-to-3D Foundation Model📢
Our #3DTopia has major updates, with 1) a newly released technical report and 2) our own *refined captions* for the Objaverse quality set
- Code:
- Paper:
- Refined Objaverse:
🔥🔥MMBench: Is Your Multi-modal Model an All-around Player?
🧭MMBench🧭 is a systematically-designed benchmark for evaluating the various abilities of large multimodal models.
- Project:
- Paper:
- Code:
Thrilled to announce our 👨‍🎤Digital Life Project👩‍🎤
🔥🔥Autonomous 3D Characters with Social Intelligence
* All the stories, interactive dialogs, passive/active body motions in the demo video are generated by AI
- Project:
- Video:
🔥Large-Vocabulary 3D Diffusion Model🔥
#ICLR2024
🎯DiffTF🎯 generates massive (>200) categories of real-world 3D objects with a single feed-forward 3D diffusion model @iclr_conf
- Project:
- Paper:
- Code:
"Delving Deep into the Generalization of Vision Transformers under Distribution Shifts":
Paper:
Code:
- We investigate the OOD generalization of vision transformers.
- We integrated domain adaptation techniques into transformers.
🔥Interactive Text-to-Texture Synthesis🔥
We present #InTeX, an interactive framework for 3D text-to-texture synthesis, with *region repainting* and *real-time editing on a laptop*
- Project:
- Paper:
- Code:
#CVPR2023
Our F2-NeRF is accepted to @CVPR as a **highlight**. F2-NeRF 1) enables arbitrary input camera trajectories for novel view synthesis and 2) costs only a few minutes to train.
- More results:
- Code will be released at:
🔥Benchmarking #Sora Quantitatively🔥
We perform a *preliminary evaluation* of #Sora on our 📊VBench📊. #Sora undoubtedly outperforms all existing models, especially on the "video quality" and "dynamic" dimensions
- Code:
- Benchmark:
#CVPR2023
Our OmniObject3D is selected as an **award candidate** (top 0.1%, 12 out of 9155) @CVPR
Large-vocabulary high-quality real-scanned 3D objects for perception & generation.
- Project:
- Paper:
- Code:
🔥#PSG4D for Spatial Intelligence🔥
We introduce the 4D Panoptic Scene Graph (#PSG4D), which bridges dynamic 4D sensory data and high-level space-time understanding
* Nodes are entities, while edges are dynamic relations
- Paper:
- Code:
🔥Fine-Grained Text-to-Motion🔥
#NeurIPS2023
We present #FineMoGen, a diffusion-based and LLM-augmented framework that generates fine-grained motion from spatial-temporal prompts.
- Project:
- Paper:
- Code:
🔥Physics-based Text-to-Motion🔥
#NeurIPS2023
We present 👨‍🎤InsActor👩‍🎤, a generative framework that produces *diffusion policies* to synthesize motion for physics-based characters
- Project:
- Paper:
- Code:
InsActor: Instruction-driven Physics-based Characters
paper page:
Generating animation of physics-based characters with intuitive control has long been a desirable task with numerous applications. However, generating physically simulated animations that
#AvatarCLIP (Text2Avatar Model) generates and animates 3D avatars from natural-language descriptions of body shape, appearance and motion.
- Project Page:
- Demo Video:
- Code:
1. In 2022, text-to-image tech improved dramatically.
Heading into 2023, text-to-mesh, text-to-video, and text-to-audio models have all been demonstrated.
Today we play fortuneteller and explain how in 2023 you'll likely be able to create full 3D characters from text.
🧵
🚀🚀Fancy generating high-res (2K~4K) images using Stable Diffusion without an additional super-resolution module?
* Now, combining #FreeU with #ScaleCrafter, you can generate 4K images using SDXL for free!
- FreeU:
- ScaleCrafter:
FreeU: Free Lunch in Diffusion U-Net
paper page:
we uncover the untapped potential of diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the
🔥FreeU now has a major upgrade🔥
* By adding structure-aware scaling, FreeU excels at both structural coherence and detail preservation, greatly improving aesthetic quality over Stable Diffusion XL for free.
- Paper:
- Code:
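Because FreeU is training-free, it is a one-line switch at inference time. Below is a short sketch using the `enable_freeu` hook that `diffusers` ships for its pipelines; the scaling factors are illustrative, so check the FreeU repo for values tuned to each base model.

```python
# Sketch: FreeU is training-free, so it is a one-line switch at inference.
# `enable_freeu` is the hook diffusers ships for its pipelines; the scaling
# factors (b = backbone, s = skip) below are illustrative, not tuned values.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.3, b2=1.4)  # amplify backbone features, damp skips

image = pipe("an astronaut riding a horse, detailed, photorealistic").images[0]
image.save("freeu_sdxl.png")
```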
#SIGGRAPH2022
We present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation from natural language.
Welcome to try our @siggraph work with a user interface at:
😼Move Anything in Your Picture😼
#CVPR2024
We propose 🏞️SceneDiffusion🏞️ to freely rearrange image layouts by layered scene diffusion @CVPR
* It supports a wide range of spatial editing operations, e.g., moving, resizing and layer-wise editing
- Paper:
We propose EVA3D, an unconditional 3D human generative model learned from 2D image collections.
#EVA3D can sample 3D humans with detailed geometry and render high-quality images.
- Project:
- Code:
- Video:
🔥Segment Any Point Cloud Sequences🔥
#NeurIPS2023
We introduce 🦭Seal🦭, a novel framework that harnesses vision foundation models for segmenting diverse automotive point cloud sequences @ldkong1205 @OpenMMLab
- Project:
- Code:
✨Consistent Video-to-Video Generation✨
#CVPR2024
We present 🎞️FRESCO🎞️ with *spatial-temporal correspondence* to produce high-quality coherent videos from text prompts @CVPR
- Project:
- Paper:
- Code:
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to
We propose StyleFaceV to generate high-fidelity identity-preserving face videos with vivid movements.
- Our core insight is to decompose appearance/pose information and recompose them in StyleGAN3 to produce stable and dynamic results.
- Code:
📢 #ICLR2024 Welcome to check out our GenAI work @iclr_conf 📢
* Image Gen
- HyperHuman:
* Video Gen
- SEINE:
- FreeNoise:
* 3D Gen
- DreamGaussian:
- DiffTF:
Excited to see that our new 🦦Otter🦦 model "OTTER-Image-MPT7B" ranks 🔥top🔥 on several large multimodal model evaluation benchmarks.
- Code, demo and checkpoints:
🔥🔥We propose #VideoBooth to enable **customized video generation** with image prompts, which provide more accurate and direct content control beyond text prompts.
- Project:
- Code:
- Video:
VideoBooth: Diffusion-based Video Generation with Image Prompts
paper page:
Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with
🔥PointHPS: Cascaded 3D Human Pose and Shape Estimation from Point Clouds🔥
* We propose 🚹PointHPS🚺 for accurate 3D human pose and shape estimation from real-world point clouds
- Project:
- Paper:
- Code:
🤩Theme-Aware 3D Asset Generation🤩
#SIGGRAPH
We present ⛽️ThemeStation⛽️, which synthesizes customized 3D assets from a few exemplars that exhibit a shared theme @siggraph
- Project:
- Paper:
- Code:
ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars
Real-world applications often require a large gallery of 3D assets that share a consistent theme. While remarkable advances have been made in general 3D content creation from text or image, synthesizing
#SIGGRAPH2022
We propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation.
- It empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages.
AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
abs:
project page:
github:
TL;DR: AvatarCLIP generates and animates avatars given descriptions of body shapes, appearances and motions.
#CVPR2023
"Collaborative Diffusion for Multi-Modal Face Generation and Editing"
Diffusion models collaborate to achieve multi-modal face generation without re-training @CVPR
- Project:
- Paper:
- Code:
⚡️Dynamic 4D Human Rendering⚡️
#CVPR2024
Our 🏄‍♂️SurMo🏄‍♂️ learns dynamic 4D human rendering from videos by surface-based modeling of temporal dynamics and human appearances @CVPR
- Project:
- Paper:
- Code:
We contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a diverse set of subjects, actions and scenarios.
- We discover that synthetic data provides critical complements to the real data.
- Data and models:
✨3D Human Diffusion Model✨
We present #StructLDM, a latent diffusion model with a high-dimensional structural latent space for 3D human generation
- Project:
- Paper:
- Code:
- Video:
Hate to say it, but AI girlfriends are definitely gonna be a thing.
StructLDM for instance lets you generate compositional and animatable humans by blending different body parts, identity swapping, local clothing editing, 3D virtual try-on, etc.
🔥Our multi-modal Otter has evolved🔥
#Otter now supports the newly released #Llama2
* We successfully trained a Flamingo-Llama2-Chat7B on CC3M in 5 hours using just 4 A100s
* The model showed promising zero-shot captioning skills
- Code and models:
🔥Video Generation with Image Prompts🔥
#CVPR2024
We propose *video generation with image prompts*, 📽️VideoBooth📽️, providing more direct content control beyond text prompts @CVPR
- Project:
- Paper:
- Code:
🔥Evaluating 3D Generation with GPT-4V🔥
With carefully designed instructions, GPT-4V serves as an automatic 3D generation evaluator that *strongly aligns with human preference*
- Project:
- Paper:
- Code:
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
paper page:
Despite recent advances in text-to-3D generative methods, there is a notable absence of reliable evaluation metrics. Existing metrics usually focus on a single criterion
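The general recipe can be hedged into a toy sketch: show GPT-4V two renders of the same prompt and ask for a pairwise preference via the OpenAI API. The paper's instructions and criteria are far more carefully designed than this; the model name, file names, and one-line prompt are placeholders.

```python
# Hedged sketch of the recipe: show GPT-4V two renders of the same prompt and
# ask for a pairwise preference. The paper's instructions/criteria are far more
# carefully designed; model name, file names and prompt here are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Text prompt: 'a wooden chair'. Which 3D "
                                     "render matches it better, A or B? Reply with one letter."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64('render_a.png')}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64('render_b.png')}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # aggregate many such votes into a ranking
```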
🔥Large Multi-View Gaussian Model (LGM)🔥
We introduce #LGM, a feed-forward foundation model for text-to-3D and image-to-3D, which generates high-res 3D content in 5s
- Project:
- Code:
- Demo @huggingface :
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
paper page:
3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds,
#NeurIPS2023
We propose 🔥PrimDiffusion🔥, a volumetric primitives diffusion model for 3D human generation, enabling explicit pose, view, and shape control with off-body topology.
- Project:
- Code:
- Video:
🔥Text-to-3D Foundation Model🔥
We present #3DTopia, a two-stage text-to-3D foundation model. The first stage quickly generates 3D candidates; the second stage refines the chosen 3D asset to high quality.
- Code:
- Demo @Gradio :
We propose **DeepFake-Adapter**, which effectively adapts a pre-trained ViT by letting high-level semantics from the ViT organically interact with global and local low-level forgery cues from the adapters.
- Paper:
- Code:
We are organizing the #OmniObject3D challenge @ICCVConference with two competition tracks:
1) Track 1: sparse-view 3D reconstruction
2) Track 2: 3D object generation
- Challenge period: Aug 1 - Sep 15, 2023
- Homepage:
- CodaLab:
#ICCV2023
We present #Text2Performer to generate high-resolution, vivid human videos with articulated motions from text prompts @ICCVConference.
- Project:
- Paper:
- Code:
- Demo:
🔥OmniObject3D Update🔥
We released the fine-grained textual descriptions for #OmniObject3D, which are manually annotated from 5 aspects: *summary*, *appearance*, *material*, *style* and *function*.
- Project:
- Code and data:
#ICCV2023
We present 🔥StyleGANEX🔥 @ICCVConference , a next-generation StyleGAN architecture that can render unaligned images/videos for in-the-wild editing, SR and stylization.
- Project:
- Paper:
- Code:
#ICCV2023
We propose 🔥SHERF🔥 @ICCVConference , the first *generalizable* Human NeRF model for recovering an *animatable* 3D human from a single image.
- Project:
- Paper:
- Code:
- Demo:
🔥Long Context from Language to Vision🔥
#LongVA can process 2000 frames or over 200K visual tokens, with SoTA performance on Video-MME among 7B models
- Paper:
- Code:
- Demo @Gradio : . Thanks to @_akhaliq !
🤩Long Video Assistant (LongVA): Breakthrough in long 🎥video understanding!
- Transfers long context capability from language to vision 🧠
- Only open-source model supporting 384 input frames🤩
- Handles 2000+ frames (200K+ visual tokens) 🤯
- SoTA on Video-MME among 7B models
-
🔥One-Stop Evaluation Suite of Large Multimodal Models (LMM)🔥
We present 📊lmms-eval📊, a one-command evaluation API for thorough evaluation of LMMs over 40 datasets.
- Code:
- Blog:
- Datasets @huggingface :
Accelerating the Development of Large Multimodal Models with LMMs-Eval
Repo:
Blog:
We are offering a one-command evaluation API for fast and thorough evaluation of LMMs over 39 datasets (and growing).
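A minimal sketch of the advertised one-command usage, launched from Python via `subprocess`; the flag names and the model/task identifiers mirror the project's README at the time, but treat them as assumptions to verify against the repo.

```python
# Illustrative "one command" lmms-eval run, launched from Python. Flag names
# and the model/task identifiers are assumptions based on the project's README;
# verify them against the repo before use.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",
        "--tasks", "mme,mmbench_en",
        "--batch_size", "1",
        "--output_path", "./logs/",
    ],
    check=True,
)
```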
#ICCV2023
We present 🔥SparseNeRF🔥 @ICCVConference , which synthesizes novel views given a few images.
SparseNeRF distills a local depth ranking prior from real-world depth observations.
- Project:
- Paper:
- Code: