LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

📅 2025-01-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational overhead and memory bottlenecks in large vision-language models (VLMs) caused by redundant visual tokens, this work proposes LLaVA-Mini, an efficient multimodal architecture that achieves image and video understanding using only a single vision token. The core innovation is "modality pre-fusion": a token-importance analysis shows that visual information predominantly influences the early layers of the LLM, so visual features are fused into the text token sequence *before* the LLM's first layer, enabling extreme compression of the vision tokens fed to the LLM backbone down to one. The model pairs a vision encoder and a pre-fusion module with a standard LLM backbone, and it matches or outperforms LLaVA-v1.5 across 11 image and 7 video benchmarks. Computationally, it reduces FLOPs by 77%, achieves end-to-end response latency under 40 ms, and can process over 10,000 video frames on a single 24-GB GPU.
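The two ideas in the summary above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the authors' implementation: the module names, dimensions, and the use of cross-attention for both steps are assumptions (the paper's actual pre-fusion and compression modules may be designed differently). It shows (1) compressing the 576 vision tokens to a single token with a learnable query, and (2) pre-fusing visual information into the text tokens before anything reaches the LLM.

```python
# Illustrative sketch of LLaVA-Mini's two key ideas (NOT the authors' code):
# (1) query-based compression of 576 vision tokens down to 1,
# (2) modality pre-fusion: text tokens attend to vision features before the LLM.
# All names and dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class PreFusionAndCompression(nn.Module):
    def __init__(self, dim=1024, n_heads=8, n_compressed=1):
        super().__init__()
        # learnable query that pools all vision tokens into n_compressed tokens
        self.query = nn.Parameter(torch.randn(n_compressed, dim))
        self.compress_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # pre-fusion stand-in: text tokens cross-attend to the vision tokens
        self.fusion_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        batch = vision_tokens.size(0)
        # compress e.g. 576 vision tokens -> 1 via learnable-query attention
        q = self.query.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.compress_attn(q, vision_tokens, vision_tokens)
        # fuse visual information into the text tokens ahead of the LLM
        fused_text, _ = self.fusion_attn(text_tokens, vision_tokens, vision_tokens)
        # LLM input: 1 vision token + pre-fused text tokens (instead of 576 + text)
        return torch.cat([compressed, fused_text], dim=1)

module = PreFusionAndCompression()
v = torch.randn(2, 576, 1024)   # 576 vision tokens per image (CLIP ViT-L/14 @ 336px)
t = torch.randn(2, 32, 1024)    # 32 text tokens
out = module(v, t)
print(out.shape)  # torch.Size([2, 33, 1024]) -- 1 vision token + 32 text tokens
```

The efficiency gain comes from the LLM's context shrinking from 576 + T tokens to 1 + T; the pre-fusion module pays that cost once, in a module far smaller than the LLM backbone.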

📝 Abstract
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them, along with textual instructions, into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs have largely focused on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of the LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of the vision tokens fed to the LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on GPU hardware with 24 GB of memory.
Problem

Research questions and friction points this paper is trying to address.

Efficient Model
Visual-Token Fusion
Resource Consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Tokenization
Early Fusion of Modalities
Reduced Computational Demand
Shaolei Zhang
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Natural Language Processing; Large Language Model; Multimodal LLMs; Simultaneous Translation
Qingkai Fang
Institute of Computing Technology, Chinese Academy of Sciences
Large Language Models; Speech Language Models; Multimodal LLMs; Speech Translation
Zhe Yang
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China