Learning Compact Vision Tokens for Efficient Large Multimodal Models

📅 2025-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high computational cost and quadratic complexity induced by long visual token sequences in large multimodal models (LMMs), this paper proposes a framework that combines Spatial Token Fusion (STF) and Multi-Block Token Fusion (MBTF). By retaining only 25% of the visual tokens, the method shortens the sequence while preserving multi-granularity semantic information. Built on the LLaVA-1.5 architecture, it employs learnable token fusion, lightweight fine-tuning with a frozen vision encoder, and a multi-granularity feature supplementation mechanism that compensates for the limited adaptability of a fixed encoder. Evaluated on eight mainstream vision-language benchmarks, the model matches or exceeds baseline performance while shortening the vision token sequence for faster inference. The code and models are publicly released.

📝 Abstract

Large multimodal models (LMMs) face significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten vision token sequences to accelerate inference. Specifically, we propose a Spatial Token Fusion (STF) method that learns compact vision tokens for a shorter vision token sequence, in which spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement the reduced token sequence with multi-granularity features. Overall, we combine the STF and MBTF modules to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method, based on LLaVA-1.5, achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only 25% of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
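
The idea of fusing spatially adjacent tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the function name group_adjacent_tokens and the 2 x 2 window are assumptions, chosen because merging each 2 x 2 neighborhood is one simple way to keep 25% of the tokens. The sketch only concatenates the neighbors; in a learned variant, a trainable projection would presumably map each concatenated vector back to the model width.

```python
def group_adjacent_tokens(tokens, grid_size, window=2):
    """Concatenate each window x window patch of spatially adjacent tokens.

    tokens: list of grid_size * grid_size feature vectors in row-major order.
    Returns (grid_size // window) ** 2 vectors, each window * window times longer.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(tokens) != grid_size * grid_size:
        raise ValueError("token count must equal grid_size squared")
    if grid_size % window != 0:
        raise ValueError("grid_size must be divisible by window")
    fused = []
    for top in range(0, grid_size, window):
        for left in range(0, grid_size, window):
            merged = []
            # Walk the window row by row and append each neighbor's features.
            for dr in range(window):
                for dc in range(window):
                    merged.extend(tokens[(top + dr) * grid_size + (left + dc)])
            fused.append(merged)
    return fused


# LLaVA-1.5 passes a 24 x 24 grid (576 vision tokens) to the LLM.
grid = 24
dim = 4  # toy feature size for illustration; real features are much wider
tokens = [[float(i)] * dim for i in range(grid * grid)]
fused = group_adjacent_tokens(tokens, grid)
print(len(tokens), "->", len(fused))  # 576 -> 144
```

With 144 instead of 576 vision tokens, the number of pairwise attention interactions among vision tokens alone drops by a factor of (576 / 144)^2 = 16; the end-to-end saving also depends on the text tokens and the rest of the model.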

Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of large multimodal models

Shortens vision token sequences for faster inference

Maintains performance while using fewer vision tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Token Fusion reduces token length

Multi-Block Token Fusion adds multi-granularity features

Combines STF and MBTF for efficient multimodal reasoning
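
One way to picture how features from several encoder blocks could supplement a token sequence is sketched below. This is an assumption-laden illustration, not the MBTF module itself: the helper concat_block_features, the choice of three blocks, and the use of plain concatenation are all hypothetical, and the page does not say how the paper actually combines the block outputs.

```python
def concat_block_features(block_features):
    """Concatenate, token by token, the features taken from several encoder blocks.

    block_features: list of per-block outputs, each a list of token vectors
    with the same number of tokens. Hypothetical helper for illustration.
    """
    num_tokens = len(block_features[0])
    if any(len(block) != num_tokens for block in block_features):
        raise ValueError("every block must provide the same number of tokens")
    combined = []
    for t in range(num_tokens):
        merged = []
        # Append this token's features from each block, shallow to deep.
        for block in block_features:
            merged.extend(block[t])
        combined.append(merged)
    return combined


# Toy outputs of three encoder blocks for a two-token sequence.
shallow = [[0.1, 0.2], [0.3, 0.4]]
middle = [[1.1, 1.2], [1.3, 1.4]]
deep = [[2.1, 2.2], [2.3, 2.4]]
combined = concat_block_features([shallow, middle, deep])
print(len(combined), len(combined[0]))  # 2 6
```

The token count is unchanged while each token carries features from several depths; a learned layer would then be needed to bring the widened vectors back to the width the language model expects.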

🔎 Similar Papers

Towards Semantic Equivalence of Tokenization in Multimodal LLM