AI Summary
This work addresses the high computational cost of multimodal large language models (MLLMs) when processing high-resolution images and videos, which stems from the excessive number of visual tokens. Existing compression methods often neglect alignment with textual context, leading to performance degradation. To overcome this, we propose VisionTrim, a training-free, unified framework for visual token compression that achieves efficient acceleration through two plug-and-play modules: Dominant Vision Token Selection (DVTS), which preserves essential global and local visual information, and Text-Guided Vision Complement (TGVC), which, for the first time, incorporates textual context into the compression process to enable cross-modal alignment. Extensive experiments demonstrate that VisionTrim significantly outperforms existing approaches across multiple image and video benchmarks, substantially reducing computational overhead while maintaining or even improving model performance, thereby facilitating the practical deployment of MLLMs.
Abstract
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
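To make the two-stage idea concrete, below is a minimal, hypothetical sketch of this style of training-free token compression. It is not the authors' implementation (see the linked repository for that): the saliency scores, `keep_ratio`, and the merge rule are illustrative assumptions. A DVTS-like step keeps the most salient visual tokens, and a TGVC-like step merges each pruned token into its nearest kept token, weighted by that token's relevance to the text context.

```python
import numpy as np

def visiontrim_sketch(vision_tokens, text_tokens, keep_ratio=0.25):
    """Hypothetical sketch of VisionTrim-style visual token compression.

    DVTS-like step: score each visual token by its mean cosine similarity
    to all visual tokens (a crude global-saliency proxy) and keep the top-k.
    TGVC-like step: merge each pruned token into its most similar kept
    token, scaled by the pruned token's relevance to the text context.
    """
    n, d = vision_tokens.shape
    k = max(1, int(n * keep_ratio))

    # Unit-normalize so dot products are cosine similarities.
    v = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)

    # DVTS-like selection: global saliency = mean similarity to all tokens.
    saliency = (v @ v.T).mean(axis=1)
    keep_idx = np.argsort(saliency)[-k:]
    prune_idx = np.setdiff1d(np.arange(n), keep_idx)

    kept = vision_tokens[keep_idx].copy()

    # TGVC-like merging: route each pruned token to its nearest kept token,
    # weighted by its mean similarity to the text tokens (text guidance).
    if prune_idx.size:
        text_rel = (v[prune_idx] @ t.T).mean(axis=1)               # (n_pruned,)
        nearest = np.argmax(v[prune_idx] @ v[keep_idx].T, axis=1)  # (n_pruned,)
        for i, j in enumerate(nearest):
            kept[j] += text_rel[i] * vision_tokens[prune_idx[i]]

    return kept  # (k, d) compressed visual token sequence
```

For example, compressing 576 patch tokens of dimension 32 with `keep_ratio=0.25` yields 144 tokens, a 4x reduction in the visual sequence fed to the LLM.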