AI Summary
This work addresses the high computational cost of multimodal large language models (MLLMs) when processing high-resolution images and videos, which stems from the excessive number of visual tokens. Existing compression methods often neglect alignment with textual context, leading to performance degradation. To overcome this, we propose VisionTrim, a training-free, unified framework for visual token compression that achieves efficient acceleration through two plug-and-play modules: Dominant Vision Token Selection (DVTS), which preserves essential global and local visual information, and Text-Guided Vision Complement (TGVC), which, for the first time, incorporates textual context into the compression process to enable cross-modal alignment. Extensive experiments demonstrate that VisionTrim significantly outperforms existing approaches across multiple image and video benchmarks, substantially reducing computational overhead while maintaining or even improving model performance, thereby facilitating the practical deployment of MLLMs.
Abstract
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
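To make the two-stage idea concrete, below is a minimal, hypothetical sketch of this style of training-free token compression. It is not the authors' implementation (see the linked repository for that): the saliency scores, `keep_ratio`, and the merge rule are illustrative assumptions. A DVTS-like step keeps the most salient visual tokens, and a TGVC-like step merges each pruned token into its nearest kept token, weighted by that token's relevance to the text context.

```python
import numpy as np

def visiontrim_sketch(vision_tokens, text_tokens, keep_ratio=0.25):
    """Hypothetical sketch of VisionTrim-style visual token compression.

    DVTS-like step: score each visual token by its mean cosine similarity
    to all visual tokens (a crude global-saliency proxy) and keep the top-k.
    TGVC-like step: merge each pruned token into its most similar kept
    token, scaled by the pruned token's relevance to the text context.
    """
    n, d = vision_tokens.shape
    k = max(1, int(n * keep_ratio))

    # Unit-normalize so dot products are cosine similarities.
    v = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)

    # DVTS-like selection: global saliency = mean similarity to all tokens.
    saliency = (v @ v.T).mean(axis=1)
    keep_idx = np.argsort(saliency)[-k:]
    prune_idx = np.setdiff1d(np.arange(n), keep_idx)

    kept = vision_tokens[keep_idx].copy()

    # TGVC-like merging: route each pruned token to its nearest kept token,
    # weighted by its mean similarity to the text tokens (text guidance).
    if prune_idx.size:
        text_rel = (v[prune_idx] @ t.T).mean(axis=1)               # (n_pruned,)
        nearest = np.argmax(v[prune_idx] @ v[keep_idx].T, axis=1)  # (n_pruned,)
        for i, j in enumerate(nearest):
            kept[j] += text_rel[i] * vision_tokens[prune_idx[i]]

    return kept  # (k, d) compressed visual token sequence
```

For example, compressing 576 patch tokens of dimension 32 with `keep_ratio=0.25` yields 144 tokens, a 4x reduction in the visual sequence fed to the LLM.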