VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the high computational cost of multimodal large language models (MLLMs) when processing high-resolution images and videos, which stems from the excessive number of visual tokens. Existing compression methods often neglect alignment with textual context, leading to performance degradation. To overcome this, the authors propose VisionTrim, a training-free, unified framework for visual token compression that achieves efficient acceleration through two plug-and-play modules: Dominant Vision Token Selection (DVTS), which preserves essential global and local information, and Text-Guided Vision Complement (TGVC), which, for the first time, incorporates textual context into the compression process to enable cross-modal alignment. Extensive experiments demonstrate that VisionTrim significantly outperforms existing approaches across multiple image and video benchmarks, substantially reducing computational overhead while maintaining or even improving model performance, thereby facilitating practical MLLM deployment.

πŸ“ Abstract
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
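To make the two-stage idea concrete (select dominant visual tokens, then fold the pruned ones back in under text guidance), here is a minimal, hedged sketch. The function name, the cosine-similarity scoring rule, and the average-pooling merge are illustrative assumptions, not the paper's actual DVTS/TGVC algorithms.

```python
import numpy as np

def compress_vision_tokens(vision_tokens, text_tokens, keep_ratio=0.25):
    """Illustrative text-guided token compression (assumed logic, not
    the paper's method): keep the visual tokens most similar to the
    mean text embedding, then merge each dropped token into its most
    similar kept token by average pooling."""
    n = vision_tokens.shape[0]
    k = max(1, int(n * keep_ratio))

    # Score each visual token by cosine similarity to the mean text embedding.
    text_query = text_tokens.mean(axis=0)
    v_norm = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    q_norm = text_query / np.linalg.norm(text_query)
    scores = v_norm @ q_norm

    keep_idx = np.argsort(scores)[-k:]                 # "dominant" tokens
    drop_idx = np.setdiff1d(np.arange(n), keep_idx)    # tokens to merge away

    kept = vision_tokens[keep_idx].copy()
    counts = np.ones(k)

    # Merge each dropped token into its most similar kept token.
    sim = v_norm[drop_idx] @ v_norm[keep_idx].T
    for row, d in zip(sim, drop_idx):
        j = int(row.argmax())
        kept[j] += vision_tokens[d]
        counts[j] += 1
    return kept / counts[:, None]
```

With `keep_ratio=0.25`, a sequence of 16 visual tokens is reduced to 4 merged tokens, so downstream attention cost drops accordingly; no information is discarded outright, since every pruned token contributes to its nearest kept token.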
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Vision Token Compression
Computational Cost
Text-Visual Alignment
Training-Free Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Token Compression
Training-Free Acceleration
Multimodal LLM
Text-Guided Merging
Plug-and-Play Framework