TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from high computational overhead in processing visual tokens, and existing token compression methods either require costly retraining or incur severe performance degradation after compression. Method: This paper proposes a training-free, plug-and-play two-stage visual token compression framework. It first identifies a strong correlation between the information decay rate in attention output matrices and downstream performance drop, then introduces an information-preservation-driven paradigm for token selection and merging—comprising Information-Preservation-Guided Selection (IPGS) and dynamic pruning-and-merging. Contribution/Results: Evaluated across 11 benchmarks and two mainstream MLLMs, the method compresses visual tokens to 22.2% of the original count, achieves 1.23× inference speedup, reduces KV cache usage by 64%, and incurs only a 1.54% accuracy drop—substantially outperforming both training-based and training-free compression baselines.

📝 Abstract
Multimodal Large Language Models (MLLMs) are becoming increasingly popular, while the high computational cost associated with multimodal data input, particularly from visual tokens, poses a significant challenge. Existing training-based token compression methods improve inference efficiency but require costly retraining, while training-free methods struggle to maintain performance when aggressively reducing token counts. In this study, we reveal that the performance degradation of MLLMs closely correlates with the accelerated loss of information in the attention output matrix. This insight introduces a novel information-preserving perspective, making it possible to maintain performance even under extreme token compression. Based on this finding, we propose TokenCarve, a training-free, plug-and-play, two-stage token compression framework. The first stage employs an Information-Preservation-Guided Selection (IPGS) strategy to prune low-information tokens, while the second stage further leverages IPGS to guide token merging, minimizing information loss. Extensive experiments on 11 datasets and 2 model variants demonstrate the effectiveness of TokenCarve. It can even reduce the number of visual tokens to 22.2% of the original count, achieving a 1.23x speedup in inference, a 64% reduction in KV cache storage, and only a 1.54% drop in accuracy. Our code is available at https://github.com/ShawnTan86/TokenCarve.
Problem

Research questions and friction points this paper is trying to address.

High computational cost of processing visual tokens in MLLMs
Severe performance degradation of existing training-free methods under aggressive token compression
Costly retraining required by existing training-based compression methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free token compression framework
Information-Preservation-Guided Selection strategy
Two-stage token pruning and merging
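The two-stage idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `carve_tokens` function and its L2-norm information score are assumptions standing in for the paper's IPGS measure over the attention output matrix; stage one prunes low-scoring tokens, and stage two merges each pruned token into its most similar kept token instead of discarding it outright.

```python
import numpy as np

def carve_tokens(tokens, keep_ratio=0.222):
    """Hypothetical two-stage prune-then-merge over visual token embeddings.

    tokens: (N, D) array of visual token embeddings.
    keep_ratio: fraction of tokens retained (0.222 mirrors the paper's
    reported 22.2% compression setting).
    """
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))

    # Stage 1: prune low-information tokens. The L2 norm is a simple
    # stand-in score; TokenCarve's IPGS derives its score from the
    # attention output matrix instead.
    scores = np.linalg.norm(tokens, axis=1)
    order = np.argsort(-scores)
    kept_idx, pruned_idx = order[:n_keep], order[n_keep:]
    kept = tokens[kept_idx].copy()

    if pruned_idx.size:
        # Stage 2: merge each pruned token into its most cosine-similar
        # kept token via a running average, so its information is folded
        # in rather than lost.
        p = tokens[pruned_idx]
        p_unit = p / np.linalg.norm(p, axis=1, keepdims=True)
        k_unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
        targets = (p_unit @ k_unit.T).argmax(axis=1)
        counts = np.ones(n_keep)
        for vec, t in zip(p, targets):
            kept[t] = (kept[t] * counts[t] + vec) / (counts[t] + 1)
            counts[t] += 1

    return kept, np.sort(kept_idx)
```

For a typical 576-token visual input, this keeps 127 tokens; the merge step is what distinguishes the approach from plain pruning, since dropped tokens still contribute to the retained representations.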