VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead that redundant visual tokens impose on large multimodal models (LMMs), this paper proposes VFlowOpt, a visual-information-flow-guided token pruning framework. Unlike existing methods that rely solely on attention scores, it combines attention-derived context relevance with patch-level image entropy to construct a fine-grained importance map, and pairs a progressive pruning module with a token recycling mechanism that aggregates pruned tokens to limit information loss. The pruning strategy's hyperparameters are then optimized by a flow-guided method that treats the LMM's last token as the most representative signal of text-visual interaction and minimizes the discrepancy between token representations with and without pruning. The result is near-lossless accuracy while retaining only 10% of visual tokens, an 89% reduction in KV-Cache memory, and 3.8× faster inference. This work advances efficient LMM inference through principled visual token compression grounded in both contextual semantics and local information density.

📝 Abstract
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
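The importance map and recycling steps described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the product-based combination of attention relevance and entropy, and the importance-weighted aggregation of pruned tokens are all illustrative assumptions.

```python
import numpy as np

def patch_entropy(patch, bins=16):
    # Shannon entropy of pixel intensities within one image patch
    # (a simple proxy for local information density).
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def prune_with_recycling(tokens, attn_relevance, entropies, keep_ratio=0.1):
    # Importance map: attention-derived context relevance weighted by
    # patch-level entropy (combined here by a simple product, an assumption).
    importance = attn_relevance * entropies
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-importance)          # most important first
    keep_idx = np.sort(order[:n_keep])       # preserve original token order
    prune_idx = np.sort(order[n_keep:])
    kept = tokens[keep_idx]
    # Recycling: aggregate pruned tokens into one extra token, weighted by
    # importance, instead of discarding their information outright.
    if len(prune_idx) > 0:
        w = importance[prune_idx]
        total = w.sum()
        w = w / total if total > 0 else np.full(len(prune_idx), 1.0 / len(prune_idx))
        recycled = (tokens[prune_idx] * w[:, None]).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, recycled], axis=0)
    return kept
```

With `keep_ratio=0.1` on 100 tokens, the output is 10 retained tokens plus one recycled token, mirroring the paper's setting of pruning 90% of visual tokens.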
Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in LMMs by pruning redundant visual tokens
Improves token pruning strategies to minimize performance degradation
Optimizes pruning hyperparameters using visual information flow guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Importance map derivation from attention-based context relevance and patch-level entropy
Progressive pruning module with a token recycling mechanism
Visual information flow-guided optimization of pruning hyperparameters
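The flow-guided optimization above can be sketched as a search over pruning hyperparameters that keeps the last token's representation (the paper's proxy for text-visual interaction) close to the unpruned model's. The grid search, function names, and L2 distance here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def flow_guided_search(last_token_full, last_token_pruned_fn, keep_ratios):
    # Pick the keep ratio whose pruned-model last-token representation
    # deviates least (L2 distance, an assumption) from the unpruned one.
    best_ratio, best_gap = None, float("inf")
    for r in keep_ratios:
        gap = float(np.linalg.norm(last_token_full - last_token_pruned_fn(r)))
        if gap < best_gap:
            best_ratio, best_gap = r, gap
    return best_ratio, best_gap
```

In practice `last_token_pruned_fn` would run the LMM under a candidate pruning strategy on calibration inputs; the same criterion lets the search be tailored per model, which is how the framework adapts its strategy to different LMMs.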