TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

📅 2025-07-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive inference overhead that redundant visual tokens impose on Large Vision-Language Models (LVLMs), this paper proposes a training-free, efficient token pruning method. The core innovation is Token Transition Variation (TTV), a novel importance metric defined by the changes in magnitude and direction of token representations across transformer layers, a first-of-its-kind characterization. To mitigate the positional bias inherent in standard attention mechanisms, the method further incorporates Instruction-Guided Attention (IGA). TTV can be applied independently or combined with IGA for dynamic, instruction-aware pruning. Evaluated on LLaVA-v1.5 and LLaVA-Next, the approach maintains original performance across eight multimodal benchmarks while reducing inference FLOPs by more than half, significantly improving inference efficiency without architectural or training modifications.
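The paper's exact TTV formula is not reproduced in this summary, but the idea (score each token by how much its hidden state changes in magnitude and direction between consecutive layers) can be sketched as follows. The function names, the `alpha` weighting, and the top-k pruning rule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def token_transition_variation(h_prev, h_next, alpha=0.5):
    """Hypothetical TTV sketch: score each token by the change in its
    representation between two consecutive transformer layers.

    h_prev, h_next: (num_tokens, hidden_dim) hidden states.
    alpha: assumed weight balancing magnitude vs. direction terms.
    Returns a (num_tokens,) array; higher = larger transition.
    """
    norm_prev = np.linalg.norm(h_prev, axis=-1)
    norm_next = np.linalg.norm(h_next, axis=-1)
    # Magnitude variation: absolute change in token norm across layers.
    mag_var = np.abs(norm_next - norm_prev)
    # Directional variation: 1 - cosine similarity across layers.
    cos_sim = np.sum(h_prev * h_next, axis=-1) / (norm_prev * norm_next + 1e-8)
    dir_var = 1.0 - cos_sim
    return alpha * mag_var + (1.0 - alpha) * dir_var

def prune_by_ttv(ttv_scores, keep_ratio=0.5):
    """Keep the indices of the highest-scoring tokens (illustrative rule)."""
    k = max(1, int(len(ttv_scores) * keep_ratio))
    return np.sort(np.argsort(ttv_scores)[-k:])
```

A token whose representation barely moves between layers gets a near-zero score and becomes a pruning candidate, while tokens whose norm or direction shifts sharply are retained.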


πŸ“ Abstract
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
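The abstract's description of IGA (how strongly instruction tokens attend to image tokens) can be sketched as a simple reduction over an attention map. The function name, the choice of layer, and averaging over heads and instruction queries are assumptions for illustration; they are not taken from the paper:

```python
import numpy as np

def instruction_guided_attention(attn, image_idx, instr_idx):
    """Hypothetical IGA sketch: score image tokens by the attention
    they receive from instruction tokens.

    attn: (num_heads, seq_len, seq_len) attention weights from one layer,
          where attn[h, q, k] is query q's attention to key k in head h.
    image_idx, instr_idx: index arrays of image / instruction positions.
    Returns (len(image_idx),) scores, averaged over heads and
    instruction queries.
    """
    # Select instruction-query rows, then image-key columns, then average.
    sub = attn[:, instr_idx, :][:, :, image_idx]  # (heads, n_instr, n_img)
    return sub.mean(axis=(0, 1))
```

Because the score aggregates over all instruction queries rather than a single position, it gives an instruction-aware ranking of image tokens that can be combined with a transition-based score such as TTV.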
Problem

Research questions and friction points this paper is trying to address.

Reduce computational costs in large vision-language models
Identify important tokens without positional bias
Improve inference efficiency via token pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Transition Variation for pruning
Instruction-Guided Attention for importance
Training-free efficient token pruning
🔎 Similar Papers
No similar papers found.