PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

📅 2025-02-20
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
To address the low decoding efficiency and high memory overhead that redundant visual tokens cause in Large Vision-Language Models (LVLMs), this paper proposes a hierarchical, head-wise fine-grained visual token pruning method. The approach introduces inter-layer dynamic retention-rate allocation and head-level independent pruning, leveraging the re-attention phenomenon of vision tokens in self-attention to remove tokens in a differentiated way, and further integrates KV cache compression to improve inference efficiency. Evaluated on standard multimodal benchmarks, the method maintains strong multimodal understanding while achieving an 18% decoding speedup and reducing the KV cache size by over 50%, at the cost of only a 0.46% drop in average accuracy; it even yields slight improvements on multi-image tasks. These results demonstrate substantial gains in both inference efficiency and deployment feasibility for LVLMs.
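The summary above describes the method without code. As a rough illustration of the layer-level retention-rate idea, the sketch below maps the share of attention a decoder layer places on its vision tokens to a retention rate, so that vision-attentive layers keep more tokens. The function name `layer_retention_rate`, the `r_min`/`r_max` bounds, and the linear mapping are assumptions for illustration, not the paper's actual allocation rule.

```python
import torch

def layer_retention_rate(attn: torch.Tensor,
                         vision_slice: slice,
                         r_min: float = 0.1,
                         r_max: float = 0.9) -> float:
    """attn: [heads, query_len, key_len] attention weights of one decoder layer.

    Returns a retention rate in [r_min, r_max] that grows with the share of
    attention mass the layer places on its vision-token keys.
    """
    # Fraction of each query's attention that lands on the vision tokens,
    # averaged over heads and query positions.
    vision_mass = attn[..., vision_slice].sum(dim=-1).mean().item()
    # Hypothetical linear mapping; the paper's actual allocation rule may differ.
    return r_min + (r_max - r_min) * vision_mass

# Toy usage: 8 heads, 4 query positions, 16 keys, of which keys 2..9 are vision tokens.
attn = torch.softmax(torch.randn(8, 4, 16), dim=-1)
print(layer_retention_rate(attn, slice(2, 10)))
```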

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method comprising Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer: layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are pruned more aggressively. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of a 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs. Our source code will be made publicly available.
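As a complementary illustration of the head-level component described in the abstract, the hedged sketch below lets each attention head in a layer keep its own top-k vision tokens and drop the rest of that head's vision entries from the KV cache. The function name `prune_vision_kv_per_head` and the mean-attention scoring rule are assumptions for illustration, not the authors' released implementation.

```python
import torch

def prune_vision_kv_per_head(keys, values, attn, vision_idx, retention_rate):
    """keys/values: [heads, seq_len, head_dim]; attn: [heads, query_len, seq_len];
    vision_idx: 1-D LongTensor of vision-token positions in the cache.

    Each head keeps the same number of vision tokens (set by retention_rate) plus
    all non-vision tokens, but independently chooses which vision tokens to retain.
    """
    num_heads, seq_len, _ = keys.shape
    k_keep = max(1, int(retention_rate * vision_idx.numel()))
    pruned_keys, pruned_values = [], []
    for h in range(num_heads):
        # Score each vision token by the attention this head assigns to it,
        # averaged over the query positions.
        scores = attn[h][:, vision_idx].mean(dim=0)
        keep_vision = vision_idx[scores.topk(k_keep).indices]
        # Keep every non-vision token plus this head's top-k vision tokens.
        mask = torch.ones(seq_len, dtype=torch.bool)
        mask[vision_idx] = False
        mask[keep_vision] = True
        pruned_keys.append(keys[h][mask])
        pruned_values.append(values[h][mask])
    return pruned_keys, pruned_values

# Toy usage: 4 heads, 12 cached tokens (positions 3..9 are vision tokens), head dim 8.
k, v = torch.randn(4, 12, 8), torch.randn(4, 12, 8)
attn = torch.softmax(torch.randn(4, 5, 12), dim=-1)
pk, pv = prune_vision_kv_per_head(k, v, attn, torch.arange(3, 10), retention_rate=0.5)
print([t.shape for t in pk])  # each head keeps 5 non-vision + 3 vision tokens
```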
Problem

Research questions and friction points that this paper addresses.

Enhance LVLM inference efficiency
Reduce the number of visual tokens processed during decoding
Design a fine-grained vision token pruning method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token retention rates
Head-level independent pruning
Layer-level rate allocation
👥 Authors
Yu Meng
Shenzhen International Graduate School, Tsinghua University
Kaiyuan Li
Beijing University of Posts and Telecommunications
Sequential Recommendation, Large Recommendation Model, Computational Advertising
Chenran Huang
Shenzhen International Graduate School, Tsinghua University; Tongji University
Chen Gao
Tsinghua University
Xinlei Chen
Shenzhen International Graduate School, Tsinghua University
Yong Li
Tsinghua University
Xiaoping Zhang
China National Bamboo Research Center
Bamboo, Soil Ecology, Metagenomics