ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video large language models (VLLMs) suffer from high computational overhead during the prefilling stage due to the massive number of visual tokens; existing attention pruning methods, when applied to shallow decoder layers, incur substantial performance degradation—especially under high compression ratios. Method: We propose the first fine-tuning-free shallow-layer attention pruning framework for VLLMs, incorporating three key mechanisms: (i) segment-aware causal masking to preserve temporal coherence, (ii) position debiasing correction to mitigate positional encoding distortion, and (iii) redundant token elimination to suppress visually similar tokens. Our method dynamically selects critical tokens based on attention scores while preserving original model weights. Contribution/Results: Extensive experiments across multiple video understanding benchmarks demonstrate that our approach achieves up to 3.2× inference speedup with negligible accuracy loss—matching full-attention baselines—thereby breaking the long-standing accuracy-efficiency trade-off barrier in shallow-layer attention pruning.

📝 Abstract
Video Large Language Models (VLLMs) face high computational load during the prefilling stage due to the enormous number of visual tokens they must process. Although attention-based pruning methods are widely used to accelerate inference, applying them at early decoder layers often results in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational load in video large language models
Improves pruning effectiveness in shallow decoder layers
Maintains performance under high compression rates without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-aware causal masking for token selection
Positional debiasing to reduce encoding bias
Token deduplication for enhanced pruning efficiency
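The core selection loop described above — rank visual tokens by the attention they receive, keep the top fraction, and eliminate near-duplicate tokens along the way — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the cosine-similarity redundancy criterion, and the default thresholds are all assumptions.

```python
import math

def prune_visual_tokens(features, attn_scores, keep_ratio=0.25, sim_threshold=0.9):
    """Illustrative attention-based token pruning with redundant-token elimination.

    features:    list of D-dim embedding vectors, one per visual token
    attn_scores: attention mass each visual token receives (e.g. from text queries)
    Returns indices of kept tokens, restored to original (temporal) order.
    """
    n_keep = max(1, int(len(attn_scores) * keep_ratio))

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1e-8
        nv = math.sqrt(sum(b * b for b in v)) or 1e-8
        return dot / (nu * nv)

    # Rank tokens by attention score, most-attended first.
    order = sorted(range(len(attn_scores)), key=lambda i: -attn_scores[i])

    kept = []
    for idx in order:
        if len(kept) == n_keep:
            break
        # Redundant-token elimination: skip tokens nearly identical to a kept one.
        if any(cosine(features[idx], features[k]) > sim_threshold for k in kept):
            continue
        kept.append(idx)
    return sorted(kept)  # preserve temporal order for the remaining sequence
```

In the actual method, the attention scores would come from a shallow decoder layer after the segment-aware masking and position-debiasing corrections; here they are simply taken as given.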
Yingjie Xia
Zhejiang University
computer sciences
Tao Liu
VCIP & TMCC & DISSec, College of Computer Science, Nankai University
Jinglei Shi
Nankai University
deep learning, 3D vision, light field, video processing, compression
Qingsong Xie
OPPO Research Institute
Heng Guo
Beijing University of Posts and Telecommunications
Jian Yang
VCIP & TMCC & DISSec, College of Computer Science, Nankai University
Xi Wang
LIX, Ecole Polytechnique, IP Paris