HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost and deployment challenges of video large language models (VideoLLMs) stemming from their massive input token counts. The authors propose a hierarchical pruning framework tailored for VideoLLMs, leveraging the spatio-temporal structure of video, which comprises segments and frames, and the unidirectional multimodal information flow within the model. The framework dynamically compresses visual redundancy across three levels: segment-level temporal-spatial merging, frame-level diversity-preserving pruning, and layer-wise progressive redundancy reduction. Remarkably, by retaining only 30% of the original tokens, the method achieves new state-of-the-art results on four mainstream video understanding benchmarks, maintaining over 98% of LLaVA-Video-7B's performance and over 99% of LLaVA-OneVision-7B's performance.
📝 Abstract
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, namely that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is gradually reduced at deeper LLM layers without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
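The three pruning levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the similarity-based segmentation rule, the diversity score, and the linear layer schedule below are all simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two token matrices (rows are token features).
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def segment_video(frames, threshold=0.9):
    # Segment-level: group temporally adjacent frames whose mean-pooled
    # features are similar (hypothetical thresholding rule).
    segments, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        sim = cosine_sim(prev.mean(0, keepdims=True),
                         cur.mean(0, keepdims=True))[0, 0]
        if sim >= threshold:
            current.append(cur)
        else:
            segments.append(current)
            current = [cur]
    segments.append(current)
    return segments

def prune_tokens(tokens, keep_ratio):
    # Frame-level: keep the tokens least similar to the rest, a simple
    # diversity-preserving score (the paper's exact criterion may differ).
    k = max(1, int(len(tokens) * keep_ratio))
    sim = cosine_sim(tokens, tokens)
    redundancy = (sim.sum(1) - 1.0) / max(len(tokens) - 1, 1)
    keep = np.argsort(redundancy)[:k]          # lowest redundancy first
    return tokens[np.sort(keep)]               # preserve original order

def layer_keep_schedule(n_layers, start=1.0, end=0.3):
    # Layer-level: progressively shrink the kept-token ratio with depth,
    # ending at the paper's headline 30% retention.
    return np.linspace(start, end, n_layers)
```

The intent is only to make the hierarchy concrete: coarse temporal grouping first, per-segment diversity-aware pruning second, and a depth-dependent budget inside the LLM last.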
Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models
token pruning
computational burden
video understanding
information redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical pruning
video large language models
token reduction
multi-modal information propagation
computational efficiency
Yansong Guo
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China
Chaoyang Zhu
CSE, Hong Kong University of Science and Technology
Multimodal Learning & Reasoning
Jiayi Ji
Rutgers University
Jianghang Lin
Xiamen University
Multimodal Large Language Model, Vision-Language Model, Semi/Weakly-Supervised Learning
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, Fujian, China