🤖 AI Summary
This work addresses the high computational cost and deployment burden of video large language models (VideoLLMs) stemming from their massive input token counts. The authors propose the first hierarchical pruning framework tailored to VideoLLMs, leveraging the spatio-temporal structure of video, which comprises segments and frames, and the unidirectional multimodal information flow within the model. The framework dynamically compresses visual redundancy at three levels: segment-level temporal-spatial merging, frame-level diversity-preserving pruning, and layer-wise progressive redundancy reduction. Remarkably, while retaining only 30% of the original tokens, the method achieves new state-of-the-art results on four mainstream video understanding benchmarks, maintaining over 98% of LLaVA-Video-7B’s performance and over 99% of LLaVA-OneVision-7B’s.
📝 Abstract
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Motivated by two observations, namely that videos possess an inherent segment-frame structure and that LLMs propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is progressively reduced in deeper LLM layers without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
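To make the three-level decomposition concrete, the sketch below illustrates the overall control flow on a dummy token array. It is a toy illustration only, not the paper's actual algorithm: the function name, the norm-based importance proxy, the mean-similarity diversity heuristic, and all ratios are assumptions for exposition.

```python
import numpy as np

def hierarchical_prune(tokens, num_segments=4, frame_keep=0.5,
                       layer_keep=0.9, num_layers=3):
    """Toy three-level pruning sketch (illustrative, not HieraVid itself).

    tokens: array of shape (num_frames, tokens_per_frame, dim),
            tokens_per_frame assumed even for the 2x spatial merge.
    """
    # 1) Segment-level: split frames into contiguous temporal segments,
    #    then spatially merge each frame's tokens via 2x average pooling.
    segments = np.array_split(tokens, num_segments, axis=0)
    merged = [seg.reshape(seg.shape[0], -1, 2, seg.shape[2]).mean(axis=2)
              for seg in segments]

    # 2) Frame-level: within each segment, keep the frames least similar
    #    to the segment mean, a crude proxy for preserving diversity.
    kept = []
    for seg in merged:
        mean = seg.mean(axis=(0, 1))
        sim = np.array([np.dot(f.mean(axis=0), mean) for f in seg])
        n_keep = max(1, int(len(seg) * frame_keep))
        idx = np.sort(np.argsort(sim)[:n_keep])
        kept.append(seg[idx])
    x = np.concatenate(kept, axis=0).reshape(-1, tokens.shape[2])

    # 3) Layer-level: progressively shrink the token count across layers,
    #    here using token norm as a stand-in importance score.
    for _ in range(num_layers):
        n_keep = max(1, int(len(x) * layer_keep))
        norms = np.linalg.norm(x, axis=1)
        x = x[np.sort(np.argsort(-norms)[:n_keep])]
    return x
```

With 8 frames of 16 tokens each, the sketch first halves the spatial tokens, then halves the frames per segment, then trims a further 10% per "layer", ending well below 30% of the original token count; the real method replaces these heuristics with the learned, attention-aware criteria described in the paper.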