AI Summary
Video Large Language Models (Video LLMs) suffer from prohibitively high inference costs due to redundant video tokens, and existing token pruning methods fail to fully capture spatiotemporal redundancy. Method: We propose FastVID, a Dynamic Density Pruning framework and the first to integrate temporal segmentation with density-aware visual token selection, compressing tokens while preserving spatiotemporal structure. FastVID jointly models temporal and visual contextual redundancy, enabling seamless integration with state-of-the-art Video LLMs such as LLaVA-OneVision and LLaVA-Video. Contribution/Results: Evaluated on short- and long-video understanding benchmarks, FastVID achieves new state-of-the-art performance. After pruning 90% of video tokens, it retains 98.0% of the original model's accuracy, yielding substantial reductions in computational overhead without compromising structural fidelity or semantic expressiveness.
Abstract
Video Large Language Models have shown impressive capabilities in video comprehension, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging this insight, we propose Dynamic Density Pruning for fast Video LLMs, termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, FastVID effectively prunes 90% of video tokens while retaining 98.0% of LLaVA-OneVision's original performance. The code is available at https://github.com/LunarShen/FastVID.
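The two stages the abstract describes — temporal segmentation followed by density-based token selection — can be sketched as follows. This is a minimal illustrative approximation, not the paper's actual implementation: the function names, the cosine-similarity boundary test, the mean-similarity density score, and the thresholds are all assumptions made for clarity; see the linked repository for the real method.

```python
import numpy as np

def segment_frames(frame_feats, sim_threshold=0.9):
    """Partition frames into temporally ordered segments (illustrative):
    start a new segment whenever a frame's cosine similarity to the
    previous frame drops below the threshold.

    frame_feats: (num_frames, dim) array of L2-normalized frame features.
    Returns a list of (start, end) index pairs covering all frames in order.
    """
    boundaries = [0]
    for t in range(1, len(frame_feats)):
        if float(frame_feats[t] @ frame_feats[t - 1]) < sim_threshold:
            boundaries.append(t)
    boundaries.append(len(frame_feats))
    return [(boundaries[i], boundaries[i + 1])
            for i in range(len(boundaries) - 1)]

def density_prune(tokens, keep_ratio=0.1):
    """Density-based pruning within a segment (illustrative): score each
    token by its mean cosine similarity to the other tokens, then keep
    the highest-density tokens as cluster representatives, preserving
    their original order.

    tokens: (num_tokens, dim) array of L2-normalized visual tokens.
    """
    sim = tokens @ tokens.T
    density = (sim.sum(axis=1) - 1.0) / (len(tokens) - 1)  # exclude self-similarity
    keep = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(density)[::-1][:keep]  # top-density token indices
    return tokens[np.sort(idx)]            # restore temporal/spatial order
```

With `keep_ratio=0.1`, this sketch retains roughly 10% of tokens per segment, mirroring the 90% pruning rate reported in the abstract; the real FastVID selection criteria are more involved.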