FastVID: Dynamic Density Pruning for Fast Video Large Language Models

πŸ“… 2025-03-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Video Large Language Models (Video LLMs) suffer from prohibitively high inference costs due to redundant video tokens, and existing token pruning methods fail to fully capture spatiotemporal redundancy. Method: The authors propose FastVID, a Dynamic Density Pruning framework that partitions videos into temporally ordered segments and applies density-based visual token selection, compressing tokens while preserving spatiotemporal structure. By jointly modeling temporal and visual contextual redundancy, the method integrates seamlessly with state-of-the-art Video LLMs such as LLaVA-OneVision and LLaVA-Video. Contribution/Results: Across short- and long-video understanding benchmarks, FastVID achieves state-of-the-art performance; after pruning 90% of video tokens, it retains 98.0% of LLaVA-OneVision's original accuracy, substantially reducing computational overhead without compromising structural fidelity or semantic expressiveness.

πŸ“ Abstract
Video Large Language Models have shown impressive capabilities in video comprehension, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging this insight, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, FastVID effectively prunes 90% of video tokens while retaining 98.0% of LLaVA-OneVision's original performance. The code is available at https://github.com/LunarShen/FastVID.
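The two-stage pipeline the abstract describes (temporal partitioning followed by density-based token pruning) can be sketched as below. This is a minimal illustrative reading, not the paper's implementation: the similarity threshold, the frame descriptor (mean-pooled patch tokens), and the density score (mean cosine similarity to other tokens in a segment) are all hypothetical choices for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity matrix between two token sets.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def temporal_partition(frame_feats, thresh=0.85):
    # Split frames into temporally ordered segments, cutting wherever
    # similarity between adjacent frame descriptors drops below `thresh`.
    cuts = [0]
    for t in range(1, len(frame_feats)):
        sim = float(cosine_sim(frame_feats[t - 1 : t], frame_feats[t : t + 1]))
        if sim < thresh:
            cuts.append(t)
    cuts.append(len(frame_feats))
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def density_prune(tokens, keep_ratio=0.1):
    # Score each token by its local density (mean similarity to the other
    # tokens in the segment) and keep the densest ones as representatives
    # of redundant clusters, preserving their original order.
    sim = cosine_sim(tokens, tokens)
    density = sim.mean(axis=1)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(-density)[:k])
    return tokens[keep]

def fast_prune(video_tokens, keep_ratio=0.1, seg_thresh=0.85):
    # video_tokens: [T, N, D] patch tokens for T frames of N tokens each.
    frame_feats = video_tokens.mean(axis=1)  # [T, D] frame descriptors
    pruned = []
    for s, e in temporal_partition(frame_feats, seg_thresh):
        seg = video_tokens[s:e].reshape(-1, video_tokens.shape[-1])
        pruned.append(density_prune(seg, keep_ratio))
    return np.concatenate(pruned, axis=0)
```

With `keep_ratio=0.1` this mirrors the 90% pruning rate reported in the paper; the pruned token sequence keeps temporal order across segments, consistent with the stated goal of preserving temporal structure.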
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in Video Large Language Models.
Addresses redundancy in video tokens for efficient deployment.
Maintains temporal and visual integrity while pruning video tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Density Pruning reduces video token redundancy.
Temporal and visual context analysis enhances pruning efficiency.
FastVID maintains performance while cutting computational costs.
πŸ”Ž Similar Papers
No similar papers found.
Authors
Leqi Shen, Tsinghua University
Guoqiang Gong, JD.com
Tao He, GRG Banking Equipment Co., Ltd.; South China University of Technology
Yifeng Zhang, JD.com
Pengzhang Liu, JD.com
Sicheng Zhao, Tsinghua University (Affective Computing, Multimedia, Domain Adaptation, Computer Vision)
Guiguang Ding, Tsinghua University (Computer Vision, Multimedia Retrieval)