Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the quadratic computational overhead, limited context length, and high first-token latency that vision-language models (VLMs) face in long-video understanding, this paper proposes a plug-and-play, training-free dynamic video pruning method. The approach identifies temporally static, redundant patches via fine-grained change detection and, without altering the original positional encodings, dynamically skips these low-information tokens at inference time. To ensure robustness across varying compression ratios, the authors additionally uptrain models with stochastic pruning rates. Evaluated on multiple video-understanding benchmarks, the method reduces time-to-first-token by up to 4x with less than 1% accuracy degradation, enables longer video inputs, and significantly improves inference throughput. It is architecture-agnostic and compatible with mainstream VLMs, establishing a practical, scalable paradigm for efficient video-language understanding.

📝 Abstract
Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
Problem

Research questions and friction points this paper is trying to address.

Reduces token redundancy in video sequences
Enables faster inference for video-language models
Maintains accuracy while pruning temporally static patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes temporally static patches to reduce tokens
Preserves positional identity without retraining
Uses stochastic pruning rates for robustness
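The core pruning idea above can be illustrated with a minimal sketch: split each frame into patches, drop patches whose change from the previous frame falls below a threshold, and keep each surviving patch's original (frame, row, column) position so positional identity is preserved. The function name, 16x16 patch size, and threshold value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def evs_prune(frames, patch=16, threshold=0.02):
    """Sketch of temporally-static-patch pruning (EVS-style).

    frames: (T, H, W, C) float array in [0, 1].
    Returns a list of ((t, i, j), patch_pixels) pairs for patches that
    changed by more than `threshold` (mean absolute difference) relative
    to the same patch in the previous frame. The first frame is always
    kept in full. Names and values here are illustrative assumptions.
    """
    T, H, W, C = frames.shape
    gh, gw = H // patch, W // patch
    # Reshape into a (T, gh, gw, patch, patch, C) patch grid.
    grid = (frames[:, :gh * patch, :gw * patch]
            .reshape(T, gh, patch, gw, patch, C)
            .transpose(0, 1, 3, 2, 4, 5))
    kept = []
    for t in range(T):
        for i in range(gh):
            for j in range(gw):
                if t == 0:
                    changed = True  # always keep the first frame
                else:
                    diff = np.abs(grid[t, i, j] - grid[t - 1, i, j]).mean()
                    changed = diff > threshold
                if changed:
                    # Original position id travels with the patch, so the
                    # model's positional encodings need no modification.
                    kept.append(((t, i, j), grid[t, i, j]))
    return kept
```

On a perfectly static two-frame clip, only the first frame's patches survive; editing one region of the second frame adds exactly that patch back, which is the token-count saving the abstract describes.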
Authors (NVIDIA): Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao