Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the quadratic computational overhead, limited context length, and high first-token latency that vision-language models (VLMs) face in long-video understanding, this paper proposes a plug-and-play, training-free dynamic video pruning method. The approach identifies temporally static, redundant patches via fine-grained change detection and, without altering the original positional encodings, dynamically skips these low-information tokens at inference time. To ensure robustness across varying compression ratios, the authors additionally uptrain models with stochastic pruning rates. Evaluated on multiple video-understanding benchmarks, the method reduces time-to-first-token by up to 4x with less than 1% accuracy degradation, enables longer video inputs, and significantly improves inference throughput. It is architecture-agnostic and compatible with mainstream VLMs, establishing a practical, scalable paradigm for efficient video-language understanding.

📝 Abstract
Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches -- spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity and requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.
Problem

Research questions and friction points this paper is trying to address.

Reduces token redundancy in video sequences
Enables faster inference for video-language models
Maintains accuracy while pruning temporally static patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes temporally static patches to reduce tokens
Preserves positional identity without retraining
Uses stochastic pruning rates for robustness
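The core pruning idea above can be illustrated with a minimal sketch: split each frame into patches, drop patches whose change from the previous frame falls below a threshold, and keep each surviving patch's original (frame, row, column) position so positional identity is preserved. The function name, 16x16 patch size, and threshold value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def evs_prune(frames, patch=16, threshold=0.02):
    """Sketch of temporally-static-patch pruning (EVS-style).

    frames: (T, H, W, C) float array in [0, 1].
    Returns a list of ((t, i, j), patch_pixels) pairs for patches that
    changed by more than `threshold` (mean absolute difference) relative
    to the same patch in the previous frame. The first frame is always
    kept in full. Names and values here are illustrative assumptions.
    """
    T, H, W, C = frames.shape
    gh, gw = H // patch, W // patch
    # Reshape into a (T, gh, gw, patch, patch, C) patch grid.
    grid = (frames[:, :gh * patch, :gw * patch]
            .reshape(T, gh, patch, gw, patch, C)
            .transpose(0, 1, 3, 2, 4, 5))
    kept = []
    for t in range(T):
        for i in range(gh):
            for j in range(gw):
                if t == 0:
                    changed = True  # always keep the first frame
                else:
                    diff = np.abs(grid[t, i, j] - grid[t - 1, i, j]).mean()
                    changed = diff > threshold
                if changed:
                    # Original position id travels with the patch, so the
                    # model's positional encodings need no modification.
                    kept.append(((t, i, j), grid[t, i, j]))
    return kept
```

On a perfectly static two-frame clip, only the first frame's patches survive; editing one region of the second frame adds exactly that patch back, which is the token-count saving the abstract describes.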
Authors (NVIDIA): Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao