EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high token overhead and low inference efficiency of video large language models (VLMs) on long videos, this paper proposes EventSTU, a training-free, event-guided spatio-temporal compression framework. Methodologically, it leverages the asynchronous, change-triggered nature of event cameras to design a joint spatio-temporal scheme combining adaptive keyframe sampling with a zero-cost spatial token pruning mechanism, and further introduces a question-relevance-driven strategy for dynamically allocating the computational budget. To standardize evaluation, the authors construct EventBench, the first multimodal benchmark featuring human-annotated event data. Experiments show that EventSTU achieves a 3.01× reduction in FLOPs and a 3.10× speedup in prefill latency relative to the strongest baseline, while preserving or even improving video understanding performance. Notably, EventSTU generalizes to both real and simulated event signals.

📝 Abstract
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
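The two compression steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mean-based coarse filter, top-k fine selection, and fixed pruning ratio are assumptions chosen for clarity; the paper's actual scoring and budgets differ.

```python
import numpy as np

def select_keyframes(event_counts, num_keyframes):
    """Coarse-to-fine keyframe sampling (illustrative): per-frame event
    counts act as a change-triggered signal; static stretches produce
    few events and are dropped."""
    # Coarse pass: keep frames whose event activity exceeds the mean.
    active = np.where(event_counts > event_counts.mean())[0]
    if len(active) == 0:
        active = np.arange(len(event_counts))
    # Fine pass: among active frames, keep the top-k by event count.
    order = active[np.argsort(event_counts[active])[::-1]]
    return np.sort(order[:num_keyframes])

def prune_tokens(saliency, budget_ratio):
    """Zero-cost spatial pruning (illustrative): keep the most
    event-salient patch tokens; `saliency` is one event-density score
    per patch, available for free from the event stream."""
    k = max(1, int(len(saliency) * budget_ratio))
    keep = np.argsort(saliency)[::-1][:k]
    return np.sort(keep)

# Toy example: 10 frames with a burst of motion around frames 4-6.
counts = np.array([1, 2, 1, 3, 20, 25, 18, 2, 1, 2])
frames = select_keyframes(counts, num_keyframes=3)   # -> [4, 5, 6]
saliency = np.array([0.1, 0.9, 0.3, 0.8, 0.05, 0.6])
kept = prune_tokens(saliency, budget_ratio=0.5)      # -> [1, 3, 5]
```

The key property exploited here is that event cameras only fire on brightness changes, so event density is a free redundancy signal in both time (which frames) and space (which patches).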
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference costs in video large language models
Eliminating redundant frames using event-guided temporal sampling
Pruning spatial tokens adaptively using event-based visual saliency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-guided keyframe sampling reduces temporal redundancy
Event-based token pruning optimizes spatial representation
Question-aware budget allocation integrates spatio-temporal efficiency
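The third bullet, question-aware budget allocation, can be sketched as proportional sharing of a global token budget across keyframes. The proportional rule and rounding fix below are assumptions for illustration, not the paper's exact allocation scheme.

```python
import numpy as np

def allocate_budgets(relevance, total_tokens):
    """Split a global token budget across keyframes in proportion to
    each keyframe's question-relevance score (illustrative rule)."""
    w = np.asarray(relevance, dtype=float)
    w = w / w.sum()
    budgets = np.floor(w * total_tokens).astype(int)
    # Hand any rounding remainder to the most relevant keyframe.
    budgets[np.argmax(w)] += total_tokens - budgets.sum()
    return budgets

# Three keyframes; the second is most relevant to the question.
budgets = allocate_budgets([0.2, 0.5, 0.3], total_tokens=100)
# -> [20, 50, 30]
```

Allocating more tokens to question-relevant keyframes lets the model spend its fixed compute where the answer is likely to be, rather than pruning every frame uniformly.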
Wenhao Xu — Unknown affiliation
Xin Dong — University of Science and Technology of China
Yue Li — University of Science and Technology of China
Haoyuan Shi — University of Science and Technology of China
Zhiwei Xiong — University of Science and Technology of China (computational photography, biomedical image analysis)