Towards Sparse Video Understanding and Reasoning

📅 2026-02-14
🤖 AI Summary
This work addresses the high computational cost and low inference efficiency in video question answering caused by redundant frames. To enable efficient sparse understanding, the authors propose a multi-round agent framework that dynamically selects key frames, maintains a summary state, and terminates reasoning early when sufficient evidence is gathered. A novel annotation-free EAGER reward mechanism is introduced, integrating confidence gain, summary adequacy, and early-stopping correctness to support plug-and-play deployment and reinforcement-based fine-tuning. Extensive experiments on multiple video QA benchmarks demonstrate that the method significantly reduces the number of processed frames, inference rounds, and prompt tokens while simultaneously improving accuracy, thereby validating the effectiveness and practicality of sparse reasoning in video understanding.
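The multi-round control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `AgentState`, `sparse_vqa_loop`, the `step` callable, `max_rounds`, and `conf_threshold` are all hypothetical names and defaults introduced here. The `step` callable stands in for one VLM call that returns an updated summary, the frame indices it wants next, an optional answer, and a confidence score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    """State carried across rounds (hypothetical structure)."""
    summary: str = ""                                      # summary-as-state
    seen_frames: List[int] = field(default_factory=list)   # frames selected so far


# One agent round: (question, state) -> (summary, requested_frames, answer, confidence).
# This signature is an assumption made for illustration.
StepFn = Callable[[str, AgentState], Tuple[str, List[int], Optional[str], float]]


def sparse_vqa_loop(
    question: str,
    step: StepFn,
    max_rounds: int = 4,          # assumed round budget
    conf_threshold: float = 0.8,  # assumed early-stop threshold
) -> Tuple[Optional[str], int, AgentState]:
    """Run rounds until the agent answers confidently or the budget runs out."""
    state = AgentState()
    answer: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        summary, frames, answer, confidence = step(question, state)
        state.summary = summary  # commit the new summary
        for frame in frames:     # record newly requested frames, without duplicates
            if frame not in state.seen_frames:
                state.seen_frames.append(frame)
        # Early stop: return as soon as a sufficiently confident answer exists.
        if answer is not None and confidence >= conf_threshold:
            return answer, round_idx, state
    return answer, max_rounds, state
```

In this sketch the number of processed frames is simply `len(state.seen_frames)`, so stopping early directly limits how many frames and rounds are consumed.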

📝 Abstract
We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
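The three reward terms listed in the abstract can be sketched as below. This is an illustrative reading under stated assumptions, not the paper's exact formulation: the function names, the equal default `weights`, the default `turn_budget`, and the plain weighted sum are all hypothetical, since the abstract does not say how the terms are scaled or combined.

```python
import math
from typing import Sequence, Tuple


def log_odds_gap(probs: Sequence[float], correct: int) -> float:
    """Log-probability gap between the correct option and the strongest alternative."""
    eps = 1e-12  # guards against log(0)
    best_other = max(p for i, p in enumerate(probs) if i != correct)
    return math.log(probs[correct] + eps) - math.log(best_other + eps)


def eager_reward(
    probs_before: Sequence[float],   # option probabilities before the new frames
    probs_after: Sequence[float],    # option probabilities after the new frames
    correct: int,                    # index of the correct option
    summary_only_answer: int,        # answer when re-asked with only the summary
    final_answer: int,               # answer the agent committed to
    turns_used: int,
    turn_budget: int = 3,                                   # assumed budget
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
) -> float:
    # (1) Confidence gain: increase in the log-odds gap after adding frames.
    gain = log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)
    # (2) Summary sufficiency: the summary alone yields the correct answer.
    sufficiency = 1.0 if summary_only_answer == correct else 0.0
    # (3) Correct-and-early stop: correct answer within the turn budget.
    early = 1.0 if final_answer == correct and turns_used <= turn_budget else 0.0
    w_gain, w_suff, w_early = weights
    return w_gain * gain + w_suff * sufficiency + w_early * early
```

For example, if the option probabilities move from uniform `[0.25, 0.25, 0.25, 0.25]` to `[0.7, 0.1, 0.1, 0.1]` with option 0 correct, the confidence-gain term is ln(0.7 / 0.1) = ln 7, and the other two terms each contribute 1 when the summary-only answer and an in-budget final answer are both correct.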
Problem

Research questions and friction points this paper is trying to address.

Sparse Video Understanding
Video Question Answering
Efficient Reasoning
Frame Selection
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Video Reasoning
Video Question Answering
Reinforcement Fine-Tuning
EAGER Reward
Summary-as-State