Shot-Aware Frame Sampling for Video Understanding

πŸ“… 2026-03-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing frame sampling methods for long-video understanding struggle to balance global coverage with the capture of transient critical events, limiting downstream task performance. This work proposes InfoShot, a task-agnostic, shot-aware frame sampler that first segments videos into semantically coherent clips using shot boundary detection and then selects two complementary keyframes from each clip: one representing the dominant content and the other capturing anomalous changes. By optimizing an information-theoretic objective, InfoShot preserves both structural context and sparse within-shot deviations without requiring model retraining. To evaluate short-term anomaly detection, the authors introduce SynFlash, a controllable synthetic benchmark. Experiments demonstrate that under strict frame budgets, InfoShot significantly improves anomaly hit rates and video question-answering accuracy, achieving competitive or superior performance against strong baselines on standard benchmarks.

πŸ“ Abstract
Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage against capturing brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments, without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under strict frame budgets, while matching or outperforming strong baselines on standard video understanding benchmarks.
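The two-keyframe-per-shot idea can be sketched in a few lines. The following is a minimal illustrative toy, not the paper's actual InfoShot objective: frames are stand-in feature vectors, shot boundaries are cut wherever the distance between consecutive frames exceeds a threshold (the paper uses a proper shot boundary detector), and each shot contributes the frame nearest its centroid (dominant content) plus the frame farthest from it (anomalous change). All function names and the threshold parameter are hypothetical.

```python
# Toy sketch of shot-aware two-keyframe sampling (illustrative only, not
# the paper's information-theoretic InfoShot method). Frames are small
# feature vectors; in practice these would come from a visual encoder.

def l2(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def split_shots(frames, cut_thresh=2.0):
    """Greedy shot boundary detection: cut where the consecutive-frame
    distance exceeds cut_thresh. Returns (start, end) index pairs."""
    shots, start = [], 0
    for i in range(1, len(frames)):
        if l2(frames[i - 1], frames[i]) > cut_thresh:
            shots.append((start, i))
            start = i
    shots.append((start, len(frames)))
    return shots

def sample_two_keyframes(frames, cut_thresh=2.0):
    """Per shot, keep two complementary frames: the one nearest the shot
    centroid (dominant content) and the one farthest from it (anomaly)."""
    picks = set()
    for s, e in split_shots(frames, cut_thresh):
        shot = frames[s:e]
        dim = len(shot[0])
        centroid = [sum(f[d] for f in shot) / len(shot) for d in range(dim)]
        dists = [l2(f, centroid) for f in shot]
        picks.add(s + min(range(len(shot)), key=dists.__getitem__))
        picks.add(s + max(range(len(shot)), key=dists.__getitem__))
    return sorted(picks)

# A brief "flash" at index 3 inside an otherwise static first shot:
frames = [[0.0], [0.05], [0.1], [0.8], [0.1], [5.0], [5.1], [5.05]]
print(split_shots(frames))          # [(0, 5), (5, 8)]
print(sample_two_keyframes(frames)) # [2, 3, 5, 7] -- 3 is the flash
```

The point of the toy: a uniform sampler under the same four-frame budget could easily skip index 3, whereas the farthest-from-centroid pick retains the short anomalous event by construction.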
Problem

Research questions and friction points this paper is trying to address.

video frame sampling
long-video understanding
critical events
shot-aware
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

shot-aware sampling
information-theoretic frame selection
keyframe complementarity
long-video understanding
anomaly detection