🤖 AI Summary
Existing shot boundary detection methods struggle to identify subtle transitions accurately because they rely on noisy, low-diversity human annotations and lack a modern, comprehensive evaluation benchmark. This work reframes the task as structured relational prediction and introduces the Shot-Query Transformer, the first end-to-end framework that jointly models intra- and inter-shot relationships through a shot query mechanism, densely capturing temporal dependencies among video frames. To address data scarcity and low diversity, the authors also develop a fully automatic, parameterized pipeline for synthesizing video transitions, which enables large-scale generation of diverse training data. They further release OmniShotCutBench, a comprehensive benchmark spanning multiple domains. The proposed method achieves significant improvements in both accuracy and interpretability for hard and soft cuts, and consistently outperforms existing approaches on the new benchmark.
📝 Abstract
Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. Although SBD has been widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction: a shot query-based dense video Transformer jointly estimates shot ranges together with intra-shot and inter-shot relations. To avoid imprecise manual labeling, we adopt a fully synthetic transition pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
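To make the idea of synthetic transitions with precise boundaries concrete, here is a minimal sketch of one transition family, a cross-dissolve whose length is a parameter. It is not the authors' pipeline. The names `SyntheticSample`, `blend`, and `synthesize_dissolve`, the toy frame representation (a flat list of intensities), and the linear blending schedule are all assumptions made for illustration. The point it shows is that when the transition is generated rather than annotated, the shot ranges and the transition range are known exactly by construction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Toy stand-in for a video frame: a flat list of pixel intensities in [0, 1].
Frame = List[float]


@dataclass
class SyntheticSample:
    """A synthesized clip plus exact, inclusive (start, end) frame ranges."""
    frames: List[Frame]
    shot_a: Tuple[int, int]
    transition: Optional[Tuple[int, int]]  # None means a hard cut
    shot_b: Tuple[int, int]


def blend(a: Frame, b: Frame, alpha: float) -> Frame:
    """Linear mix of two frames; alpha=0 gives a, alpha=1 gives b."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def synthesize_dissolve(clip_a: List[Frame], clip_b: List[Frame],
                        duration: int) -> SyntheticSample:
    """Join two clips with a cross-dissolve lasting `duration` frames.

    The dissolve overlaps the last `duration` frames of clip_a with the
    first `duration` frames of clip_b. duration=0 yields a hard cut.
    (Hypothetical helper, assumed for this sketch.)
    """
    if not 0 <= duration < min(len(clip_a), len(clip_b)):
        raise ValueError("duration must leave at least one clean frame per shot")

    head = clip_a[: len(clip_a) - duration]   # untouched frames of shot A
    tail = clip_b[duration:]                  # untouched frames of shot B
    offset = len(clip_a) - duration
    # Alpha steps strictly between 0 and 1, so every mixed frame contains
    # content from both shots.
    mixed = [
        blend(clip_a[offset + i], clip_b[i], (i + 1) / (duration + 1))
        for i in range(duration)
    ]

    frames = head + mixed + tail
    t_start = len(head)
    t_end = t_start + duration - 1
    return SyntheticSample(
        frames=frames,
        shot_a=(0, t_start - 1),
        transition=(t_start, t_end) if duration > 0 else None,
        shot_b=(t_end + 1, len(frames) - 1),
    )


if __name__ == "__main__":
    dark = [[0.0] for _ in range(4)]    # shot A: four black one-pixel frames
    bright = [[1.0] for _ in range(4)]  # shot B: four white one-pixel frames
    sample = synthesize_dissolve(dark, bright, duration=2)
    print(sample.shot_a, sample.transition, sample.shot_b)
```

Sampling `duration` (and, in a fuller version, the transition family and its other parameters) at random would give many labeled variants from the same pair of source clips, which is the general appeal of a parameterized synthetic pipeline over manual annotation.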