🤖 AI Summary
This work addresses the lack of a systematic benchmark for open-domain video shot retrieval, which hinders the modeling of complex temporal structure and multimodal semantics. To this end, we introduce ShotFinder, the first open-domain shot retrieval benchmark, featuring five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We further propose an "imagination-driven" three-stage retrieval paradigm: (1) video imagination and query expansion using large language models, (2) candidate video recall via search engines, and (3) description-guided temporal localization with multimodal models. Evaluation on 1,210 high-quality YouTube samples shows that current models significantly underperform humans, especially under the color and visual style constraints, highlighting critical gaps in fine-grained semantic alignment and temporal understanding in multimodal foundation models.
📝 Abstract
In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has focused mainly on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, generated with large models and verified by humans. Building on this benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with a clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results show that open-domain video shot retrieval remains a critical challenge that current multimodal large models have yet to overcome.
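To make the three-stage pipeline concrete, here is a minimal Python sketch of its control flow, assuming placeholder components: `imagine_video` (LLM-based query expansion), `search_videos` (search-engine recall), and `localize_shot` (multimodal temporal localization). All names and stub behaviors are hypothetical illustrations of the flow described in the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch of a ShotFinder-style three-stage pipeline.
# Every function below is a placeholder stub; a real system would back
# each stage with an LLM, a search engine API, and a multimodal model.

from dataclasses import dataclass


@dataclass
class ShotResult:
    video_id: str
    start_s: float  # start of the localized shot, in seconds
    end_s: float    # end of the localized shot, in seconds


def imagine_video(query: str) -> list[str]:
    """Stage 1: 'imagine' the target video with an LLM and expand the
    query into richer search strings (placeholder implementation)."""
    return [query, f"{query} footage", f"{query} scene"]


def search_videos(expanded_queries: list[str], k: int = 5) -> list[str]:
    """Stage 2: recall candidate videos from a search engine.
    Here we fabricate candidate IDs purely for illustration."""
    return [f"video_{i}" for i in range(k)]


def localize_shot(video_id: str, description: str) -> ShotResult:
    """Stage 3: a multimodal model would score temporal windows against
    the shot description; this stub returns a fixed window."""
    return ShotResult(video_id, start_s=0.0, end_s=5.0)


def retrieve_shot(query: str) -> ShotResult:
    expanded = imagine_video(query)       # (1) imagination-driven expansion
    candidates = search_videos(expanded)  # (2) candidate video recall
    # (3) description-guided localization; we take the first candidate,
    # whereas a real system would rank candidates by relevance.
    return localize_shot(candidates[0], query)


if __name__ == "__main__":
    print(retrieve_shot("a slow-motion shot of a hummingbird in warm tones"))
```

In a full system, each stub would be replaced by a real model or API call, and the single-candidate shortcut in step (3) would become a ranking over all recalled videos.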