ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a systematic benchmark for open-domain video shot retrieval, a task whose complex temporal structure and multimodal semantics remain difficult to model. To this end, we introduce ShotFinder, the first open-domain shot retrieval benchmark, featuring five types of single-factor controllable constraints: temporal dynamics, color, visual style, audio, and resolution. We further propose an "imagination-driven" three-stage retrieval paradigm: (1) video imagination and query expansion with large language models, (2) candidate video recall via search engines, and (3) description-guided temporal localization with multimodal models. Evaluation on 1,210 high-quality YouTube samples shows that current models significantly underperform humans, especially under color and visual-style constraints, exposing critical gaps in fine-grained semantic alignment and temporal understanding in multimodal foundation models.

📝 Abstract
In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has focused mainly on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: temporal order, color, visual style, audio, and resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, generated with large models and verified by humans. On top of the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with a clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results show that open-domain video shot retrieval remains a critical capability that multimodal large models have yet to master.
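The three-stage pipeline in the abstract can be sketched in code. This is a minimal illustrative skeleton, not the paper's implementation: the function names (`imagine_and_expand`, `recall_candidates`, `localize_shot`, `shot_finder`) and the `llm` / `search_engine` / `mllm` callables are all hypothetical placeholders standing in for real model and search-engine APIs.

```python
# Hypothetical sketch of the imagination-driven three-stage retrieval
# pipeline described above. All names are illustrative placeholders,
# not the paper's actual API.

def imagine_and_expand(query, llm):
    """Stage 1: an LLM 'imagines' the target shot and expands the query."""
    prompt = (
        "Imagine a video shot matching this editing requirement, describe "
        f"its keyframes, and list web-search keywords:\n{query}"
    )
    return llm(prompt)

def recall_candidates(expanded_query, search_engine, top_k=10):
    """Stage 2: recall candidate videos from a web search engine."""
    return list(search_engine(expanded_query))[:top_k]

def localize_shot(video, shot_description, mllm):
    """Stage 3: description-guided temporal localization with a multimodal
    model; expected to return a (start_seconds, end_seconds) span."""
    return mllm(video, shot_description)

def shot_finder(query, llm, search_engine, mllm, top_k=10):
    """Run all three stages and return (video, time_span) pairs."""
    expanded = imagine_and_expand(query, llm)
    candidates = recall_candidates(expanded, search_engine, top_k)
    return [(video, localize_shot(video, expanded, mllm))
            for video in candidates]
```

With stub callables in place of real models, `shot_finder("a slow-motion wave at sunset", llm, search_engine, mllm)` returns a ranked list of candidate videos, each paired with a localized time span, mirroring the expansion → recall → localization flow the abstract describes.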
Problem

Research questions and friction points this paper is trying to address.

open-domain video shot retrieval
temporal structure
multimodal large models
video semantics
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-domain video shot retrieval
keyframe-oriented description
controllable constraints
video imagination
temporal localization
👥 Authors
Tao Yu · Institute of Automation, Chinese Academy of Sciences · MLLM
Haopeng Jin · CASIA
Hao Wang · Institute of Automation, Chinese Academy of Sciences; Sun Yat-Sen University · Multimodal Learning, Computer Vision, Segmentation
Shenghua Chai · CASIA
Yujia Yang · UCAS
Junhao Gong · Peking University
Jiaming Guo · Institute of Computing Technology, Chinese Academy of Sciences · Artificial Intelligence, Reinforcement Learning
Minghui Zhang · CASIA
Xinlong Chen · CASIA, UCAS
Zhenghao Zhang · Florida State University · Communication Networks
Yuxuan Zhou · Tsinghua University · Medical NLP, Large Language Model
Yanpei Gong · CASIA
YuanCheng Liu · CASIA
Yiming Ding · CASIA
Kangwei Zeng · CASIA
Pengfei Yang · Institute of Software, Chinese Academy of Sciences · Probabilistic Model Checking, DNN Verification
Zhongtian Luo · CASIA
Yufei Xiong · CASIA
Shanbin Zhang · CASIA, UCAS
Shao-Yong Cheng · CASIA
Ruilin Huang · CASIA
Liangxun Shuo · CASIA
Yuxi Niu · CASIA
Xinyuan Zhang · CASIA
Yueya Xu · CASIA
Jie Mao · CASIA
Ruixuan Ji · CASIA
Yaru Zhao · CASIA
Mingchen Zhang · CASIA
Jiabing Yang · CASIA, UCAS
Jiaqi Liu · CASIA
Hongzhu Yi · UCAS
Xinming Wang · UCAS
Cheng Zhong · Lenovo
Xiao Ma · Lenovo
Zhang Zhang · Institute of Automation, Chinese Academy of Sciences · Computer Vision
Yan Huang · Institute of Automation, Chinese Academy of Sciences · Computer Vision, Deep Learning, Multimodal Learning
Liang Wang · National Lab of Pattern Recognition · Computer Vision, Pattern Recognition, Machine Learning