🤖 AI Summary
This work addresses the significant challenges in cinematic shot language understanding (SLU), which stem from the multidimensional nature of film and the subjectivity inherent in expert interpretations, leading to a pronounced cognitive gap between existing vision-language models (VLMs) and human experts. To bridge this gap, the authors introduce SLU-SUITE, a comprehensive benchmark and training suite comprising 490,000 human-annotated question-answer pairs spanning six cinematic dimensions and 33 distinct tasks. They further present the first systematic diagnosis of VLM bottlenecks in SLU across modular components and cross-dimensional task interactions. Building on these insights, two general-purpose solutions are proposed: UniShot, a unified model trained with dynamically balanced data mixing, and AgentShots, an expert ensemble leveraging prompt-based routing. The proposed approaches outperform specialized ensemble models on in-domain tasks and surpass leading commercial VLMs by up to 22% on out-of-domain evaluations.
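The prompt-based routing behind AgentShots can be pictured as dispatching each question to the dimension expert it best matches. The sketch below uses a simple keyword heuristic; the expert names, keywords, and fallback behavior are illustrative assumptions, not the paper's actual routing scheme.

```python
# Hypothetical keyword-based router for an expert ensemble.
# Dimension names and keyword lists are illustrative only.
EXPERT_KEYWORDS = {
    "camera_motion": ["pan", "tilt", "dolly", "zoom", "tracking"],
    "lighting": ["key light", "low-key", "high-key", "backlight"],
    "composition": ["rule of thirds", "framing", "symmetry", "depth"],
}

def route_question(question: str, default: str = "generalist") -> str:
    """Pick the expert whose keywords best match the question text."""
    q = question.lower()
    scores = {
        expert: sum(kw in q for kw in kws)
        for expert, kws in EXPERT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a generalist model when no expert clearly matches.
    return best if scores[best] > 0 else default
```

In practice, such routing would likely be done by the VLM itself via a routing prompt rather than keyword matching; the heuristic here only illustrates the dispatch structure.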
📝 Abstract
Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging due to its diverse cinematographic dimensions and subjective expert judgment. While vision-language models (VLMs) have shown strong capabilities in general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we present two original insights into VLM-based SLU: from the model side, we diagnose the key bottlenecks of individual modules; from the data side, we quantify cross-dimensional influences among tasks. These findings motivate universal SLU solutions from two complementary paradigms: UniShot, a balanced one-for-all generalist trained via dynamically balanced data mixing, and AgentShots, a prompt-routed expert cluster that maximizes peak per-dimension performance. Extensive experiments show that our models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by up to 22% on out-of-domain tasks.
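The dynamically balanced data mixing used to train UniShot can be sketched as upweighting tasks where the model currently performs worst. The weighting rule, temperature parameter, and task names below are assumptions for illustration, not the paper's actual algorithm.

```python
import random

def dynamic_mixing_weights(task_scores: dict, temperature: float = 1.0) -> dict:
    """Assign larger sampling weights to lower-scoring tasks.

    task_scores maps task name -> current validation accuracy in [0, 1].
    """
    deficits = {t: (1.0 - s) ** (1.0 / temperature) for t, s in task_scores.items()}
    total = sum(deficits.values())
    return {t: d / total for t, d in deficits.items()}

def sample_task(weights: dict, rng=random) -> str:
    """Draw one task for the next training batch, proportional to its weight."""
    tasks, probs = zip(*weights.items())
    return rng.choices(tasks, weights=probs, k=1)[0]

# Example: three hypothetical cinematic dimensions with unequal accuracies.
scores = {"camera_motion": 0.9, "lighting": 0.6, "composition": 0.75}
weights = dynamic_mixing_weights(scores)
# The weakest dimension ("lighting") receives the largest sampling weight.
```

Recomputing the weights periodically during training (rather than once) is what makes the mixing "dynamic": as a weak dimension improves, its sampling share shrinks, keeping the generalist balanced across all six dimensions.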