🤖 AI Summary
Existing benchmarks for film cinematography understanding, particularly ShotBench, and the state-of-the-art model ShotVL suffer from three problems: ambiguous answer-option design in the benchmark, and inconsistent reasoning and poor instruction adherence in the model. These problems undermine evaluation reliability and hinder fair model comparison. To address them, we propose RefineShot, a refined benchmark that (1) restructures ShotBench's answer options consistently, (2) analyzes the consistency of ShotVL's reasoning, and (3) adds an evaluation of instruction adherence. RefineShot thus forms a joint evaluation framework that measures overall task accuracy alongside core model competencies, namely reasoning consistency and instruction following. It is intended to make assessment more reliable, to expose weaknesses in current models, and to support future research in cinematography understanding.
📝 Abstract
Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL's shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL's reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.
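As a rough illustration of what jointly assessing task accuracy and model competencies could look like, the sketch below computes three rates over a set of multiple-choice responses. It is not RefineShot's actual protocol: the `Response` fields, the `joint_report` function, and the metric definitions are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Response:
    """One model response to a multiple-choice item (hypothetical schema)."""
    gold: str                        # correct option letter
    final_answer: Optional[str]      # option given as the final answer, None if unparseable
    reasoning_answer: Optional[str]  # option the reasoning text concludes on, None if no reasoning
    followed_format: bool            # whether the output obeyed the requested format


def joint_report(responses: List[Response]) -> Dict[str, float]:
    """Report accuracy, reasoning consistency, and instruction adherence.

    Assumed definitions (not taken from the paper):
    - accuracy: share of items whose final answer matches the gold option
    - consistency: among items with reasoning, share where the reasoning's
      conclusion matches the final answer
    - adherence: share of items whose output followed the requested format
    """
    if not responses:
        raise ValueError("responses must not be empty")
    n = len(responses)
    accuracy = sum(r.final_answer == r.gold for r in responses) / n
    reasoned = [r for r in responses if r.reasoning_answer is not None]
    consistency = (
        sum(r.reasoning_answer == r.final_answer for r in reasoned) / len(reasoned)
        if reasoned
        else 0.0
    )
    adherence = sum(r.followed_format for r in responses) / n
    return {"accuracy": accuracy, "consistency": consistency, "adherence": adherence}


# Made-up responses for illustration only.
example = [
    Response(gold="A", final_answer="A", reasoning_answer="A", followed_format=True),
    Response(gold="B", final_answer="B", reasoning_answer="C", followed_format=True),
    Response(gold="C", final_answer="A", reasoning_answer="A", followed_format=False),
    Response(gold="D", final_answer="D", reasoning_answer=None, followed_format=True),
]
report = joint_report(example)
```

Reporting the three rates separately, rather than folding them into one score, keeps a model that answers correctly but reasons inconsistently distinguishable from one that does both well.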