🤖 AI Summary
Current AI models’ capabilities in understanding and generating cinematographic techniques remain poorly characterized, primarily due to the absence of high-quality, expert-annotated data. Method: We introduce CineTechBench, the first multimodal benchmark dedicated to cinematography, spanning seven dimensions: shot scale, camera angle, composition, camera movement, lighting, color, and focal length. It comprises 600+ expert-annotated movie images and 120+ movie clips. We propose a structured multidimensional prompting scheme, an image-text alignment question-answering evaluation protocol, and a condition-driven video reconstruction framework. Contribution/Results: Our benchmark enables the first unified evaluation of 15+ multimodal large language models and 5+ video generation models. Experiments uncover systematic deficiencies in semantic cinematographic modeling and physically plausible camera motion synthesis. CineTechBench provides a reproducible evaluation standard, diagnostic tools, and concrete directions for advancing AI-assisted film creation.
📝 Abstract
Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation covering 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at https://github.com/PRIS-CV/CineTechBench.