Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

📅 2025-09-30
🤖 AI Summary
Current video generation evaluation frameworks lack the granularity required for professional film production. To address this, the authors propose Stable Cinemetrics, the first structured evaluation framework tailored to professional video generation. It formalizes cinematic control into four disentangled, hierarchical taxonomies (Setup, Event, Lighting, and Camera) encompassing 76 fine-grained control nodes. The authors construct a benchmark of prompts aligned with professional use cases, design an automated pipeline for prompt categorization and question generation that enables independent evaluation of each control dimension, and train a vision-language automatic evaluator aligned with expert film annotations. A large-scale study spanning 10+ models, 20K generated videos, and 80+ film professionals reveals significant gaps in even the strongest current models, particularly in Event and Camera controls, and shows that the trained evaluator outperforms existing zero-shot baselines.

📝 Abstract
Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analyses, both coarse- and fine-grained, reveal that even the strongest current models exhibit significant gaps, particularly in Event- and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations, that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
Problem

Research questions and friction points this paper is trying to address.

Existing models fail to meet the complexity requirements of professional video generation
Current benchmarks lack structured evaluation of cinematic controls such as Camera and Events
No automated framework exists for scalable assessment of professional video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Four disentangled, hierarchical taxonomies (Setup, Event, Lighting, Camera) formalize filmmaking controls into 76 fine-grained nodes
An automated prompt-categorization and question-generation pipeline enables independent evaluation of each control dimension
A vision-language evaluator aligned with expert annotations outperforms zero-shot baselines