EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

📅 2026-05-22
🤖 AI Summary
Current evaluation methods for video generation are largely confined to basic prompt adherence—assessing whether outputs are “correct”—while neglecting critical dimensions of cinematic excellence such as aesthetic quality, performative expressiveness, and audiovisual coherence that determine whether results are truly “excellent.” Moreover, existing automated metrics lack professional credibility. To bridge this gap, this work proposes EvalVerse, the first framework to formalize expert knowledge from professional film production into a structured evaluation taxonomy. Leveraging large-scale expert-annotated data, EvalVerse employs expert-calibrated fine-tuning of vision-language models combined with chain-of-thought reasoning to enable a qualitative leap from correctness to excellence. The framework supports fine-grained diagnostics for multi-shot sequences and complex audiovisual tasks, significantly enhancing assessment fidelity for professional-grade video quality while remaining compatible with conventional metrics, thereby establishing a reliable infrastructure for reward modeling and intelligent evaluation.
📝 Abstract
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared to previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics, but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes a fundamental infrastructure for future work, such as reward models and evaluator agents.
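As a rough illustration of what a pipeline-aware taxonomy with granular diagnostic signals could look like in code, the sketch below groups per-criterion judgments by filmmaking stage and reports a mean score per stage. Only the three stage names come from the abstract. The criterion names, their placement under each stage, the 1-5 score scale, and the names `TAXONOMY`, `Judgment`, and `stage_report` are all hypothetical and are not taken from EvalVerse itself.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical taxonomy: the stage names follow the abstract, but the
# criteria and their assignment to stages are illustrative assumptions.
TAXONOMY = {
    "pre-production": ["prompt_following"],
    "production": ["acting", "aesthetics"],
    "post-production": ["multi_shot_sequencing", "audio_visual_integration"],
}


@dataclass
class Judgment:
    """One evaluator judgment on a single criterion (assumed structure)."""
    criterion: str
    score: float    # assumed 1-5 scale
    rationale: str  # free-text reasoning behind the score


def stage_report(judgments):
    """Average the scores of all judgments that fall under each stage.

    A stage with no judgments maps to None, so missing coverage stays
    visible instead of being silently scored.
    """
    by_criterion = {}
    for j in judgments:
        by_criterion.setdefault(j.criterion, []).append(j.score)

    report = {}
    for stage, criteria in TAXONOMY.items():
        scores = [s for c in criteria for s in by_criterion.get(c, [])]
        report[stage] = mean(scores) if scores else None
    return report
```

Keeping the rationale next to each score mirrors the idea of explicit reasoning accompanying a judgment; how EvalVerse actually aggregates or weights its criteria is not stated in the abstract.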
Problem

Research questions and friction points this paper is trying to address.

cinematic video generation
evaluation benchmark
aesthetic quality
expert calibration
automated metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

pipeline-aware evaluation
expert-calibrated VLM
cinematic video generation
Chain-of-Thought reasoning
aesthetic benchmarking
Songlin Yang
The Hong Kong University of Science and Technology, Tencent
Haobin Zhong
Tencent
Ruilin Zhang
Tsinghua University
Xiaotong Zhao
Tencent
Shuai Li
Tencent
ML Infra, Multimodal, RL
Kai Zheng
Tencent Hunyuan X
Machine Learning, Natural Language Processing
Xuyi Yang
The Hong Kong University of Science and Technology, Tencent
Zhe Wang
The Hong Kong University of Science and Technology
Atmospheric chemistry, Heterogeneous Chemistry, SOA formation, Cloud-aerosol-gas interaction
Zhenchen Tang
Tencent, Institute of Automation, Chinese Academy of Sciences
Yang Li
Institute of Automation, Chinese Academy of Sciences
MLLM, Agent, brain-inspired intelligence, Artificial intelligence
Bohai Gu
The Hong Kong University of Science and Technology, Tencent
Zhengwei Peng
Tencent
Yidan Huang
Beijing Film Academy
Mengzhou Luo
Beijing Film Academy
Yihang Bo
Beijing Film Academy
Dalu Feng
Beijing Film Academy
Yujia Zhang
Tencent
Computer Vision, Video Understanding
Juntao Ma
Tencent
Ruiqi Wang
Tencent
Lvmin Zhang
Stanford University
Yuwei Guo
MMLab, The Chinese University of Hong Kong
Computer Vision, Generative AI, Video Generation
Frank Guan
Singapore Institute of Technology
Maneesh Agrawala
Stanford University
Graphics, Computer Graphics, HCI, Visualization
Hongbo Fu
Professor and Acting Head, Arts and Machine Creativity, HKUST
Computer Graphics, Human-Computer Interaction, Computer Vision
Alan Zhao
Tencent