MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation evaluation methods are largely confined to single-shot scenarios and struggle to assess the narrative coherence and appeal of multi-shot long-form videos. To address this gap, this work proposes MSVBench, the first comprehensive benchmark tailored for multi-shot video generation, which leverages hierarchical scripts and reference images to establish a human-level evaluation framework. By combining the semantic understanding of large multimodal models with the fine-grained perceptual capabilities of domain-specific expert models, the framework enables multi-level automatic assessment, achieving a Spearman correlation of 94.4% with human judgments. A lightweight model fine-tuned on the benchmark's scalable supervision signals matches the performance of Gemini-2.5-Flash; the evaluation further reveals that prevailing approaches often function as visual interpolators rather than genuine world models.

📝 Abstract
The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models, despite strong visual fidelity, primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman's rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.
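The abstract validates the benchmark with a Spearman's rank correlation of 94.4% against human judgments. As background on the metric itself (the paper's actual data and pipeline are not shown here), a minimal stdlib-only sketch of how such a correlation is computed between automatic and human scores; all score values below are illustrative assumptions, not taken from MSVBench:

```python
def ranks(xs):
    """Assign ranks 1..n in ascending order (assumes no ties, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(x, y):
    """Spearman's rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the per-item difference between the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for five generated multi-shot videos:
# automatic framework scores vs. mean human ratings (illustrative only).
auto_scores = [0.82, 0.61, 0.93, 0.47, 0.75]
human_scores = [4.1, 3.2, 4.6, 3.4, 3.9]

print(f"Spearman's rho: {spearman_rho(auto_scores, human_scores):.3f}")  # 0.900
```

Because rho depends only on rank order, it tolerates the different scales of automatic scores and human ratings, which is why it is the standard choice for benchmark-vs-human agreement studies.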
Problem

Research questions and friction points this paper is trying to address.

multi-shot video generation
video evaluation
long-form coherence
narrative video
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

MSVBench
multi-shot video generation
hybrid evaluation framework
Large Multimodal Models
human-aligned supervision