MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing audiovisual generation evaluation frameworks suffer from limited data diversity, insufficient task coverage, and inadequate flexibility, hindering their applicability to complex, multi-shot narrative scenarios. This work proposes the first comprehensive benchmark and adaptive hybrid evaluation framework tailored for multi-shot audiovisual generation, encompassing four dimensions—video, audio, shot structure, and reference alignment—and supporting up to 15 shots, including non-photorealistic content. The framework introduces novel components: self-correcting shot segmentation, instantiated scoring rules, and a tool-driven evidence extraction mechanism. It achieves a Spearman rank correlation of 91.5% with human judgments and systematically evaluates 19 state-of-the-art models, revealing that modular or agent-based generation pipelines can effectively narrow the performance gap between open- and closed-source models while exposing critical limitations in director-level control and fine-grained audiovisual synchronization.
📝 Abstract
Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
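The Spearman rank correlation cited above measures how closely a metric's ranking of items agrees with the human ranking. The following is a minimal sketch of that calculation, not the authors' evaluation code: it assumes one automatic score and one human score per item, and the function names and sample scores are hypothetical. On the usual -1 to 1 scale, a reported 91.5% corresponds to a coefficient of 0.915.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five generated videos (illustrative values only).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
metric_scores = [0.80, 0.55, 0.60, 0.90, 0.20]
rho = spearman(human_scores, metric_scores)
```

Because the coefficient depends only on ranks, it rewards a metric for ordering items the way human raters do, regardless of the scale on which either score is expressed.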
Problem

Research questions and friction points this paper is trying to address.

multi-shot audio-video generation
evaluation benchmark
audio-visual synchronization
video generation
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-shot audio-video generation
adaptive evaluation framework
shot segmentation self-correction
tool-grounded evidence extraction
human-aligned benchmarking
Yujie Wei
Fudan University

Yujin Han
The University of Hong Kong
Machine Learning, Generative Model

Zhekai Chen
The University of Hong Kong

Yongming Li
Fudan University

Kaixun Jiang
Fudan University
Computer Vision, Adversarial Examples

Zhihang Liu
Tongyi Lab, Alibaba Group

Quanhao Li
Fudan University
Computer Vision, Video Generation

Zhiwu Qing
Huazhong University of Science and Technology
Video Understanding

Xiang Wang
Tongyi Lab, Alibaba Group

Zhen Xing
Alibaba Tongyi Lab | Zhejiang University | Fudan University
Computer Vision, Video Generation, AIGC, Video Diffusion

Ruihang Chu
Tsinghua University, CUHK, Wan
Generative AI, Vision-Language Model, Computer Vision

Lingyi Hong
Fudan University
Computer Vision

Yefei He
Zhejiang University
Computer Vision, Autoregressive Visual Generation, Model Quantization

Junjie Zhou
Nanjing University
Computer Vision, Machine Learning

Junqiu Yu
Fudan University

Yang Shi
Peking University
Multimodal Learning, Causal Inference, Reinforcement Learning

Difan Zou
The University of Hong Kong
Machine Learning, Deep Learning, Optimization, Stochastic Algorithms, Signal Processing

Kai Zhu
Tongyi Lab, Alibaba Group

Shiwei Zhang
Alibaba Group
Video Understanding, Video Generation

Yingya Zhang
Tongyi Lab, Alibaba Group

Yu Liu
Alibaba Group
Self-Supervised Learning, Generative Modeling

Xihui Liu
University of Hong Kong, UC Berkeley, CUHK, Tsinghua University
Computer Vision, Deep Learning

Hongming Shan
Fudan University; Rensselaer Polytechnic Institute
Machine Learning, Medical Imaging, Computer Vision