MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation metrics struggle to capture cinematic expressiveness in multi-character audio-video generation, particularly high-level qualities such as coherent character performance and narrative atmosphere. To address this gap, this work introduces a multidimensional taxonomy of high-level failure modes for short-drama and scene-level generation, covering acting, narrative, atmosphere, and audio-visual language. The authors further construct a benchmark of more than 10,000 question-answering instances, with subsets for short-drama-level assessment and temporal localization of failures. Experiments show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models still struggle to diagnose complex cinematic expressiveness failures, which indicates that the benchmark remains challenging.
📝 Abstract
In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.
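As a rough illustration of the kind of evaluation the abstract describes, the sketch below scores an evaluator on question-answering instances grouped by taxonomy dimension and measures temporal localization with interval intersection-over-union. The abstract does not specify the instance format or the metrics, so everything here is an assumption: the `QAInstance` fields, the multiple-choice answer labels, and the use of IoU are hypothetical choices, not the authors' protocol.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category names mirror the four taxonomy dimensions named in the abstract;
# using them as string labels is an assumption of this sketch.
CATEGORIES = ("acting", "narrative", "atmosphere", "audio-visual language")


@dataclass
class QAInstance:
    """Hypothetical record for one evaluation question."""
    category: str    # one of CATEGORIES
    answer: str      # gold option label, e.g. "A" (assumed multiple choice)
    prediction: str  # option label chosen by the evaluator model


def accuracy_by_category(instances):
    """Return {category: fraction of instances answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        total[inst.category] += 1
        correct[inst.category] += int(inst.prediction == inst.answer)
    return {c: correct[c] / total[c] for c in total}


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0
```

Reporting accuracy per dimension rather than as a single number keeps the diagnosis aligned with the taxonomy: an evaluator that is strong on acting failures but weak on atmosphere failures would be visible in the breakdown.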
Problem

Research questions and friction points this paper is trying to address.

cinematic expressiveness
multi-talker audio-video generation
failure modes
scene-level generation
audio-visual language
Innovation

Methods, ideas, or system contributions that make the work stand out.

cinematic expressiveness
multi-talker audio-video generation
failure mode diagnosis
scene-level evaluation
audio-visual language
Authors
Haitian Li
Shanghai University
Yanghao Zhou
Beijing Institute of Technology
Heyan Huang
Beijing Institute of Technology
Liangji Chen
Shanghai Film Academy
YiMing Cheng
Tsinghua University
LLM, AI
Xu Liu
Hefei University of Technology
Dian Jin
Hefei University of Technology
Jiajun Xu
Inkeverse Group Limited
Jingyun Liao
Inkeverse Group Limited
Tian Lan
Beijing Institute of Technology
Large Language Model, Evaluation and Critique Ability, Text Generation, Multi-Modal
Ziqin Zhou
The University of Adelaide
Yueying Liu
Beijing University of Technology
Yu Bai
Beijing Academy of Artificial Intelligence
Multi-modal Models, Embodied AI
Changsen Yuan
Beijing University of Technology
Jinxing Zhou
OpenNLP Lab
Xian-Ling Mao
Beijing Institute of Technology
Web Data Mining, Information Extraction, QA & Dialogue, Topic Modeling, Learning to Hash
Xuefeng Chen
Inkeverse Group Limited
Yousheng Feng
Inkeverse Group Limited