VABench: A Comprehensive Benchmark for Audio-Video Generation

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation benchmarks lack effective evaluation of synchronized audio-visual generation. This paper introduces VABench, the first multi-dimensional benchmark designed specifically for synchronized audio-visual generation, covering three core tasks: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It incorporates 15 fine-grained metrics to systematically assess cross-modal alignment, audio-visual synchronization, lip-sync consistency, and audio-visual question answering (AV-QA). The authors propose a novel audio-video co-generation evaluation paradigm and establish a unified assessment framework spanning multiple dimensions, categories, and tasks, augmented by a real-scenario-driven AV-QA subset. Evaluation leverages CLIP/CLAP similarity, temporal alignment analysis, lip-motion–speech consistency detection, and customized QA pairs. Benchmarking 12 state-of-the-art models across seven content domains reveals two fundamental bottlenecks: insufficient synchronization fidelity and semantic inconsistency. VABench advances standardization in audio-visual generative evaluation.
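The temporal alignment analysis mentioned above boils down to estimating the lag between audio events and the visual motion they should accompany. The sketch below illustrates the core idea with a brute-force cross-correlation over toy 1-D envelopes; this is an illustrative stand-in, not VABench's actual metric, which operates on learned audio-visual features.

```python
def best_lag(audio_env, motion_env, max_lag=5):
    """Return the lag (in frames) that maximizes the correlation
    between an audio energy envelope and a visual motion envelope.
    A nonzero result indicates audio-video desynchronization."""
    def corr_at(lag):
        pairs = [(audio_env[i], motion_env[i + lag])
                 for i in range(len(audio_env))
                 if 0 <= i + lag < len(motion_env)]
        return sum(a * m for a, m in pairs)
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# Toy envelopes: the visual motion peak trails the audio peak by 2 frames.
audio = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
motion = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
print(best_lag(audio, motion))  # → 2
```

A real synchronization metric would extract the audio envelope from the waveform (e.g. frame-wise RMS energy) and the motion envelope from optical flow magnitude, then penalize generated videos whose best lag deviates from zero.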

📝 Abstract
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
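The pairwise similarity dimensions described in the abstract (text-video, text-audio, video-audio) are typically scored as cosine similarity between embeddings from a shared space such as CLIP (text/frames) or CLAP (text/audio). A minimal sketch of that scoring step, using stand-in vectors in place of real encoder outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stand-in embeddings; in practice these would come from CLIP
# (text and pooled video-frame features) and CLAP (text and audio
# features), not hand-written vectors.
text_emb = [0.20, 0.90, 0.10]
video_emb = [0.25, 0.85, 0.05]
audio_emb = [0.90, 0.10, 0.30]

tv = cosine_similarity(text_emb, video_emb)   # text–video alignment
ta = cosine_similarity(text_emb, audio_emb)   # text–audio alignment
va = cosine_similarity(video_emb, audio_emb)  # video–audio alignment
print(f"T-V: {tv:.3f}  T-A: {ta:.3f}  V-A: {va:.3f}")
```

Higher scores indicate better cross-modal agreement; a benchmark aggregates these per-sample scores across prompts and content categories.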
Problem

Research questions and friction points this paper is trying to address.

Evaluates audio-video generation synchronization and quality
Assesses models across multiple tasks and content categories
Establishes a benchmark for comprehensive multi-dimensional evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for audio-video generation evaluation
Multi-dimensional assessment across 15 specific dimensions
Covers diverse content categories and task types
👥 Authors
Daili Hua, Peking University
Xizhi Wang, Huazhong University of Science and Technology
Bohan Zeng, PhD student, Peking University (Data-Centric AI, Computer Vision, Diffusion Model, 3D)
Xinyi Huang, Peking University
Hao Liang, Peking University
Junbo Niu, Peking University (Foundation Model)
Xinlong Chen, Institute of Automation, Chinese Academy of Sciences
Quanqing Xu, Ant Group (Cloud Computing, Cloud Storage, Large-scale Hybrid Storage Systems)
Wentao Zhang, Institute of Physics, Chinese Academy of Sciences (photoemission, superconductivity, cuprate, HTSC, time-resolved)