🤖 AI Summary
Audio-visual generation lacks fine-grained, human-aligned automated evaluation metrics, and general-purpose multimodal models often fail to capture human perceptual judgments accurately. To address this gap, this work proposes AVBench, a human-centric benchmark for evaluating audio-visual generation along ten fine-grained dimensions. The authors build large-scale preference data by applying controlled perturbations to real-world videos, then fine-tune dedicated evaluators via preference learning to detect subtle cross-modal inconsistencies. Rather than emitting discrete textual judgments, the evaluators derive continuous scores from their prediction confidence on binary decisions, which improves alignment with human judgments. The resulting automated framework also shows potential for data filtering and can serve as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
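The controlled-perturbation idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `make_desync_pair`, the choice of an audio delay as the perturbation, and the zero-padding are all assumptions made for illustration. The sketch takes a synchronized audio track and produces a (preferred, rejected) pair in which the rejected track is deliberately out of sync with the video.

```python
import numpy as np


def make_desync_pair(audio, sample_rate, shift_seconds):
    """Build one hypothetical preference pair from a synchronized audio track.

    preferred: the original waveform (assumed to be in sync with the video)
    rejected:  the same waveform delayed by `shift_seconds`, zero-padded at
               the start and truncated at the end so the length is unchanged
    """
    shift = int(round(shift_seconds * sample_rate))
    if shift <= 0 or shift >= len(audio):
        raise ValueError("shift must be positive and shorter than the clip")
    padding = np.zeros(shift, dtype=audio.dtype)
    rejected = np.concatenate([padding, audio[:-shift]])
    return audio, rejected


# Toy example: 2 seconds of "audio" at 4 samples per second, delayed by 0.5 s.
clip = np.arange(1, 9, dtype=np.float32)
preferred, rejected = make_desync_pair(clip, sample_rate=4, shift_seconds=0.5)
```

Because the perturbation is controlled, the label for each pair is known by construction, so no human annotation is needed for this kind of pair. Other perturbation types would need their own functions.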
📝 Abstract
Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage: only a few coarse-grained benchmarks cover human-related scenarios, and they rely on limited preset evaluations with generic multimodal LLMs, which leads to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centric real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
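The probabilistic scoring mechanism can be sketched as follows. The abstract does not give the exact formula, so this example assumes the evaluator exposes two logits for a binary decision (here called `logit_yes` and `logit_no`, both hypothetical names) and takes the softmax probability of the positive answer as the continuous score.

```python
import math


def confidence_score(logit_yes, logit_no):
    """Continuous score in [0, 1] from two binary-decision logits.

    Assumed formulation: a two-way softmax over the "yes" and "no" logits,
    which equals a sigmoid of their difference. Both branches are written
    so that math.exp never receives a large positive argument.
    """
    diff = logit_yes - logit_no
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def discrete_judgment(logit_yes, logit_no):
    """VQA-style hard decision on the same logits, shown for contrast."""
    return "yes" if logit_yes >= logit_no else "no"


# Two samples that receive the same discrete answer but different scores.
barely_ok = confidence_score(0.1, 0.0)
clearly_ok = confidence_score(5.0, 0.0)
same_label = discrete_judgment(0.1, 0.0) == discrete_judgment(5.0, 0.0)
```

In this sketch both samples are labeled "yes" by the discrete judgment, while the continuous score separates the marginal case from the confident one. The score is also a smooth function of the logits, which is the property that makes this kind of signal usable as a differentiable reward.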