AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

πŸ“… 2026-04-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing text-to-audio-video generation models lack a unified, fine-grained evaluation framework, making it difficult to assess multimodal semantic consistency in real-world scenarios. To address this gap, this work proposes the first task-oriented, multi-granular benchmark, comprising 11 categories of high-quality, realistic prompts. It also introduces an automatic evaluation framework that integrates lightweight expert models with multimodal large language models (MLLMs) to comprehensively assess generation quality, from perceptual fidelity to semantic controllability. Experimental results show that while current models exhibit strong audiovisual aesthetics, they suffer from systematic deficiencies in critical dimensions such as text rendering, speech coherence, physical reasoning, and musical pitch control, revealing significant shortcomings in semantic reliability.
πŸ“ Abstract
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
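The framework pairs lightweight specialist models (for perceptual metrics) with MLLM judges (for fine-grained semantic checks). A minimal sketch of how such per-dimension scores might be aggregated into a report; all dimension names, groupings, and the equal-weight averaging are hypothetical illustrations, not details from the paper:

```python
# Illustrative dimension groupings (hypothetical, not from AVGen-Bench).
SPECIALIST_DIMS = {"perceptual_quality", "av_sync"}          # lightweight expert models
MLLM_DIMS = {"text_rendering", "speech_coherence",
             "physical_reasoning", "pitch_control"}          # MLLM-judged dimensions

def aggregate(scores: dict[str, float]) -> dict[str, float]:
    """Average specialist and MLLM dimensions separately, then overall."""
    spec = [v for k, v in scores.items() if k in SPECIALIST_DIMS]
    sem = [v for k, v in scores.items() if k in MLLM_DIMS]
    report = {
        "perceptual": sum(spec) / len(spec),
        "semantic": sum(sem) / len(sem),
    }
    # Equal weighting is an arbitrary choice for this sketch.
    report["overall"] = (report["perceptual"] + report["semantic"]) / 2
    return report

# Example scores echoing the paper's finding: strong aesthetics,
# weak semantic reliability (values are made up).
sample = {
    "perceptual_quality": 0.86, "av_sync": 0.78,
    "text_rendering": 0.31, "speech_coherence": 0.42,
    "physical_reasoning": 0.40, "pitch_control": 0.12,
}
print(aggregate(sample))
```

Reporting perceptual and semantic axes separately, rather than a single fused score, is what makes the aesthetics-versus-reliability gap visible.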
Problem

Research questions and friction points this paper is trying to address.

Text-to-Audio-Video generation
evaluation benchmark
multi-granular evaluation
semantic reliability
audio-visual generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-to-audio-video generation
multi-granular evaluation
task-driven benchmark
multimodal large language models
semantic controllability