🤖 AI Summary
Evaluating text-to-audio (TTA) generation quality is hampered by high annotation costs and by the incomplete coverage of existing objective metrics. To address this, we introduce AudioEval, the first large-scale, multi-dimensional TTA evaluation dataset, comprising 4,200 samples and 126,000 expert and non-expert annotations, and pioneer a dual-perspective (expert + crowd) assessment paradigm. We further propose Qwen-DisQA, a multimodal scoring model that jointly encodes text prompts and audio waveforms to enable fine-grained, scalable, automated quality assessment. Experiments demonstrate that Qwen-DisQA aligns strongly with human judgments across semantic fidelity, audio quality, and naturalness (average Spearman's ρ > 0.85), significantly outperforming baseline methods. Both the dataset and model are publicly released to support further community research.
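
As a rough illustration of the alignment metric reported above, Spearman's ρ between model-predicted and human scores can be computed per perceptual dimension. The sketch below uses synthetic stand-in arrays, not AudioEval data, and the variable names are illustrative assumptions.

```python
# Illustrative only: measuring rank correlation between automatic scores
# and human ratings for one perceptual dimension. The arrays here are
# synthetic stand-ins, not actual AudioEval annotations or model outputs.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_scores = rng.uniform(1, 5, size=200)             # stand-in mean opinion scores
model_scores = human_scores + rng.normal(0, 0.4, 200)  # stand-in model predictions

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.2e})")
```

In practice this would be computed separately for each rated dimension and averaged, which is how an "average Spearman's ρ" figure is typically obtained.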
📝 Abstract
Text-to-audio (TTA) generation is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptual quality. To address this gap, we introduce AudioEval, the first large-scale TTA evaluation dataset, containing 4,200 audio samples from 24 systems with 126,000 ratings across five perceptual dimensions, annotated by both experts and non-experts. Based on this resource, we propose Qwen-DisQA, a multimodal scoring model that jointly processes text prompts and generated audio to predict human-like quality ratings. Experiments show its effectiveness in providing reliable and scalable evaluation. The dataset will be made publicly available to accelerate future research.
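
To make the joint prompt-and-audio scoring idea concrete, here is a minimal sketch of a fusion-and-regression head over pooled text and audio embeddings. The encoder choices, embedding dimensions, and class names are assumptions for illustration only, not the actual Qwen-DisQA architecture.

```python
# Hypothetical sketch, not the paper's design: fuse a pooled text-prompt
# embedding with a pooled audio embedding, then regress one quality
# score per perceptual dimension (five in AudioEval).
import torch
import torch.nn as nn

class PromptAudioScorer(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden=256, n_dims=5):
        super().__init__()
        # Concatenate the two modality embeddings and map to n_dims scores.
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_dims),
        )

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, text_dim); audio_emb: (batch, audio_dim)
        return self.fuse(torch.cat([text_emb, audio_emb], dim=-1))

scorer = PromptAudioScorer()
scores = scorer(torch.randn(2, 768), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 5]) -> one score per rated dimension
```

Such a head would typically be trained with a regression loss against the human ratings; the actual model's encoders and training objective are described in the paper itself.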