AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio Generation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating text-to-audio (TTA) generation quality faces challenges of high annotation cost and incomplete coverage by existing objective metrics. To address this, we introduce the first large-scale, multi-dimensional TTA evaluation dataset—comprising 4,200 samples and 126,000 expert and non-expert annotations—and pioneer a dual-perspective (expert + crowd) assessment paradigm. We further propose Qwen-DisQA, a multimodal scoring model that jointly encodes text prompts and audio waveforms to enable fine-grained, scalable, automated quality assessment. Experiments demonstrate that Qwen-DisQA achieves strong human alignment across semantic fidelity, audio quality, and naturalness (average Spearman’s ρ > 0.85), significantly outperforming baseline methods. Both the dataset and model are publicly released to foster community advancement.
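The summary reports human alignment as Spearman's rank correlation between the model's scores and human ratings. As a minimal illustration (not the authors' code), this is how Spearman's ρ can be computed between a scorer's per-system predictions and mean human ratings; the score values below are hypothetical:

```python
def ranks(values):
    """Average ranks (1-based), resolving ties by mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # group tied values together
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system scores: model predictions vs. mean human ratings.
model_scores = [3.1, 4.2, 2.5, 3.8, 4.6]
human_mos    = [3.0, 3.9, 2.7, 4.0, 4.5]
print(round(spearman(model_scores, human_mos), 4))  # → 0.9
```

A ρ near 1 means the scorer ranks systems in nearly the same order as human raters, which is the alignment criterion the summary's "ρ > 0.85" refers to.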

📝 Abstract
Text-to-audio (TTA) is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptual quality. To address this gap, we introduce AudioEval, the first large-scale TTA evaluation dataset, containing 4,200 audio samples from 24 systems with 126,000 ratings across five perceptual dimensions, annotated by both experts and non-experts. Based on this resource, we propose Qwen-DisQA, a multimodal scoring model that jointly processes text prompts and generated audio to predict human-like quality ratings. Experiments show its effectiveness in providing reliable and scalable evaluation. The dataset will be made publicly available to accelerate future research.
Problem

Research questions and friction points this paper is trying to address.

Automates evaluation of text-to-audio generation quality
Addresses limitations in human and existing objective assessments
Provides scalable multimodal scoring for perceptual audio dimensions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset with expert and non-expert ratings
Multimodal model jointly processes text prompts and audio
Predicts human-like quality ratings across five dimensions
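The bullets above describe a scorer that fuses a text-prompt representation with an audio representation and outputs ratings on five perceptual dimensions. A shape-level sketch of that data flow (this is an illustrative assumption, not Qwen-DisQA's actual architecture; all dimensions and weights are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(text_emb, audio_emb, W, b):
    """Fuse the two embeddings by concatenation, then apply a linear
    head mapping the fused vector to five quality-dimension scores."""
    fused = np.concatenate([text_emb, audio_emb])  # (d_text + d_audio,)
    return W @ fused + b                           # (5,)

d_text, d_audio = 16, 32                            # toy embedding sizes
W = rng.standard_normal((5, d_text + d_audio)) * 0.1  # toy head weights
b = np.zeros(5)

scores = score(rng.standard_normal(d_text), rng.standard_normal(d_audio), W, b)
print(scores.shape)  # → (5,)
```

In a real system the two embeddings would come from pretrained text and audio encoders, and the head would be trained to regress the human ratings collected in AudioEval.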
👥 Authors
Hui Wang (College of Computer Science, Nankai University, China)
Jinghua Zhao (Nankai University)
Cheng Liu (College of Computer Science, Nankai University, China)
Yuhang Jia (College of Computer Science, Nankai University, China)
Haoqin Sun (Nankai University; topics: Affective computing, Speech signal processing, Audio understanding)
Jiaming Zhou (College of Computer Science, Nankai University, China)
Yong Qin (College of Computer Science, Nankai University, China)