🤖 AI Summary
Evaluating text-to-audio (TTA) generation quality is hampered by high annotation costs and by the incomplete coverage of existing objective metrics. To address this, we introduce AudioEval, the first large-scale, multi-dimensional TTA evaluation dataset, comprising 4,200 samples and 126,000 expert and non-expert annotations, and pioneer a dual-perspective (expert + crowd) assessment paradigm. We further propose Qwen-DisQA, a multimodal scoring model that jointly encodes text prompts and audio waveforms to enable fine-grained, scalable, automated quality assessment. Experiments demonstrate that Qwen-DisQA aligns strongly with human judgments across semantic fidelity, audio quality, and naturalness (average Spearman's ρ > 0.85), significantly outperforming baseline methods. Both the dataset and model are publicly released to support further community research.
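
As a rough illustration of the alignment metric reported above, Spearman's ρ between model-predicted and human scores can be computed per perceptual dimension. The sketch below uses synthetic stand-in arrays, not AudioEval data, and the variable names are illustrative assumptions.

```python
# Illustrative only: measuring rank correlation between automatic scores
# and human ratings for one perceptual dimension. The arrays here are
# synthetic stand-ins, not actual AudioEval annotations or model outputs.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_scores = rng.uniform(1, 5, size=200)             # stand-in mean opinion scores
model_scores = human_scores + rng.normal(0, 0.4, 200)  # stand-in model predictions

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.2e})")
```

In practice this would be computed separately for each rated dimension and averaged, which is how an "average Spearman's ρ" figure is typically obtained.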
📝 Abstract
Text-to-audio (TTA) generation is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptual quality. To address this gap, we introduce AudioEval, the first large-scale TTA evaluation dataset, containing 4,200 audio samples from 24 systems with 126,000 ratings across five perceptual dimensions, annotated by both experts and non-experts. Based on this resource, we propose Qwen-DisQA, a multimodal scoring model that jointly processes text prompts and generated audio to predict human-like quality ratings. Experiments show its effectiveness in providing reliable and scalable evaluation. The dataset will be made publicly available to accelerate future research.
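
To make the joint prompt-and-audio scoring idea concrete, here is a minimal sketch of a fusion-and-regression head over pooled text and audio embeddings. The encoder choices, embedding dimensions, and class names are assumptions for illustration only, not the actual Qwen-DisQA architecture.

```python
# Hypothetical sketch, not the paper's design: fuse a pooled text-prompt
# embedding with a pooled audio embedding, then regress one quality
# score per perceptual dimension (five in AudioEval).
import torch
import torch.nn as nn

class PromptAudioScorer(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, hidden=256, n_dims=5):
        super().__init__()
        # Concatenate the two modality embeddings and map to n_dims scores.
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_dims),
        )

    def forward(self, text_emb, audio_emb):
        # text_emb: (batch, text_dim); audio_emb: (batch, audio_dim)
        return self.fuse(torch.cat([text_emb, audio_emb], dim=-1))

scorer = PromptAudioScorer()
scores = scorer(torch.randn(2, 768), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 5]) -> one score per rated dimension
```

Such a head would typically be trained with a regression loss against the human ratings; the actual model's encoders and training objective are described in the paper itself.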