AI Summary
This paper addresses key limitations in human-likeness evaluation for Chinese text-to-speech (TTS): existing protocols are subjective, unidimensional, and lack coverage of diverse contexts and speaking styles. We propose the Audio Turing Test (ATT), the first multidimensional evaluation framework grounded in the Turing test paradigm. Our contributions are threefold: (1) ATT-Corpus, the first Chinese TTS benchmark encompassing diverse contexts, speaker styles, and adversarial trap utterances; (2) a human discrimination-based evaluation protocol; and (3) Auto-ATT, an automated evaluator fine-tuned from Qwen2-Audio-Instruct that integrates human preference supervision and multi-dimensional speech sampling. Experiments demonstrate that ATT enables fine-grained differentiation of TTS models across naturalness, emotional expressiveness, and stylistic controllability. Auto-ATT achieves high agreement with human ratings (Spearman ρ > 0.92), improves evaluation efficiency by two orders of magnitude, and is publicly released on Hugging Face.
Abstract
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, and bringing TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS system evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a simple, Turing-test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
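The protocol's core statistic follows directly from the binary judgment: each evaluator verdict is "sounds human" or not, and a system's human-likeness is the fraction of its utterances judged human. A minimal sketch of this aggregation, with hypothetical system names and votes (the paper's actual tooling may differ):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def human_likeness_rates(judgments: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (system, judged_human) verdicts into a per-system
    human-likeness rate, the core statistic of a Turing-test-style protocol."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [human votes, total votes]
    for system, judged_human in judgments:
        counts[system][0] += int(judged_human)
        counts[system][1] += 1
    return {s: human / total for s, (human, total) in counts.items()}


# Hypothetical verdicts from several evaluators over two TTS systems:
votes = [
    ("tts_a", True), ("tts_a", False), ("tts_a", True), ("tts_a", True),
    ("tts_b", False), ("tts_b", True),
]
rates = human_likeness_rates(votes)
```

Collapsing the rating task to a single binary question is what removes the calibration burden of a 5-point MOS scale: evaluators no longer need to agree on what "4" means, only on whether the voice passes as human.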