Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

📅 2025-05-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses key limitations of human-likeness evaluation for Chinese text-to-speech (TTS): subjectivity, unidimensional scoring, and insufficient coverage of contexts and speaking styles. The authors propose ATT, the first multidimensional evaluation framework grounded in the Turing-test paradigm. The contributions are threefold: (1) ATT-Corpus, the first Chinese TTS benchmark spanning diverse contexts, speaker styles, and adversarial trap utterances; (2) a human discrimination-based evaluation protocol in which listeners judge whether a voice sounds human; and (3) Auto-ATT, an automatic evaluator fine-tuned from Qwen2-Audio-Instruct that integrates human-preference supervision and multi-dimensional speech sampling. Experiments demonstrate that ATT enables fine-grained differentiation of TTS models across naturalness, emotional expressiveness, and stylistic controllability. Auto-ATT achieves high agreement with human ratings (Spearman ρ > 0.92), improves evaluation efficiency by two orders of magnitude, and is publicly released on Hugging Face.

πŸ“ Abstract
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression and bringing TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances; this is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a simple, Turing-test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions through its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT are available in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
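The protocol and the agreement metric described in the abstract can be sketched in plain Python. This is illustrative only: the function names and the scores below are our own assumptions, not artifacts from the paper. A model's human-likeness score is the fraction of its utterances that listeners judge as human, and agreement between an automatic evaluator's per-model scores and human scores is measured with Spearman's rank correlation:

```python
# Hedged sketch of ATT-style scoring: binary "sounds human?" judgments are
# aggregated per model, then automatic and human scores are compared with
# Spearman's rho (Pearson correlation computed on ranks).

def human_likeness(judgments):
    """Fraction of utterances judged 'human' (True) by listeners."""
    return sum(judgments) / len(judgments)

def _ranks(xs):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied rank positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman rank correlation of two equal-length score lists."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Illustrative (made-up) per-model scores: human panel vs. automatic evaluator.
human_scores = [0.82, 0.45, 0.61, 0.30, 0.74]
auto_scores = [0.85, 0.40, 0.58, 0.35, 0.70]
print(spearman_rho(human_scores, auto_scores))  # rank-aligned, so rho ≈ 1.0
```

Because Spearman's rho depends only on rankings, the automatic evaluator does not need to reproduce human scores exactly; it only needs to order models the same way, which is what the paper's reported ρ > 0.92 reflects.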
Problem

Research questions and friction points this paper is trying to address.

Evaluating human-likeness of Chinese LLM-based TTS systems
Addressing subjectivity and limitations in MOS-based TTS evaluation
Developing multi-dimensional corpus and automatic evaluation for Chinese TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Audio Turing Test for human-likeness evaluation
Uses multi-dimensional Chinese corpus ATT-Corpus
Fine-tunes Qwen2-Audio-Instruct as Auto-ATT