🤖 AI Summary
This study addresses the absence of human-centered subjective evaluation benchmarks for accented English speech in current neural audio codecs and large language model–driven text-to-speech (TTS) systems. The authors construct the first joint subjective evaluation dataset for multi-accent English, encompassing 10 accents and 4,000 samples, with 25 listeners providing Mean Opinion Scores (MOS) on naturalness, speaker similarity, and accent similarity, yielding 19,600 ratings. The findings reveal a strong correlation between speaker and accent similarity, highlight the limited predictive power of existing objective metrics, and uncover perceptual biases arising from listener–speaker accent alignment. This work establishes a human-centric foundation for evaluating accented speech synthesis and advances the development of more equitable and robust speech technologies.
📝 Abstract
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, particularly for non-standard speech such as accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. The dataset not only provides an up-to-date study of recent speech synthesis system performance but also reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the same accent as the speaker. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
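To make the evaluation protocol concrete, the sketch below shows how per-system MOS for the three dimensions, and the speaker/accent similarity relationship highlighted above, could be computed from listener ratings. The file name and column names are hypothetical placeholders for illustration, not the dataset's actual schema.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical annotation table: one row per (listener, sample) rating.
# Column names ("system", "naturalness_mos", "speaker_mos", "accent_mos")
# are illustrative; the released dataset may use a different schema.
ratings = pd.read_csv("codecmos_accent_ratings.csv")

# Aggregate individual listener ratings into per-system MOS per dimension.
per_system = ratings.groupby("system")[
    ["naturalness_mos", "speaker_mos", "accent_mos"]
].mean()

# System-level correlation between speaker similarity and accent similarity.
r, p = pearsonr(per_system["speaker_mos"], per_system["accent_mos"])
print(f"Pearson r = {r:.3f} (p = {p:.3g})")

# Rank correlation is also commonly reported for MOS comparisons.
rho, p_s = spearmanr(per_system["speaker_mos"], per_system["accent_mos"])
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3g})")
```

The same grouping could be applied per accent or per listener group to probe the reported listener-speaker accent bias, assuming those fields are present in the release.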