CodecMOS-Accent: A MOS Benchmark of Resynthesized and TTS Speech from Neural Codecs Across English Accents

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the absence of human-centered subjective evaluation benchmarks for accented English speech in current neural audio codecs and large language model–driven text-to-speech (TTS) systems. The authors construct the first joint subjective evaluation dataset for multi-accent English, encompassing 10 accents and 4,000 samples, with 25 listeners providing Mean Opinion Scores (MOS) on naturalness, speaker similarity, and accent similarity, yielding 19,600 ratings. The findings reveal a strong correlation between speaker and accent similarity, highlight the limited predictive power of existing objective metrics, and uncover perceptual biases arising from listener–speaker accent alignment. This work establishes a human-centric foundation for evaluating accented speech synthesis and advances the development of more equitable and robust speech technologies.

📝 Abstract
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained on them, particularly on non-standard speech such as accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test collected 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. The dataset not only provides an up-to-date study of recent speech synthesis system performance but also reveals insights, including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the speaker's accent. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
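As a rough illustration of how ratings like those described in the abstract are typically summarized, the sketch below averages listener scores into a per-system MOS for each dimension and then correlates speaker similarity with accent similarity across systems. The record layout, system names, dimension labels, and scores are all hypothetical and are not taken from the dataset; the paper's actual analysis procedure is not stated here and may differ (for example, by using rank correlation or per-utterance aggregation).

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

# Hypothetical rating records: (system, dimension, score on a 1-5 scale).
# All values are invented for illustration; none come from CodecMOS-Accent.
RATINGS = [
    ("sys_a", "naturalness", 4), ("sys_a", "naturalness", 5),
    ("sys_a", "speaker_sim", 4), ("sys_a", "speaker_sim", 4),
    ("sys_a", "accent_sim", 4), ("sys_a", "accent_sim", 5),
    ("sys_b", "naturalness", 3), ("sys_b", "naturalness", 4),
    ("sys_b", "speaker_sim", 3), ("sys_b", "speaker_sim", 3),
    ("sys_b", "accent_sim", 3), ("sys_b", "accent_sim", 4),
    ("sys_c", "naturalness", 2), ("sys_c", "naturalness", 2),
    ("sys_c", "speaker_sim", 2), ("sys_c", "speaker_sim", 1),
    ("sys_c", "accent_sim", 2), ("sys_c", "accent_sim", 2),
]


def mos_by_system(ratings):
    """Average all scores for each (system, dimension) pair."""
    buckets = defaultdict(list)
    for system, dimension, score in ratings:
        buckets[(system, dimension)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def dimension_correlation(mos, dim_x, dim_y):
    """Correlate two rating dimensions across systems using per-system MOS."""
    systems = sorted({system for system, _ in mos})
    xs = [mos[(system, dim_x)] for system in systems]
    ys = [mos[(system, dim_y)] for system in systems]
    return pearson(xs, ys)


mos = mos_by_system(RATINGS)
r = dimension_correlation(mos, "speaker_sim", "accent_sim")
```

With real data, the same aggregation could be grouped by accent or by listener background instead of by system, which is one way to look for the listener-speaker accent effect the abstract mentions.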
Problem

Research questions and friction points this paper is trying to address.

neural audio codec
text-to-speech
accented speech
mean opinion score
subjective evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

neural audio codec
accented speech
subjective evaluation
text-to-speech
MOS benchmark