🤖 AI Summary
This study addresses the absence of human-centered subjective evaluation benchmarks for accented English speech in current neural audio codecs and large language model–driven text-to-speech (TTS) systems. The authors construct the first joint subjective evaluation dataset for multi-accent English, encompassing 10 accents and 4,000 samples, with 25 listeners providing Mean Opinion Scores (MOS) on naturalness, speaker similarity, and accent similarity, yielding 19,600 ratings. The findings reveal a strong correlation between speaker and accent similarity, highlight the limited predictive power of existing objective metrics, and uncover perceptual biases arising from listener–speaker accent alignment. This work establishes a human-centric foundation for evaluating accented speech synthesis and advances the development of more equitable and robust speech technologies.
📝 Abstract
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, particularly for non-standard speech such as accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. The dataset not only provides an up-to-date study of recent speech synthesis system performance but also reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the same accent as the speaker. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
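To make the evaluation protocol concrete, the sketch below shows how per-system MOS for the three dimensions, and the speaker/accent similarity relationship highlighted above, could be computed from listener ratings. The file name and column names are hypothetical placeholders for illustration, not the dataset's actual schema.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical annotation table: one row per (listener, sample) rating.
# Column names ("system", "naturalness_mos", "speaker_mos", "accent_mos")
# are illustrative; the released dataset may use a different schema.
ratings = pd.read_csv("codecmos_accent_ratings.csv")

# Aggregate individual listener ratings into per-system MOS per dimension.
per_system = ratings.groupby("system")[
    ["naturalness_mos", "speaker_mos", "accent_mos"]
].mean()

# System-level correlation between speaker similarity and accent similarity.
r, p = pearsonr(per_system["speaker_mos"], per_system["accent_mos"])
print(f"Pearson r = {r:.3f} (p = {p:.3g})")

# Rank correlation is also commonly reported for MOS comparisons.
rho, p_s = spearmanr(per_system["speaker_mos"], per_system["accent_mos"])
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3g})")
```

The same grouping could be applied per accent or per listener group to probe the reported listener-speaker accent bias, assuming those fields are present in the release.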