Pairwise Evaluation of Accent Similarity in Speech Synthesis

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weaknesses in assessing accent similarity in text-to-speech synthesis, in particular the low reliability of subjective tests and the failure of objective metrics on underrepresented accents, this paper proposes a human–machine collaborative evaluation framework. First, it refines the XAB listening test by integrating listener disagreement modelling, text-guided evaluation, and stringent inter-annotator reliability filtering, yielding a lightweight, high-reliability subjective paradigm. Second, it introduces phonetically grounded objective metrics: interpretable measures based on vowel formant distance and phoneme posteriorgram similarity. Experiments show that the refined subjective protocol significantly increases statistical power, reducing the number of required participants by 40%. The proposed objective metrics correlate strongly with human judgments (Spearman’s ρ > 0.82) and substantially outperform mainstream ASR-based metrics (e.g., WER) on low-resource accents. Crucially, this work provides the first systematic analysis of the inherent biases of WER and similar metrics for accent similarity assessment.
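A correlation like the summary's headline figure (Spearman’s ρ between an objective metric and human judgments) is straightforward to compute. The sketch below is illustrative only: the per-utterance scores are made up, not the paper's data, and a small pure-Python Spearman implementation stands in for a stats library.

```python
# Illustrative sketch (not the paper's data): validating an objective
# accent-similarity metric against human judgments with Spearman's rho.

def rank(values):
    """Ranks starting at 1; ties receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-utterance scores: objective metric vs. mean listener rating.
objective = [0.91, 0.74, 0.62, 0.85, 0.40, 0.55, 0.78, 0.69]
listener  = [4.8,  3.9,  3.6,  4.5,  2.0,  2.9,  4.1,  3.1]
print(f"Spearman's rho = {spearman_rho(objective, listener):.3f}")
```

With the toy scores above, the two rankings disagree on only one pair of utterances, giving a rho close to 1.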

📝 Abstract
Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used to evaluate accent generation. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.
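The vowel-formant distance the abstract mentions can be sketched as a mean Euclidean distance over (F1, F2) pairs of vowels shared between reference and synthetic speech. This is a minimal sketch, not the paper's exact formulation: the function name and the formant values (in Hz) are hypothetical placeholders, and in practice formants would be measured with a tool such as Praat.

```python
# Minimal sketch of a formant-based pronunciation distance. The vowel labels
# and F1/F2 values below are hypothetical; real values would be measured from
# audio (e.g. with Praat).
import math

def formant_distance(ref_formants, syn_formants):
    """Mean Euclidean distance over (F1, F2) of vowels present in both dicts."""
    shared = ref_formants.keys() & syn_formants.keys()
    if not shared:
        raise ValueError("no shared vowels to compare")
    return sum(
        math.dist(ref_formants[v], syn_formants[v]) for v in shared
    ) / len(shared)

# Hypothetical (F1, F2) measurements in Hz for three shared vowels.
reference = {"i": (280, 2250), "a": (730, 1090), "u": (300, 870)}
synthetic = {"i": (310, 2150), "a": (760, 1120), "u": (330, 950)}
print(f"mean formant distance: {formant_distance(reference, synthetic):.1f} Hz")
```

A lower distance indicates the synthesised vowels sit closer to the reference accent's vowel space; normalising formants per speaker (e.g. Lobanov z-scoring) would be a sensible refinement before comparing across voices.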
Problem

Research questions and friction points this paper is trying to address.

Enhancing subjective evaluation of accent similarity in speech synthesis
Improving objective metrics for accent generation assessment
Addressing limitations of Word Error Rate for underrepresented accents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined XAB test with transcriptions and highlighting
Pronunciation metrics using vowel formants and posteriorgrams
Combined accent, speaker similarity, and Mel Cepstral Distortion
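The posteriorgram-based metric listed above can be sketched as an average frame-wise cosine similarity between two phonetic posteriorgrams (PPGs). This is a hedged illustration, not the paper's implementation: real PPGs come from an acoustic model and the two utterances would first be time-aligned (e.g. with DTW), both of which are omitted here, so the toy matrices are assumed already frame-aligned.

```python
# Hedged sketch: compare two phonetic posteriorgrams (frames x phonemes) by
# averaging frame-wise cosine similarity. PPG extraction and time alignment
# are omitted; the toy matrices below are assumed frame-aligned.
import math

def posteriorgram_similarity(ppg_a, ppg_b):
    """Mean cosine similarity over aligned frames of two PPG matrices."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return sum(cosine(u, v) for u, v in zip(ppg_a, ppg_b)) / len(ppg_a)

# Toy 3-frame x 4-phoneme posterior matrices (each row sums to 1).
ppg_ref = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
ppg_syn = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]]
print(f"mean frame cosine similarity: "
      f"{posteriorgram_similarity(ppg_ref, ppg_syn):.3f}")
```

Because PPGs factor out speaker identity while preserving pronunciation, a metric like this targets accent realisation more directly than waveform- or spectrum-level distances.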
Jinzuomu Zhong
Centre for Speech Technology Research, University of Edinburgh, United Kingdom

Suyuan Liu
National University of Defense Technology
Multi-view Clustering · Anchor Learning · Graph Learning

Dan Wells
Centre for Speech Technology Research, University of Edinburgh, United Kingdom

Korin Richmond
Centre for Speech Technology Research, University of Edinburgh
Speech synthesis · articulatory modelling · articulatory-acoustic relationship · lexicography