Pairwise Evaluation of Accent Similarity in Speech Synthesis

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

To address the weak assessment of accent similarity in text-to-speech synthesis—particularly the low subjective reliability and failure of objective metrics for underrepresented accents—this paper proposes a novel human–machine collaborative evaluation framework. First, it improves the XAB listening test by integrating listener disagreement modeling, text-guided evaluation, and stringent inter-annotator reliability filtering to establish a lightweight, high-reliability subjective paradigm. Second, it introduces phonetically grounded objective metrics: interpretable measures based on vowel formant distance and phoneme posteriorgram similarity. Experiments demonstrate that the refined subjective protocol significantly enhances statistical power—reducing required participants by 40%. The proposed objective metrics achieve strong correlation with human judgments (Spearman’s ρ > 0.82) and substantially outperform mainstream ASR-based metrics (e.g., WER) on low-resource accents. Crucially, this work provides the first systematic analysis revealing inherent biases in WER and similar metrics for accent similarity assessment.

Technology Category

Application Category

📝 Abstract

Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.

Problem

Research questions and friction points this paper is trying to address.

Enhancing subjective evaluation of accent similarity in speech synthesis

Improving objective metrics for accent generation assessment

Addressing limitations of Word Error Rate for underrepresented accents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined XAB test with transcriptions and highlighting

Pronunciation metrics using vowel formants and posteriorgrams

Combined accent, speaker similarity, and Mel Cepstral Distortion

🔎 Similar Papers

MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion