🤖 AI Summary
To address the weak assessment of accent similarity in text-to-speech synthesis—particularly the low subjective reliability and failure of objective metrics for underrepresented accents—this paper proposes a novel human–machine collaborative evaluation framework. First, it improves the XAB listening test by integrating listener disagreement modeling, text-guided evaluation, and stringent inter-annotator reliability filtering to establish a lightweight, high-reliability subjective paradigm. Second, it introduces phonetically grounded objective metrics: interpretable measures based on vowel formant distance and phoneme posteriorgram similarity. Experiments demonstrate that the refined subjective protocol significantly enhances statistical power—reducing required participants by 40%. The proposed objective metrics achieve strong correlation with human judgments (Spearman’s ρ > 0.82) and substantially outperform mainstream ASR-based metrics (e.g., WER) on low-resource accents. Crucially, this work provides the first systematic analysis revealing inherent biases in WER and similar metrics for accent similarity assessment.
📝 Abstract
Despite growing interest in generating high-fidelity accents, evaluating accent similarity in speech synthesis has been underexplored. We aim to enhance both subjective and objective evaluation methods for accent similarity. Subjectively, we refine the XAB listening test by adding components that achieve higher statistical significance with fewer listeners and lower costs. Our method involves providing listeners with transcriptions, having them highlight perceived accent differences, and implementing meticulous screening for reliability. Objectively, we utilise pronunciation-related metrics, based on distances between vowel formants and phonetic posteriorgrams, to evaluate accent generation. Comparative experiments reveal that these metrics, alongside accent similarity, speaker similarity, and Mel Cepstral Distortion, can be used. Moreover, our findings underscore significant limitations of common metrics like Word Error Rate in assessing underrepresented accents.