🤖 AI Summary
Existing TTS evaluation metrics such as WER and F0-RMSE are narrow in scope and correlate weakly with human perceptual judgments, and pitch-based metrics like F0-RMSE additionally depend on reference audio. To address these limitations, we propose TTScore, a reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore uses two sequence-to-sequence predictors conditioned on the input text: TTScore-int scores intelligibility through content tokens, and TTScore-pro scores prosody (intonation, rhythm, and related attributes) through prosody tokens. Each predictor computes the likelihood of the token sequence extracted from the synthesized utterance, so intelligibility and prosody are evaluated separately and the resulting scores are interpretable. Experiments on three benchmarks (SOMOS, VoiceMOS, and TTSArena) show that TTScore outperforms existing intelligibility and prosody metrics, with reported improvements of 12–28% in Pearson and Spearman correlations with human subjective ratings.
📝 Abstract
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
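The likelihood-based scoring described above can be illustrated with a minimal sketch. It assumes a predictor has already produced one vector of logits per token position, conditioned on the input text, and that the score is the length-normalized log-likelihood of the observed tokens. The function names, the normalization choice, and the toy inputs are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_log_likelihood(
    step_logits: Sequence[Sequence[float]],
    tokens: Sequence[int],
    normalize: bool = True,
) -> float:
    """Log-likelihood of `tokens` under per-step predictor logits.

    step_logits[t] holds the (hypothetical) predictor's logits for position t.
    tokens[t] is the discrete token observed in the synthesized utterance.
    With normalize=True the sum is divided by the sequence length, which is
    an assumption made here so that utterances of different lengths are
    comparable.
    """
    if len(step_logits) != len(tokens):
        raise ValueError("step_logits and tokens must have the same length")
    if not tokens:
        raise ValueError("token sequence must be non-empty")
    total = sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, tokens)
    )
    return total / len(tokens) if normalize else total


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary for a 2-step sequence.
    toy_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
    # Tokens the predictor expects score higher than tokens it does not.
    print(sequence_log_likelihood(toy_logits, [0, 1]))
    print(sequence_log_likelihood(toy_logits, [2, 0]))
```

Under this sketch, the same function would be applied twice per utterance, once to content tokens and once to prosody tokens, each with logits from its own predictor. A higher (less negative) value means the observed tokens agree more closely with what the text-conditioned predictor expects.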