Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

📅 2025-09-24
🤖 AI Summary
Existing TTS evaluation metrics are narrow in scope and correlate weakly with human perceptual judgments: WER gives only a coarse, text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics depend on reference audio. To address these limitations, we propose TTScore, a reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore uses two text-conditioned sequence-to-sequence predictors: one over content tokens (capturing intelligibility) and one over prosody tokens (encoding intonation, rhythm, and other prosodic attributes). Each predictor scores a synthesized utterance by the likelihood of its token sequence given the input text, so intelligibility and prosody are evaluated separately and interpretably, without reference speech. Experiments on three benchmarks (SOMOS, VoiceMOS, and TTSArena) show that TTScore correlates more strongly with human subjective ratings than existing metrics, with improvements of 12–28% in Pearson and Spearman correlation.
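
As a minimal sketch of how an objective metric is compared against human ratings, the snippet below computes Pearson and Spearman correlations in plain Python. The function names and the four data points are made up for illustration; they are not results from the paper, and the paper's own evaluation protocol may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative values only (assumed, not taken from any benchmark):
# one objective score and one mean opinion score per utterance.
metric_scores = [-2.1, -1.4, -0.9, -0.5]
human_mos = [2.0, 3.1, 3.0, 4.2]

print("Pearson :", round(pearson(metric_scores, human_mos), 3))
print("Spearman:", round(spearman(metric_scores, human_mos), 3))
```

Pearson rewards a linear relationship between metric and rating, while Spearman only asks that the two rank utterances in the same order, which is why both are commonly reported together.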

📝 Abstract
Objective evaluation of synthesized speech is critical for advancing speech generation systems, yet existing metrics for intelligibility and prosody remain limited in scope and weakly correlated with human perception. Word Error Rate (WER) provides only a coarse text-based measure of intelligibility, while F0-RMSE and related pitch-based metrics offer a narrow, reference-dependent view of prosody. To address these limitations, we propose TTScore, a targeted and reference-free evaluation framework based on conditional prediction of discrete speech tokens. TTScore employs two sequence-to-sequence predictors conditioned on input text: TTScore-int, which measures intelligibility through content tokens, and TTScore-pro, which evaluates prosody through prosody tokens. For each synthesized utterance, the predictors compute the likelihood of the corresponding token sequences, yielding interpretable scores that capture alignment with intended linguistic content and prosodic structure. Experiments on the SOMOS, VoiceMOS, and TTSArena benchmarks demonstrate that TTScore-int and TTScore-pro provide reliable, aspect-specific evaluation and achieve stronger correlations with human judgments of overall quality than existing intelligibility and prosody-focused metrics.
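
The likelihood-based scoring described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the predictor interface, the toy predictor, and the choice to normalize the log-likelihood by sequence length are all hypothetical, since the abstract does not specify them.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface (an assumption, not from the paper): given the input
# text and the tokens seen so far, return one probability per vocabulary entry.
NextTokenProbs = Callable[[str, Sequence[int]], Sequence[float]]


def sequence_log_likelihood(
    text: str, tokens: Sequence[int], predictor: NextTokenProbs
) -> float:
    """Sum of log p(token_t | text, tokens_<t) over the observed sequence."""
    total = 0.0
    for t, token in enumerate(tokens):
        probs = predictor(text, tokens[:t])
        # math.log raises ValueError if the predictor assigns probability 0.
        total += math.log(probs[token])
    return total


def likelihood_score(
    text: str, tokens: Sequence[int], predictor: NextTokenProbs
) -> float:
    """Average per-token log-likelihood (length normalization is assumed here)."""
    if not tokens:
        raise ValueError("token sequence must not be empty")
    return sequence_log_likelihood(text, tokens, predictor) / len(tokens)


def toy_predictor(text: str, prefix: Sequence[int]) -> Sequence[float]:
    """Stand-in for a trained model: favors token id len(prefix) % 4."""
    vocab_size = 4
    expected = len(prefix) % vocab_size
    return [0.7 if i == expected else 0.1 for i in range(vocab_size)]


# Tokens that follow the pattern the toy predictor expects score higher
# than tokens that mostly deviate from it.
matching_tokens = [0, 1, 2, 3]
mismatched_tokens = [3, 3, 3, 3]

print(likelihood_score("hello world", matching_tokens, toy_predictor))
print(likelihood_score("hello world", mismatched_tokens, toy_predictor))
```

In this framing, an intelligibility score and a prosody score would come from the same scoring routine applied with two different predictors, one over content tokens and one over prosody tokens extracted from the synthesized utterance.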
Problem

Research questions and friction points this paper is trying to address.

Evaluating synthesized speech intelligibility and prosody objectively
Overcoming limitations of existing metrics like WER and F0-RMSE
Developing reference-free evaluation framework using discrete token prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free evaluation via discrete token prediction
Dual predictors for intelligibility and prosody assessment
Sequence likelihood scoring for interpretable metrics