AI Summary
This work addresses the limitations of current automatic evaluation metrics for zero-shot text-to-speech (TTS) synthesis, which suffer from saturation and insufficient discriminative power among state-of-the-art systems, while subjective evaluations remain costly and poorly reproducible. To overcome this, the authors propose the Iterate to Differentiate (I2D) evaluation framework, which introduces an iterative synthesis mechanism that recursively uses a model's own generated speech as the reference, amplifying performance differences under distributional shift. By aggregating objective scores across multiple rounds and incorporating metrics such as UTMOSv2, I2D significantly improves alignment with human judgments. Experiments on English, Mandarin, and emotional-speech datasets demonstrate that I2D boosts system-level SRCC from 0.118 to 0.464, substantially enhancing the discriminability and reliability of zero-shot TTS evaluation.
Abstract
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish state-of-the-art systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal differences in robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotional-speech datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.