Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation

πŸ“… 2026-03-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of current automatic evaluation metrics for zero-shot text-to-speech (TTS) synthesis, which saturate and lack the discriminative power to separate state-of-the-art systems, while subjective evaluations remain costly and poorly reproducible. To overcome this, the authors propose the Iterate to Differentiate (I2D) evaluation framework, which recursively uses a model's own generated speech as the reference for subsequent synthesis, amplifying performance differences under the resulting distributional shift. By aggregating objective scores such as UTMOSv2 across multiple synthesis rounds, I2D markedly improves alignment with human judgments. Experiments on 11 models across English, Mandarin, and emotional datasets show that I2D raises the system-level SRCC of UTMOSv2 from 0.118 to 0.464, substantially enhancing the discriminability and reliability of zero-shot TTS evaluation.

πŸ“ Abstract
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that I2D enables more reliable automated evaluation for zero-shot TTS.
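The iterative mechanism the abstract describes can be sketched as a short loop: each round re-synthesizes the text using the previous round's output as the voice reference, scores the result with an objective metric, and aggregates the per-round scores. The function names (`synthesize`, `score`), the number of rounds, and the mean aggregation below are all illustrative assumptions, not the paper's exact procedure.

```python
def i2d_score(synthesize, score, reference_audio, text, rounds=3):
    """Hypothetical sketch of an I2D-style evaluation loop.

    synthesize(text, reference) -> audio : a zero-shot TTS call (assumed API)
    score(audio) -> float               : an objective metric, e.g. a
                                          UTMOSv2-style MOS predictor
    """
    scores = []
    ref = reference_audio
    for _ in range(rounds):
        audio = synthesize(text=text, reference=ref)
        scores.append(score(audio))
        # The model's own output becomes the next reference, inducing the
        # distributional shift that weaker models degrade under faster.
        ref = audio
    # Mean aggregation across rounds (an assumption; the paper's exact
    # aggregation scheme may differ).
    return sum(scores) / len(scores)
```

A higher-quality model's scores decay more slowly across rounds, so the aggregated value spreads systems apart more than a single-round score would.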
Problem

Research questions and friction points this paper is trying to address.

zero-shot TTS
evaluation
objective metrics
discriminability
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterate to Differentiate
zero-shot TTS evaluation
iterative synthesis
robustness
objective metric alignment