🤖 AI Summary
This paper addresses the lack of unified, fair, and reproducible evaluation standards for voice cloning TTS models by introducing VCTK-Bench, the first open-source voice cloning benchmark. Methodologically, it establishes an end-to-end automated evaluation framework that quantifies three core dimensions (speaker similarity, naturalness, and robustness), incorporating ASR- and SSL-based speaker verification, MOS prediction models, adversarial sample generation, and cross-lingual generalization assessment. Key contributions include: (1) a standardized evaluation protocol; (2) a lightweight, open-source Python evaluation library; and (3) a dynamic, transparent, and continuously updated community leaderboard. Experiments across 12 state-of-the-art models show a strong correlation between automatic scores and human MOS ratings (Spearman ρ = 0.92), substantially improving evaluation efficiency and reproducibility.
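As one concrete illustration of the scoring ideas named above, the sketch below computes a cosine speaker-similarity score over embeddings and checks how well automatic scores track human MOS ratings with Spearman's ρ. It is a minimal sketch under assumptions: the embedding extractor is a placeholder (any ASR- or SSL-based speaker encoder could supply the vectors), and all scores are toy values, not results from the paper.

```python
# Minimal sketch, not the benchmark's actual API: cosine speaker similarity
# over speaker embeddings, plus Spearman correlation between automatic
# scores and human MOS. Embeddings and scores here are illustrative.
import numpy as np
from scipy.stats import spearmanr

def speaker_similarity(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Cosine similarity between reference and synthesized speaker embeddings."""
    return float(np.dot(emb_ref, emb_syn)
                 / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn)))

# Toy validation of automatic scores against human MOS, one value per model.
auto_scores = np.array([0.81, 0.74, 0.90, 0.62, 0.78])  # automatic scores
human_mos   = np.array([4.1, 3.8, 4.5, 3.2, 4.0])       # mean human MOS ratings

rho, p_value = spearmanr(auto_scores, human_mos)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```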
📝 Abstract
We present a novel benchmark for voice cloning text-to-speech models, consisting of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper discusses design considerations and gives a detailed description of the evaluation procedure. It also explains how to use the software library and how results are organized on the leaderboard.
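To make the leaderboard organization concrete, here is a hypothetical sketch of how per-model scores along the three protocol dimensions might be aggregated and ranked. The field names, rescaling, and aggregation rule are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical leaderboard layout: one row per model covering the three
# protocol dimensions, ranked by a placeholder unweighted mean. Names and
# numbers are invented for illustration only.
from statistics import mean

results = {
    "model_a": {"speaker_similarity": 0.86, "naturalness": 4.2, "robustness": 0.91},
    "model_b": {"speaker_similarity": 0.79, "naturalness": 4.4, "robustness": 0.88},
}

def overall(scores: dict) -> float:
    """Unweighted mean across dimensions (a placeholder aggregation rule)."""
    # Naturalness is on a 1-5 MOS-like scale; rescale to [0, 1] before averaging.
    rescaled = dict(scores, naturalness=(scores["naturalness"] - 1) / 4)
    return mean(rescaled.values())

leaderboard = sorted(results.items(), key=lambda kv: overall(kv[1]), reverse=True)
for rank, (model, scores) in enumerate(leaderboard, start=1):
    print(rank, model, f"{overall(scores):.3f}")
```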