🤖 AI Summary
Current TTS evaluation faces three key challenges: (1) subjective metrics such as Mean Opinion Score (MOS) are not comparable across studies; (2) objective metrics correlate only weakly with human judgments; and (3) both are strained by recent systems whose synthetic speech is nearly indistinguishable from natural speech. To address these, we propose TTSDS2, an automatic evaluation metric that, uniquely among 16 compared baseline metrics, achieves a Spearman correlation above 0.50 for every domain and subjective dimension evaluated. We further introduce a continually updated multilingual TTS benchmark covering 14 languages, built on a pipeline that periodically recreates the multilingual test data to prevent data leakage. Finally, we release a manually annotated multilingual dataset comprising over 11,000 utterance-level subjective ratings. Together, these contributions improve the reliability, reproducibility, and cross-lingual generalizability of TTS evaluation.
📄 Abstract
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works, while objective metrics are frequently used but rarely validated against subjective ones. Both kinds of metrics are further challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech that is close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.