🤖 AI Summary
Existing TTS intelligibility evaluation over-relies on word-level metrics such as WER, which fail to capture what real-world speech comprehension actually requires. Method: We propose Spoken-Passage Multiple-Choice Question Answering (SP-MCQA), the first subjective evaluation framework targeting paragraph-level semantic understanding. It assesses the fidelity of key information in synthetic speech through multiple-choice question answering over a human-constructed, news-style benchmark (SP-MCQA-Eval, 8.76 hours). Contribution/Results: SP-MCQA reveals that low WER does not guarantee high intelligibility, exposing persistent flaws in text normalization and phoneme accuracy. Experiments show that state-of-the-art TTS models perform significantly worse on this cognitive comprehension task than their word-level metrics would suggest. SP-MCQA demonstrates superior sensitivity and ecological validity, establishing a new paradigm for TTS evaluation that shifts the focus from lexical accuracy to semantic intelligibility.
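For context, WER, the word-level metric that both the summary and the abstract contrast with SP-MCQA, is conventionally defined from the edit-distance alignment between an ASR transcript of the synthesized speech and the reference text (standard definition, not specific to this paper):

```latex
% Word error rate: S = substitutions, D = deletions, I = insertions
% in the minimum-edit-distance alignment; N = number of reference words.
\mathrm{WER} = \frac{S + D + I}{N}
```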
📝 Abstract
The evaluation of intelligibility for TTS has reached a bottleneck: existing assessments rely heavily on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering (SP-MCQA), a novel subjective approach that evaluates the accuracy of key information in synthesized speech, and we release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for higher-level, more life-like evaluation criteria now that many systems already excel on WER yet may fall short on real-world intelligibility.
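As a minimal sketch of how the multiple-choice protocol could be scored, the snippet below aggregates listener answers into per-passage key-information accuracy; the record layout, field names, and the `key_information_accuracy` helper are illustrative assumptions, not the released SP-MCQA-Eval format:

```python
from collections import defaultdict

def key_information_accuracy(responses, answer_key):
    """Aggregate multiple-choice answers into key-information accuracy.

    responses : iterable of (listener_id, passage_id, question_id, chosen_option)
    answer_key: dict mapping (passage_id, question_id) -> correct option label
    Returns (per-passage accuracy dict, overall accuracy across all responses).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _listener, passage, question, chosen in responses:
        total[passage] += 1
        if chosen == answer_key[(passage, question)]:
            correct[passage] += 1
    per_passage = {p: correct[p] / total[p] for p in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return per_passage, overall

# Example: two listeners answering two questions about one synthesized passage.
if __name__ == "__main__":
    key = {("news_001", "q1"): "B", ("news_001", "q2"): "D"}
    resp = [
        ("l1", "news_001", "q1", "B"),
        ("l1", "news_001", "q2", "A"),
        ("l2", "news_001", "q1", "B"),
        ("l2", "news_001", "q2", "D"),
    ]
    print(key_information_accuracy(resp, key))  # ({'news_001': 0.75}, 0.75)
```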