🤖 AI Summary
Current text-to-speech (TTS) systems lack rigorous evaluation of their human-likeness: specifically, their capacity to *deceive* human listeners into perceiving synthetic speech as natural human speech. Traditional subjective metrics do not quantify this deception capability objectively.
Method: We propose the Human Fooling Rate (HFR) as a novel, behaviorally grounded evaluation metric and conduct large-scale CMOS (Comparative Mean Opinion Score) and Turing-style human deception tests across diverse high-quality conversational speech datasets, enabling cross-model benchmarking under zero-shot and fine-tuned conditions.
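The paper does not ship an implementation, but the metric itself reduces to a simple proportion over listener verdicts. A minimal Python sketch of that computation (the function and variable names are ours, not the authors'):

```python
def human_fooling_rate(verdicts: list[bool]) -> float:
    """Share of a system's clips that listeners labeled as human.

    Applied to TTS output this is the fooling rate; applied to genuine
    human recordings it gives the ceiling a model is compared against.
    """
    if not verdicts:
        raise ValueError("need at least one listener verdict")
    return sum(verdicts) / len(verdicts)

# Toy usage: one boolean per (clip, listener) trial from a Turing-style
# test -- True means the listener judged the clip "human".
tts_verdicts   = [True, True, False, True, False, False, True, False]
human_verdicts = [True, True, True, False, True, True, False, True]

print(f"TTS HFR:   {human_fooling_rate(tts_verdicts):.2f}")    # 0.50
print(f"Human HFR: {human_fooling_rate(human_verdicts):.2f}")  # 0.75
```

As the results below suggest, the informative quantity is the gap between a model's HFR and the HFR of genuine human recordings on the same dataset, not the raw rate in isolation.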
Contribution/Results: Our analysis reveals that leading commercial TTS models achieve HFRs approaching those of human speakers (~45%) in zero-shot settings, whereas most open-source models lag substantially. Fine-tuning improves HFR but does not fully close the performance gap. This work pioneers the systematic integration of HFR into TTS evaluation, advancing toward more ecologically valid, interaction-aware assessment frameworks for speech synthesis.
📝 Abstract
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce the Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, since evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human-level deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.