The State Of TTS: A Case Study with Human Fooling Rates

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-speech (TTS) systems lack rigorous evaluation of their human-likeness, specifically their capacity to *deceive* human listeners into perceiving synthetic speech as natural human speech; traditional subjective metrics do not quantify this capability objectively. Method: We propose the Human Fooling Rate (HFR) as a novel, behaviorally grounded evaluation metric and conduct large-scale CMOS (Comparison Mean Opinion Score) and Turing-style human deception tests across diverse high-quality conversational speech datasets, enabling cross-model benchmarking under zero-shot and fine-tuned conditions. Contribution/Results: Our analysis reveals that leading commercial TTS models achieve HFRs approaching those of human speakers (~45%) in zero-shot settings, whereas most open-source models lag substantially. Fine-tuning improves HFR but does not fully close the gap. This work integrates HFR systematically into TTS evaluation, advancing toward more ecologically valid, interaction-aware assessment frameworks for speech synthesis.

📝 Abstract
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce the Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human-level deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
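As described above, HFR measures how often machine-generated speech is mistaken for human. A minimal sketch of that computation, assuming HFR is simply the proportion of synthetic clips that listeners label "human" (the paper's full protocol involves Turing-style and CMOS testing, so this is an illustrative simplification):

```python
def human_fooling_rate(judgments):
    """Compute HFR from a list of listener judgments on synthetic clips.

    judgments: list of booleans, True where a listener labeled a
    machine-generated clip as 'human'.
    Returns the fraction of clips that fooled the listener.
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Example: 9 of 20 synthetic clips mistaken for human gives HFR = 0.45,
# the same order as the ~45% reported for strong zero-shot commercial systems.
votes = [True] * 9 + [False] * 11
print(f"HFR = {human_fooling_rate(votes):.2f}")  # HFR = 0.45
```

Note that in a forced-choice human-vs-machine test, even genuine human speech tops out near 50%, which is why the abstract argues references should themselves achieve high HFRs to set a meaningful bar.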
Problem

Research questions and friction points this paper is trying to address.

Measure TTS deception rate via Human Fooling Rate (HFR)
Assess human parity claims under realistic deception testing
Compare commercial and open-source TTS naturalness gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Human Fooling Rate (HFR) metric
Evaluates TTS models with deception testing
Highlights gap in open-source vs commercial TTS
Praveen Srinivasa Varadhan
AI4Bharat, Indian Institute of Technology Madras, India
Sherry Thomas
AI4Bharat, Indian Institute of Technology Madras, India
Sai Teja M. S.
AI4Bharat, Indian Institute of Technology Madras, India
Suvrat Bhooshan
Gan.ai, ex Facebook AI Research (FAIR)
Deep Learning · Computer Vision · Medical Imaging
Mitesh M. Khapra
AI4Bharat, Indian Institute of Technology Madras, India