🤖 AI Summary
Current text-to-speech (TTS) systems lack rigorous evaluation of their human-likeness: specifically, their capacity to *deceive* human listeners into perceiving synthetic speech as natural human speech. Traditional subjective metrics do not quantify this deception capability objectively.
Method: We propose the Human Fooling Rate (HFR) as a novel, behaviorally grounded evaluation metric and conduct large-scale CMOS (Comparative Mean Opinion Score) and Turing-style human deception tests across diverse high-quality conversational speech datasets, enabling cross-model benchmarking under zero-shot and fine-tuned conditions.
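The paper does not ship an implementation, but the metric itself reduces to a simple proportion over listener verdicts. A minimal Python sketch of that computation (the function and variable names are ours, not the authors'):

```python
def human_fooling_rate(verdicts: list[bool]) -> float:
    """Share of a system's clips that listeners labeled as human.

    Applied to TTS output this is the fooling rate; applied to genuine
    human recordings it gives the ceiling a model is compared against.
    """
    if not verdicts:
        raise ValueError("need at least one listener verdict")
    return sum(verdicts) / len(verdicts)

# Toy usage: one boolean per (clip, listener) trial from a Turing-style
# test -- True means the listener judged the clip "human".
tts_verdicts   = [True, True, False, True, False, False, True, False]
human_verdicts = [True, True, True, False, True, True, False, True]

print(f"TTS HFR:   {human_fooling_rate(tts_verdicts):.2f}")    # 0.50
print(f"Human HFR: {human_fooling_rate(human_verdicts):.2f}")  # 0.75
```

As the results below suggest, the informative quantity is the gap between a model's HFR and the HFR of genuine human recordings on the same dataset, not the raw rate in isolation.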
Contribution/Results: Our analysis reveals that leading commercial TTS models achieve HFRs approaching those of human speakers (~45%) in zero-shot settings, whereas most open-source models lag substantially. Fine-tuning improves HFR but does not fully close the performance gap. This work pioneers the systematic integration of HFR into TTS evaluation, advancing toward more ecologically valid, interaction-aware assessment frameworks for speech synthesis.
📝 Abstract
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce the Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, since evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human-level deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.