Why Speech Deepfake Detectors Won't Generalize: The Limits of Detection in an Open World

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deepfake speech detection systems exhibit poor generalization in open-world settings, failing to adapt to variation in recording devices, sampling rates, codecs, acoustic environments, and attack families, which leaves detection coverage incomplete and creates critical security vulnerabilities. To characterize this limitation, we introduce the concept of *coverage debt*, which models how detection blind spots expand as deployment scenarios evolve. We design a cross-test evaluation framework that groups test data by authentic-speech domain and deepfake release year, revealing that modern synthesizers eliminate classical artifacts and that dialogue-heavy scenarios (e.g., remote meetings, social media) suffer the worst robustness degradation. Crucially, we demonstrate that mean performance metrics obscure severe worst-case failures. Our core contribution is a paradigm shift: rejecting reliance on monolithic detectors in favor of a layered defense strategy integrating provenance verification and identity-based credentialing, providing both theoretical foundations and actionable guidelines for high-assurance applications.

📝 Abstract
Speech deepfake detectors are often evaluated on clean, benchmark-style conditions, but deployment occurs in an open world of shifting devices, sampling rates, codecs, environments, and attack families. This creates a "coverage debt" for AI-based detectors: every new condition multiplies with existing ones, producing data blind spots that grow faster than data can be collected. Because attackers can target these uncovered regions, worst-case performance (not average benchmark scores) determines security. To demonstrate the impact of the coverage debt problem, we analyze results from a recent cross-testing framework. Grouping performance by bona fide domain and spoof release year, two patterns emerge: newer synthesizers erase the legacy artifacts detectors rely on, and conversational speech domains (teleconferencing, interviews, social media) are consistently the hardest to secure. These findings show that detection alone should not be relied upon for high-stakes decisions. Detectors should be treated as auxiliary signals within layered defenses that include provenance, personhood credentials, and policy safeguards.
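The multiplicative growth behind coverage debt can be made concrete with a short sketch. All axis names and counts below are hypothetical illustrations, not values from the paper; the point is only that the condition space is the Cartesian product of the deployment axes, so each new axis multiplies the number of cells a training set would need to cover.

```python
from itertools import product

# Hypothetical deployment axes; each new axis multiplies the condition space.
axes = {
    "device":  ["studio_mic", "phone", "laptop"],
    "rate_hz": [8000, 16000, 44100],
    "codec":   ["pcm", "opus", "amr"],
    "room":    ["anechoic", "office", "street"],
    "attack":  ["tts_2019", "vc_2022", "llm_tts_2024"],
}

# Every combination of axis values is one deployment condition.
conditions = list(product(*axes.values()))
print(len(conditions))  # 3 * 3 * 3 * 3 * 3 = 243 condition cells

# If collected training data covers only some cells, the remainder
# are blind spots an adaptive attacker can aim for.
covered_cells = 60  # illustrative figure
coverage_debt = len(conditions) - covered_cells
print(coverage_debt)  # 183 uncovered cells
```

Adding a sixth axis with three values would triple the space to 729 cells while the covered set stays roughly fixed, which is the sense in which blind spots "grow faster than data can be collected."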
Problem

Research questions and friction points this paper is trying to address.

Speech deepfake detectors fail to generalize across diverse real-world conditions
Newer speech synthesizers eliminate legacy artifacts that detectors depend on
Conversational speech domains remain consistently vulnerable to spoofing attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing cross-testing framework results for coverage gaps
Identifying newer synthesizers that erase legacy detection artifacts
Proposing layered defenses beyond standalone detection systems
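The cross-testing analysis above groups detector error rates by bona fide domain and spoof release year, then contrasts the mean with the worst cell. A minimal sketch of that bookkeeping, using entirely made-up EER figures (the domains, years, and numbers are placeholders, not results from the paper):

```python
from statistics import mean

# Hypothetical equal error rates (EER, %) keyed by
# (bona fide domain, spoof release year); all figures are illustrative.
eer_by_cell = {
    ("studio",   2019):  2.1, ("studio",   2024):  9.5,
    ("teleconf", 2019):  6.0, ("teleconf", 2024): 31.2,
    ("social",   2019):  7.4, ("social",   2024): 28.8,
}

# Mean performance looks tolerable...
avg_eer = mean(eer_by_cell.values())

# ...but security is set by the worst cell an attacker can choose.
worst_cell, worst_eer = max(eer_by_cell.items(), key=lambda kv: kv[1])

print(f"mean EER {avg_eer:.1f}% vs worst-case {worst_eer:.1f}% on {worst_cell}")
```

Reporting only `avg_eer` hides the conversational-domain, newer-synthesizer cells where the detector effectively fails, which is why the paper argues for worst-case metrics and layered defenses.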