On Deepfake Voice Detection - It's All in the Presentation

📅 2025-09-30
🤖 AI Summary
Deepfake speech detection exhibits poor generalization in real-world communication scenarios—particularly over telephone channels—due to domain mismatch between training data and operational conditions. Method: We propose a channel-aware paradigm for data construction and detection, prioritizing data authenticity over model scale. Integrating generative AI analysis, multi-type communication channel simulation (e.g., PSTN, VoIP), and rigorous robustness evaluation, we curate the first benchmark dataset and laboratory testbed that faithfully reflect real telephony environments. Contribution/Results: Our approach improves detection accuracy by 39% under realistic channel-corrupted settings and by 57% on real-world benchmarks. Empirical results demonstrate that enhancing data quality yields substantially greater performance gains than scaling model capacity—validating data-centric design as a reproducible, generalizable foundation for deepfake speech defense.

📝 Abstract
While the technologies empowering malicious audio deepfakes have evolved dramatically in recent years thanks to advances in generative AI, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g. by phone. We propose a new framework for data creation and research methodology that enables the development of spoofing countermeasures that are more effective in real-world scenarios. By following the guidelines outlined here, we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improvements in datasets have a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more valuable for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.
Problem

Research questions and friction points this paper is trying to address.

Deepfake voice detection fails over real-world communication channels
Current datasets lack realistic presentation through telephone channels
Investment in better data collection matters more than larger AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes a new framework for data creation methodology
Focuses on communication channel presentation differences
Improves detection accuracy in realistic scenarios
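The channel-aware data construction the summary describes (e.g. PSTN simulation) can be illustrated with a minimal sketch. This is our own illustration, not the authors' code: the function name `simulate_pstn_channel` and the specific filter design are assumptions. It approximates telephone presentation by downsampling audio to the 8 kHz narrowband rate and band-limiting it to the classic 300–3400 Hz telephone passband.

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

def simulate_pstn_channel(audio, sr=16000, target_sr=8000):
    """Crude PSTN-style presentation: downsample to narrowband
    telephony rate and band-limit to the 300-3400 Hz voice band."""
    # Downsample from the original rate to 8 kHz narrowband audio
    narrow = resample_poly(audio, target_sr, sr)
    # 4th-order Butterworth bandpass covering the telephone passband
    sos = butter(4, [300, 3400], btype="bandpass", fs=target_sr, output="sos")
    return sosfilt(sos, narrow)

# Example: pass a 1-second 440 Hz tone (inside the passband) through the channel
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
presented = simulate_pstn_channel(clean, sr)
print(presented.shape)  # half the samples after downsampling to 8 kHz
```

A real testbed of the kind the paper advocates would additionally model codecs, packet loss, and acquisition devices; this sketch only captures the band-limiting and resampling that distinguish raw audio from channel-presented audio.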
Héctor Delgado
Research Scientist, Microsoft
audio deepfake detection, presentation attack detection, deepfake detection, voice biometrics

Giorgio Ramondetti
Microsoft

Emanuele Dalmasso
Microsoft

Gennady Karvitsky
Microsoft

Daniele Colibro
Microsoft

Haydar Talib
Microsoft