DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reproducibility and comparability challenges in current speech deepfake detection research, which stem from the absence of standardized evaluation protocols. To this end, we present a unified, modular, and extensible open-source PyTorch toolkit that integrates diverse state-of-the-art model architectures, loss functions, data augmentation strategies, and pretrained front-end feature extractors, enabling flexible configuration and large-scale systematic evaluation. Through comprehensive benchmarking of over 400 training configurations, our study systematically demonstrates—for the first time—the dominant influence of pretrained front ends on detection performance and reveals significant biases in high-performing models across audio quality, gender, and language dimensions. Furthermore, we show that high-quality training data enhances cross-domain generalization and provide tools for equitable data selection and front-end fine-tuning to jointly improve fairness and robustness in deepfake detection systems.
📝 Abstract
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

deepfake audio detection
reproducibility
evaluation protocol
model bias
cross-domain generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

deepfake audio detection
modular framework
cross-domain generalization
bias analysis
feature extractor
🔎 Similar Papers
No similar papers found.
Yassine El Kheir
Yassine El Kheir
PhD Researcher, German Research Center for Artificial Intelligence (DFKI) & TU Berlin
Speech Deepfake DetectionSelf-Supervised LearningPronunciation Assessment
A
Arnab Das
German Research Center for Artificial Intelligence (DFKI), Germany
Y
Yixuan Xiao
University of Stuttgart, Germany
Xin Wang
Xin Wang
National Institute of Informatics
Speech synthesisSpeech SecurityAudio signal ProcessingMachine learning
F
Feidi Kallel
German Research Center for Artificial Intelligence (DFKI), Germany
E
Enes Erdem Erdogan
German Research Center for Artificial Intelligence (DFKI), Germany
N
Ngoc Thang Vu
University of Stuttgart, Germany
Tim Polzehl
Tim Polzehl
German Research Center for Artificial Intelligence
Speech and Language technology
S
Sebastian Moeller
Technical University of Berlin, Germany