XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark

📅 2025-05-31
🤖 AI Summary
Current audio deepfake detectors achieve near-perfect accuracy under in-domain evaluation but generalize poorly across languages, speakers, and generative models, which limits their usefulness in real-world deployment. To address this, we introduce XMAD-Bench, a large-scale cross-domain multilingual benchmark for audio deepfake detection, comprising 668.8 hours of speech spanning Chinese, English, Japanese, Korean, French, and Spanish. The benchmark partitions data so that speakers, generative models, and real audio sources are disjoint between the training and test splits. Experiments show that state-of-the-art detectors reach in-domain accuracy as high as 100%, yet their cross-domain accuracy sometimes falls to near chance level, exposing a critical generalization bottleneck. XMAD-Bench is publicly released as a standardized, challenging evaluation platform for research on robust, generalizable audio deepfake detection.

📝 Abstract
Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples in the training and test sets are produced by the same generative models. To enable evaluation beyond this setup, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for robust audio deepfake detectors that maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.

Problem

Research questions and friction points this paper is trying to address.

Evaluating audio deepfake detectors in cross-domain multilingual scenarios
Addressing performance disparity between in-domain and cross-domain detection
Developing robust detectors for diverse languages, speakers, and generative methods
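
The disparity described above can be quantified by scoring the same detector on an in-domain test set and on a cross-domain test set, then comparing the cross-domain score with a chance baseline. The following is a minimal sketch, not the paper's evaluation code: the function names, the binary label convention (1 = deepfake, 0 = real), and the toy predictions are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and equal in length")
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def chance_level(labels: Sequence[int]) -> float:
    """Accuracy of always predicting the majority class (assumed baseline)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def domain_gap(in_labels, in_preds, cross_labels, cross_preds) -> Dict[str, float]:
    """Compare in-domain and cross-domain accuracy for one detector."""
    in_acc = accuracy(in_labels, in_preds)
    cross_acc = accuracy(cross_labels, cross_preds)
    return {
        "in_domain": in_acc,
        "cross_domain": cross_acc,
        "gap": in_acc - cross_acc,
        "cross_chance": chance_level(cross_labels),
    }


# Toy data (hypothetical): a detector that is perfect in-domain but
# labels every cross-domain sample as a deepfake.
result = domain_gap(
    in_labels=[1, 0, 1, 0], in_preds=[1, 0, 1, 0],
    cross_labels=[1, 0, 1, 0], cross_preds=[1, 1, 1, 1],
)
print(result)
```

A cross-domain accuracy that sits at or near `cross_chance` indicates that the detector has learned little that transfers beyond its training domain.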

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain multilingual audio deepfake benchmark
Distinct speakers, methods, sources across splits
Publicly released large-scale dataset for evaluation
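
The requirement that speakers, generative methods, and real audio sources be distinct across splits can be checked mechanically. The sketch below assumes each sample is described by a metadata record with hypothetical `speaker`, `generator`, and `source` fields; the actual XMAD-Bench metadata layout may differ, and the toy records are invented for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical metadata fields; not taken from the XMAD-Bench release.
FIELDS = ("speaker", "generator", "source")

Record = Dict[str, Optional[str]]


def overlap_report(train: List[Record], test: List[Record]) -> Dict[str, Set[str]]:
    """For each field, return the values that appear in both splits."""
    report = {}
    for field in FIELDS:
        # Skip missing values, e.g. real recordings have no generator.
        train_vals = {r[field] for r in train if r.get(field) is not None}
        test_vals = {r[field] for r in test if r.get(field) is not None}
        report[field] = train_vals & test_vals
    return report


def is_cross_domain(train: List[Record], test: List[Record]) -> bool:
    """True when no speaker, generator, or source is shared between splits."""
    return all(not shared for shared in overlap_report(train, test).values())


train_rows = [
    {"speaker": "spk_a", "generator": "tts_1", "source": "corpus_x"},
    {"speaker": "spk_b", "generator": None, "source": "corpus_x"},  # real sample
]
test_rows = [{"speaker": "spk_c", "generator": "tts_2", "source": "corpus_y"}]
leaky_rows = [{"speaker": "spk_a", "generator": "tts_2", "source": "corpus_y"}]

print(is_cross_domain(train_rows, test_rows))   # disjoint splits
print(overlap_report(train_rows, leaky_rows))   # shows the shared speaker
```

Reporting the overlapping values, rather than only a pass/fail flag, makes it easier to see which of the three axes leaks between splits.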

Authors

Ioan-Paul Ciobanu
Unknown affiliation
A. Hiji
Department of Computer Science, University of Bucharest, Bucharest, Romania
Nicolae-Cătălin Ristea
Department of Computer Science, University of Bucharest, Bucharest, Romania
Paul Irofti
Associate Professor, University of Bucharest
Cristian Rusu
Department of Computer Science, University of Bucharest, Bucharest, Romania
R. Ionescu
Department of Computer Science, University of Bucharest, Bucharest, Romania