🤖 AI Summary
This study reveals a significant language dependency in audio DeepFake detection models: state-of-the-art English-pretrained models generalize poorly to non-English languages.
Method: To address this, we introduce the first multilingual benchmark for audio DeepFake detection, systematically evaluating cross-lingual transfer performance across 12 languages. Building upon mainstream detection architectures, we compare cross-lingual adaptation strategies—including fine-tuning, feature alignment, and data augmentation—while rigorously controlling target-language data volume to quantify its impact.
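To make the adaptation setup concrete, below is a minimal PyTorch sketch of the lowest-cost strategy in this family: freezing an English-pretrained backbone and fine-tuning only a new classification head on a small target-language set. The architecture, feature dimensions, data, and hyperparameters are illustrative stand-ins, not the paper's configuration.

```python
# Sketch of low-resource target-language adaptation (frozen backbone + new head).
# The backbone and dataset here are hypothetical stand-ins for an
# English-pretrained encoder and a small labelled target-language corpus.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for an English-pretrained encoder (e.g. a wav2vec2-style model);
# modeled here as a frozen projection over precomputed 1024-dim features.
backbone = nn.Sequential(nn.Linear(1024, 256), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False  # keep the English-pretrained weights fixed

head = nn.Linear(256, 2)  # bonafide vs. spoof classifier, trained from scratch

# Hypothetical "minimal target-language data" budget: 64 labelled utterances,
# represented by random feature vectors for the sake of a runnable example.
feats = torch.randn(64, 1024)
labels = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(feats, labels), batch_size=16, shuffle=True)

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:
        opt.zero_grad()
        logits = head(backbone(x))  # gradients flow only into the head
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```

Full fine-tuning or feature alignment would instead unfreeze (part of) the backbone or add an alignment loss; the frozen-head variant is shown because it is the cheapest point on the data/compute trade-off being studied.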
Contribution/Results: Experiments demonstrate that (1) detection accuracy varies substantially across languages; (2) even minimal target-language data yields substantial accuracy gains; and (3) language-aware modeling is critical for multilingual robustness. This work provides the first empirical evidence of fundamental language bias in audio DeepFake detection, establishing both theoretical foundations and practical pathways toward detection systems that are robust across languages.
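Detection efficacy in this field is conventionally reported as the equal error rate (EER), the operating point where false-acceptance and false-rejection rates coincide, and a per-language EER breakdown is the natural way to quantify result (1). The sketch below computes EER from detector scores; the language codes and score distributions are synthetic placeholders, not the paper's results.

```python
# Per-language EER computation; scores are synthetic, for illustration only.
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """Equal error rate: point where false-acceptance rate (fpr)
    equals false-rejection rate (fnr) on the ROC curve."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
# Hypothetical languages; a larger class-separation shift simulates a
# language the detector handles better.
for lang, shift in [("en", 2.0), ("de", 1.0), ("pl", 0.5)]:
    labels = rng.integers(0, 2, 1000)             # 0 = bonafide, 1 = spoof
    scores = rng.normal(0, 1, 1000) + shift * labels
    print(f"{lang}: EER = {eer(labels, scores):.3f}")
```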
📝 Abstract
Since the majority of audio DeepFake (DF) detection methods are trained on English-centric datasets, their applicability to non-English languages remains largely unexplored. In this work, we present a benchmark for multilingual audio DF detection, evaluating various adaptation strategies. Our experiments cover models trained on English benchmark datasets, as well as intra-linguistic (same-language) and cross-linguistic adaptation approaches. Our results indicate considerable variation in detection efficacy across languages, highlighting the difficulty of multilingual settings. We show that limiting training data to English degrades detection performance, underscoring the importance of data in the target language.