🤖 AI Summary
Existing deepfake detection research predominantly focuses on monolingual speech, rendering it inadequate for Arabic–English code-switching (CS)—a pervasive phenomenon in Arabic digital communication. To address this gap, we introduce ArEnAV, the first large-scale multimodal Arabic–English audiovisual deepfake dataset, comprising 387,000 videos (765+ hours), covering intra-sentential CS, dialectal variation, and monolingual Arabic samples. We propose a novel cross-lingual generation pipeline integrating four text-to-speech (TTS) systems and two lip-sync models to enable controllable, multilingual deepfake synthesis. Comprehensive evaluation of state-of-the-art detectors on ArEnAV reveals significant performance degradation under CS conditions; human evaluation further confirms CS as a critical factor undermining detector robustness. ArEnAV is publicly released on Hugging Face to foster multimodal deepfake detection research.
📝 Abstract
Deepfake generation methods are evolving rapidly, making fake media harder to detect and raising serious societal concerns. Most research on deepfake detection and dataset creation focuses on monolingual content and overlooks the challenges of multilingual and code-switched speech, in which multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses additional challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k real and fake videos totaling over 765 hours. Our dataset is generated using a novel pipeline integrating four text-to-speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset is available at https://huggingface.co/datasets/kartik060702/ArEnAV-Full.