Tell me Habibi, is it Real or Fake?

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing deepfake detection research predominantly focuses on monolingual speech, rendering it inadequate for Arabic–English code-switching (CS)—a pervasive phenomenon in Arabic digital communication. To address this gap, we introduce ArEnAV, the first large-scale multimodal Arabic–English audiovisual deepfake dataset, comprising 387,000 videos (765+ hours), covering intra-sentential CS, dialectal variation, and monolingual Arabic samples. We propose a novel cross-lingual generation pipeline integrating four text-to-speech (TTS) systems and two lip-sync models to enable controllable, multilingual deepfake synthesis. Comprehensive evaluation of state-of-the-art detectors on ArEnAV reveals significant performance degradation under CS conditions; human evaluation further confirms CS as a critical factor undermining detector robustness. ArEnAV is publicly released on Hugging Face to foster multimodal deepfake detection research.

📝 Abstract
Deepfake generation methods are evolving fast, making fake media harder to detect and raising serious societal concerns. Most deepfake detection and dataset creation research focuses on monolingual content, often overlooking the challenges of multilingual and code-switched speech, where multiple languages are mixed within the same discourse. Code-switching, especially between Arabic and English, is common in the Arab world and is widely used in digital communication. This linguistic mixing poses extra challenges for deepfake detection, as it can confuse models trained mostly on monolingual data. To address this, we introduce ArEnAV, the first large-scale Arabic-English audio-visual deepfake dataset featuring intra-utterance code-switching, dialectal variation, and monolingual Arabic content. It contains 387k videos and over 765 hours of real and fake footage. Our dataset is generated using a novel pipeline integrating four text-to-speech and two lip-sync models, enabling comprehensive analysis of multilingual multimodal deepfake detection. We benchmark our dataset against existing monolingual and multilingual datasets, state-of-the-art deepfake detection models, and a human evaluation, highlighting its potential to advance deepfake research. The dataset can be accessed at https://huggingface.co/datasets/kartik060702/ArEnAV-Full.

Problem

Research questions and friction points this paper is trying to address.

Detecting deepfakes in multilingual and code-switched speech
Addressing gaps in Arabic-English audio-visual deepfake datasets
Improving deepfake detection for dialectal and code-switching content

Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale Arabic-English audio-visual deepfake dataset
Integrates four TTS and two lip-sync models
Focuses on multilingual and code-switched speech
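
The summary reports that the generation pipeline pairs four TTS systems with two lip-sync models, yielding multiple fake variants per source clip. A minimal sketch of how such pairings could be enumerated is below; the component names are hypothetical placeholders, since this summary does not identify the specific tools used in the paper.

```python
from itertools import product

# Hypothetical placeholder labels: the paper integrates four TTS systems
# and two lip-sync models, but their names are not given in this summary.
TTS_SYSTEMS = ["tts_1", "tts_2", "tts_3", "tts_4"]
LIPSYNC_MODELS = ["lipsync_1", "lipsync_2"]

def fake_variants(video_id, transcript):
    """Enumerate every TTS x lip-sync pairing for one source clip."""
    for tts, lipsync in product(TTS_SYSTEMS, LIPSYNC_MODELS):
        yield {
            "source": video_id,        # real clip being manipulated
            "transcript": transcript,  # code-switched target utterance
            "tts": tts,                # system that synthesizes the audio
            "lipsync": lipsync,        # model that re-renders the lips
        }

variants = list(fake_variants("vid_0001", "yalla, let's start the demo"))
print(len(variants))  # 8 combinations per source clip (4 TTS x 2 lip-sync)
```

This kind of factorial pairing is one plausible way a 4 × 2 component pool could produce the controllable, multilingual synthesis the summary describes; the actual sampling strategy in the paper may differ.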