ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Research on spoof-speech detection for Arabic and its dialects lags significantly, hindered by the absence of high-quality benchmark datasets and systematic evaluation protocols. Method: We introduce ArabSpoof-MS, the first multi-regional Arabic spoof-speech dataset, and use human MOS ratings and ASR-based WER analysis to assess the naturalness and detectability of speech from state-of-the-art TTS systems (e.g., FishSpeech). We evaluate detection performance using classical classifiers on MFCC features, embedding-based classifiers, and RawNet2. Contribution/Results: Experiments indicate that FishSpeech produces the most natural-sounding synthetic speech; however, building the dataset from a single TTS model may limit how well detectors generalize, which motivates combining audio from multiple TTS models during dataset construction. This work addresses a gap in Arabic speech security research and provides a multi-dialect benchmark and baselines for spoof-speech detection.

📝 Abstract
With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, and thereby to guide the construction of our final dataset (either by merging audio from multiple models or by selecting the best-performing model), we ran an evaluation pipeline that trained classifiers using three approaches: modern embedding-based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score (MOS) based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition (ASR) model to measure the Word Error Rate (WER). Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.
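The Word Error Rate mentioned in the abstract is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and a real Arabic pipeline would likely normalize text (diacritics, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
print(word_error_rate("a b c d", "a x c"))
```

Comparing the WER on original recordings with the WER on their synthesized counterparts gives a rough, automatic signal of how intelligible each TTS system's output is.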
Problem

Research questions and friction points this paper is trying to address.

Detecting synthetic Arabic speech across multiple dialects
Addressing the limited spoof detection research for the Arabic language
Evaluating TTS models for realistic Arabic voice cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dialect Arabic spoofed speech dataset creation
Evaluation pipeline combining embedding-based and classical methods
Integrated human ratings and ASR metrics for assessment
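As an illustration of the human-rating side of the assessment, a Mean Opinion Score is typically the arithmetic mean of listener ratings, often reported with a confidence interval. The sketch below assumes the common 1-to-5 rating scale and a normal-approximation 95% interval; neither detail is stated in the source, and the function name `mean_opinion_score` is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumptions: ratings lie on a 1-5 scale, and the interval uses the
    normal approximation (1.96 standard errors).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings are assumed to lie between 1 and 5")

    mos = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * standard_error


# Hypothetical ratings for one TTS system, for illustration only.
mos, half_width = mean_opinion_score([4, 5, 3, 4, 4])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Computing this per TTS system lets the naturalness ranking be read alongside the automatic WER figures.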
Mohamed Maged
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Alhassan Ehab
Queen’s University, Canada
Ali Mekky
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Besher Hassan
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Shady Shehata
University of Waterloo
Artificial Intelligence, Natural Language Processing