HQ-MPSD: A Multilingual Artifact-Controlled Benchmark for Partial Deepfake Speech Detection

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing deepfake speech detection methods are limited by their datasets: most rely on outdated synthesis techniques that introduce unrealistic artifacts, undermining realism and generalizability. Method: We introduce HQ-MPSD, the first high-quality, multilingual, partially spoofed speech detection benchmark, designed specifically for short-duration local manipulations. It uses a fine-grained, forced-alignment-based splicing method that preserves semantic and prosodic continuity to avoid synthetic boundary artifacts, and it incorporates real background noise across eight languages, 550 speakers, and 350.8 hours of speech. Contribution/Results: HQ-MPSD is the first partial-deepfake speech benchmark to combine multilinguality, high naturalness, and minimal artificial artifacts. Experiments show state-of-the-art detectors suffer over 80% performance degradation on HQ-MPSD. Subjective MOS scores exceed 4.2, and spectrogram analysis reveals no discernible low-level artifacts, significantly increasing the realism and generalization challenge for detection models.

📝 Abstract
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Problem

Research questions and friction points this paper is trying to address.

Detecting partial deepfake speech when manipulations are confined to short regions
Overcoming dataset limitations caused by outdated synthesis systems and their artifacts
Providing a multilingual, realistic benchmark that exposes generalization challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses linguistically coherent splice points from forced alignment
Adds background effects to simulate real-world acoustic conditions
Benchmarks models with cross-language and cross-dataset evaluations
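The splicing idea above can be sketched in code. The following is a minimal illustration only, not the authors' pipeline: the function name, its signature, and the `(start_s, end_s)` boundary format are assumptions, and HQ-MPSD additionally constrains splice points to be semantically and prosodically coherent rather than splicing an arbitrary word.

```python
import numpy as np

def splice_at_boundaries(real, fake, boundaries, word_idx, sr=16000, fade_ms=10):
    """Replace one word-aligned region of `real` with the corresponding
    region of `fake`, crossfading at the splice points to avoid clicks.

    `boundaries` is a list of (start_s, end_s) word intervals in seconds,
    as produced by a forced aligner (hypothetical format for this sketch).
    """
    start_s, end_s = boundaries[word_idx]
    s, e = int(start_s * sr), int(end_s * sr)
    n_fade = int(sr * fade_ms / 1000)

    out = real.copy()
    out[s:e] = fake[s:e]  # insert the manipulated word

    # Linear crossfade at each boundary so the transition is continuous.
    ramp = np.linspace(0.0, 1.0, n_fade)
    if s >= n_fade:
        out[s - n_fade:s] = real[s - n_fade:s] * (1 - ramp) + fake[s - n_fade:s] * ramp
    if e + n_fade <= len(real):
        out[e:e + n_fade] = fake[e:e + n_fade] * (1 - ramp) + real[e:e + n_fade] * ramp
    return out
```

In practice the boundaries would come from a forced aligner (e.g., a TextGrid of word intervals), and the dataset's construction further selects splice points that keep the sentence linguistically coherent before adding real background noise.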
Authors

Menglu Li
Toronto Metropolitan University
Audio Processing · Deep Learning

Majd Alber
Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada

Ramtin Asgarianamiri
Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada

Lian Zhao
Toronto Metropolitan University
Resource Management · IoV/IoT Networks · Mobile Edge Computing

Xiao-Ping Zhang
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University