HQ-MPSD: A Multilingual Artifact-Controlled Benchmark for Partial Deepfake Speech Detection

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing deepfake speech detection methods are limited by their datasets: most rely on outdated synthesis techniques that introduce unrealistic artifacts, undermining realism and generalizability. Method: We introduce HQ-MPSD, the first high-quality, multilingual, partially spoofed speech detection benchmark, designed specifically for short-duration local manipulations. It uses a fine-grained, forced-alignment-based splicing method that preserves semantic and prosodic continuity to avoid synthetic boundary artifacts, and it incorporates real background noise across eight languages, 550 speakers, and 350.8 hours of speech. Contribution/Results: HQ-MPSD is the first partial-deepfake speech benchmark to combine multilinguality, high naturalness, and minimal artificial artifacts. Experiments show state-of-the-art detectors suffer over 80% performance degradation on HQ-MPSD. Subjective MOS scores exceed 4.2, and spectrogram analysis reveals no discernible low-level artifacts, significantly increasing the realism and generalization challenge for detection models.

📝 Abstract
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Problem

Research questions and friction points this paper is trying to address.

Detecting partial deepfake speech when manipulations are confined to short regions
Overcoming dataset limitations caused by outdated synthesis systems and their artifacts
Providing a multilingual, realistic benchmark that exposes generalization challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses linguistically coherent splice points from forced alignment
Adds background effects to simulate real-world acoustic conditions
Benchmarks models with cross-language and cross-dataset evaluations
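The splicing idea above can be sketched in code. The following is a minimal illustration only, not the authors' pipeline: the function name, its signature, and the `(start_s, end_s)` boundary format are assumptions, and HQ-MPSD additionally constrains splice points to be semantically and prosodically coherent rather than splicing an arbitrary word.

```python
import numpy as np

def splice_at_boundaries(real, fake, boundaries, word_idx, sr=16000, fade_ms=10):
    """Replace one word-aligned region of `real` with the corresponding
    region of `fake`, crossfading at the splice points to avoid clicks.

    `boundaries` is a list of (start_s, end_s) word intervals in seconds,
    as produced by a forced aligner (hypothetical format for this sketch).
    """
    start_s, end_s = boundaries[word_idx]
    s, e = int(start_s * sr), int(end_s * sr)
    n_fade = int(sr * fade_ms / 1000)

    out = real.copy()
    out[s:e] = fake[s:e]  # insert the manipulated word

    # Linear crossfade at each boundary so the transition is continuous.
    ramp = np.linspace(0.0, 1.0, n_fade)
    if s >= n_fade:
        out[s - n_fade:s] = real[s - n_fade:s] * (1 - ramp) + fake[s - n_fade:s] * ramp
    if e + n_fade <= len(real):
        out[e:e + n_fade] = fake[e:e + n_fade] * (1 - ramp) + real[e:e + n_fade] * ramp
    return out
```

In practice the boundaries would come from a forced aligner (e.g., a TextGrid of word intervals), and the dataset's construction further selects splice points that keep the sentence linguistically coherent before adding real background noise.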
Authors

Menglu Li
Toronto Metropolitan University
Audio Processing · Deep Learning

Majd Alber
Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada

Ramtin Asgarianamiri
Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada

Lian Zhao
Toronto Metropolitan University
Resource Management · IoV/IoT Networks · Mobile Edge Computing

Xiao-Ping Zhang
Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua Shenzhen International Graduate School, Tsinghua University