A Data-Centric Approach to Generalizable Speech Deepfake Detection

📅 2025-12-19
🤖 AI Summary
Speech deepfake detection (SDD) generalizes poorly to forgery methods not seen during training. This paper argues that data composition, which prior work has largely left underexplored, is a key factor behind this gap. It proposes the Diversity-Optimized Sampling Strategy (DOSS), a data-centric framework for mixing heterogeneous datasets with two implementations: DOSS-Select (pruning-based sample selection) and DOSS-Weight (re-weighting). A large-scale empirical study characterizes data scaling laws for SDD and quantifies the impact of source and generator diversity. DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. The final model, trained on a 12k-hour curated data pool with DOSS-Weight, achieves state-of-the-art performance on public benchmarks and on a new challenge set built from commercial APIs, with greater data and model efficiency than large-scale baselines.

📝 Abstract
Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.
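The abstract names two ways of mixing heterogeneous data, pruning and re-weighting, without giving their details. The sketch below is not the paper's DOSS algorithm. It only illustrates the general difference between the two ideas, using a generic "equal mass per group" heuristic. The function names `group_weights` and `select_subset`, the `generator` field, and the toy pool are all hypothetical.

```python
import random
from collections import Counter

def group_weights(samples, key):
    """Re-weighting idea: give every group (e.g. a generator) equal total
    sampling mass, so large groups do not dominate. Hypothetical heuristic."""
    counts = Counter(key(s) for s in samples)
    n_groups = len(counts)
    return [1.0 / (n_groups * counts[key(s)]) for s in samples]

def select_subset(samples, key, budget, seed=0):
    """Pruning idea: keep about `budget` samples, split evenly across groups.
    Hypothetical heuristic, not the selection rule used in the paper."""
    rng = random.Random(seed)
    groups = {}
    for s in samples:
        groups.setdefault(key(s), []).append(s)
    per_group = max(1, budget // len(groups))
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, min(per_group, len(members))))
    return kept

# Toy pool: one over-represented generator and one rare generator.
pool = [{"id": i, "generator": "tts_a"} for i in range(90)]
pool += [{"id": 90 + i, "generator": "vc_b"} for i in range(10)]

by_generator = lambda s: s["generator"]
weights = group_weights(pool, by_generator)
subset = select_subset(pool, by_generator, budget=20)

# Draw a training batch with the diversity-balanced weights.
batch = random.Random(0).choices(pool, weights=weights, k=8)
```

In this toy setting the pruned subset holds ten samples per generator, while the weighted sampler keeps all 100 samples but draws the rare generator as often as the common one. Which grouping key and which weighting rule work best for SDD is exactly what the paper studies empirically.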
Problem

Research questions and friction points this paper is trying to address.

Speech deepfake detectors generalize poorly to unseen forgery methods
The impact of data composition on detection models is underexplored
It is unclear how best to construct a single dataset or aggregate multiple heterogeneous datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric approach for speech deepfake detection
Diversity-Optimized Sampling Strategy for mixing data
Large-scale empirical study on data scaling laws
Wen Huang
Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University, China; LunaLabs, China
Yuchen Mao
Zhejiang University
Theoretical Computer Science
Yanmin Qian
Professor, Shanghai Jiao Tong University
Speech and Language Processing, Signal Processing, Machine Learning