🤖 AI Summary
Existing SDD and SASV datasets suffer from two major limitations: (1) limited acoustic diversity, since TTS training has traditionally demanded clean, studio-quality recordings rather than speech with realistic noise and environmental variability, and (2) insufficient speaker diversity for training speaker verification models. To address these, the authors introduce SpoofCeleb, a benchmark dataset for both Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV) built from real-world source data. A fully automated pipeline transforms VoxCeleb1 into a form suitable for TTS training, and 23 contemporary TTS systems are then trained on it to generate spoofed speech. The resulting dataset comprises over 2.5 million utterances from 1,251 speakers recorded under natural, real-world conditions, together with carefully partitioned training, validation, and evaluation sets, well-controlled experimental protocols, and open baseline results for both SDD and SASV.
📝 Abstract
This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Training robust recognition systems requires speech data recorded in varied acoustic environments with different levels of noise. However, current datasets typically include clean, high-quality recordings (bona fide data) due to the requirements of TTS training; studio-quality or well-recorded read speech is typically necessary to train TTS models. Current SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset, transforming it into a form suitable for TTS training. We subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
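The baseline results mentioned above are typically reported with the equal error rate (EER), the standard metric for both SDD and SASV evaluation: the point where the false-acceptance rate (spoofed or impostor trials accepted) equals the false-rejection rate (bona fide or target trials rejected). The sketch below is a minimal, self-contained illustration of that metric; the score values and function name are illustrative assumptions, not SpoofCeleb's actual protocol format or official scoring tool.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return the equal error rate for detection scores where higher
    means "more likely bona fide".

    Illustrative sketch only: sweeps every observed score as a threshold
    and returns the operating point where the false-acceptance rate (FAR,
    spoofed trials scored at or above the threshold) is closest to the
    false-rejection rate (FRR, bona fide trials scored below it).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # FRR: fraction of bona fide trials wrongly rejected at this threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # FAR: fraction of spoofed trials wrongly accepted at this threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; when one of three spoofed trials scores above one of three bona fide trials, the EER is 1/3. Real evaluation toolkits interpolate between thresholds rather than picking the closest crossing, but the idea is the same.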