SpoofCeleb: Speech Deepfake Detection and SASV in the Wild

📅 2024-09-18
🏛️ IEEE Open Journal of Signal Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing SDD and SASV datasets suffer from two major limitations: (1) limited acoustic diversity, lacking realistic noise conditions and environmental variability, and (2) insufficient speaker coverage. To address these, the paper introduces SpoofCeleb, a benchmark dataset for both Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV) built from real-world source data. Starting from VoxCeleb1, a fully automated pipeline transforms the data into a form suitable for TTS training and uses it to train 23 contemporary TTS systems, whose outputs serve as spoofing attacks. The resulting dataset comprises over 2.5 million utterances from 1,251 speakers, collected under natural, noisy, real-world conditions. SpoofCeleb includes standardized training, validation, and evaluation splits, well-controlled experimental protocols, and baseline results for both SDD and SASV, establishing a reproducible benchmark with substantially improved environmental robustness and speaker diversity.

📝 Abstract
This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems trained on that same real-world data. Training robust recognition systems requires speech recorded in varied acoustic environments with different noise levels. However, current datasets typically contain only clean, high-quality bona fide recordings, because TTS training generally demands studio-quality or well-recorded read speech. Current SDD datasets are also of limited use for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline we developed that processes the VoxCeleb1 dataset into a form suitable for TTS training; we subsequently train 23 contemporary TTS systems. SpoofCeleb comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. The dataset includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We present baseline results for both the SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
Problem

Research questions and friction points this paper is trying to address.

Develops dataset for detecting speech deepfakes in real-world conditions
Addresses spoofing-robust speaker verification with diverse acoustic environments
Provides automated pipeline to transform data for TTS training
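The last point, turning in-the-wild recordings into TTS-ready training data, can be illustrated with a toy sketch. The function names, the energy-based trimming, and the duration thresholds below are hypothetical illustrations for intuition only, not the paper's actual pipeline:

```python
import numpy as np

def energy_vad_trim(wav, sr, frame_ms=25, threshold_db=-40.0):
    """Trim leading/trailing low-energy frames (toy energy-based VAD)."""
    frame = int(sr * frame_ms / 1000)
    n = len(wav) // frame
    frames = wav[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    db = 20 * np.log10(rms + 1e-12)          # absolute level in dBFS
    keep = np.where(db > threshold_db)[0]
    if keep.size == 0:
        return wav[:0]                        # nothing above threshold
    return wav[keep[0] * frame : (keep[-1] + 1) * frame]

def select_for_tts(utts, sr=16000, min_s=2.0, max_s=15.0):
    """Keep utterances whose trimmed duration fits a TTS-friendly range."""
    out = []
    for uid, wav in utts:
        trimmed = energy_vad_trim(wav, sr)
        dur = len(trimmed) / sr
        if min_s <= dur <= max_s:
            out.append((uid, trimmed))
    return out
```

A real pipeline of this kind would additionally need transcription and quality filtering before TTS training; those steps are omitted here.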
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline processes VoxCeleb1 dataset
Trains 23 contemporary TTS systems
Includes 2.5M utterances from 1,251 speakers
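SDD baselines on datasets like this are conventionally scored with the equal error rate (EER) between bona fide and spoofed trial scores. The following is a generic sketch of that metric, not the paper's evaluation code:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: operating point where the bona fide rejection
    rate (FRR) equals the spoof acceptance rate (FAR)."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)                # sweep thresholds ascending
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()    # bona fide rejected so far
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()  # spoof accepted
    idx = np.argmin(np.abs(frr - far))        # closest FRR/FAR crossing
    return float((frr[idx] + far[idx]) / 2)
```

Perfectly separated scores give an EER of 0; fully overlapping score distributions push it toward 0.5 (chance level).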