🤖 AI Summary
Existing audio deepfake detectors perform well on controlled benchmarks but exhibit severe robustness deficiencies in real-world scenarios. Method: We introduce P²V, the first practical-threat-oriented, high-fidelity detection benchmark. It uniquely integrates LLM-generated identity-consistent utterances, realistic environmental noise, adversarial perturbations, and state-of-the-art voice cloning techniques deployed between 2020 and 2025 to construct a comprehensive forgery dataset spanning these challenge dimensions. Contribution/Results: Experiments show that current SOTA detectors suffer an average 43% performance drop on P²V, revealing critical generalization gaps. In contrast, models trained on P²V achieve significantly enhanced robustness against complex attacks while preserving strong generalization on prior benchmarks, demonstrating P²V's pivotal role in advancing practically deployable deepfake detection.
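As a rough illustration of the corruption pipeline the summary describes, the sketch below mixes environmental noise into a cloned utterance at a target SNR and applies a simple bounded perturbation. All function names, parameter values, and the choice of Gaussian noise are illustrative assumptions, not P²V's actual construction procedure.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix environmental noise into an utterance at a target SNR (dB).
    Illustrative only: P^2V's real pipeline is not specified here.
    Both inputs are float waveforms at the same sample rate."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def simple_perturbation(audio, eps=0.002):
    """A 'simple' perturbation in the spirit of the reported attacks:
    low-amplitude Gaussian noise clipped to an eps-ball (eps is assumed)."""
    delta = np.clip(np.random.randn(*audio.shape) * eps, -eps, eps)
    return np.clip(audio + delta, -1.0, 1.0)
```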
📝 Abstract
Current audio deepfake detectors cannot be trusted. While they excel on controlled benchmarks, they fail when tested in the real world. We introduce Perturbed Public Voices (P$^{2}$V), an IRB-approved dataset capturing three critical aspects of malicious deepfakes: (1) identity-consistent transcripts via LLMs, (2) environmental and adversarial noise, and (3) state-of-the-art voice cloning (2020-2025). Experiments reveal alarming vulnerabilities of 22 recent audio deepfake detectors: models trained on current datasets suffer a 43% performance drop when tested on P$^{2}$V, with performance measured as the mean of the F1 score on deepfake audio, AUC, and 1-EER. Simple adversarial perturbations induce up to 16% performance degradation, while advanced cloning techniques reduce detectability by 20-30%. In contrast, P$^{2}$V-trained models maintain robustness against these attacks while generalizing to existing datasets, establishing a new benchmark for robust audio deepfake detection. P$^{2}$V will be publicly released upon acceptance at a conference or journal.
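For concreteness, here is a minimal sketch of the composite performance metric named in the abstract (mean of deepfake-class F1, AUC, and 1-EER), assuming scikit-learn and a 0.5 decision threshold for the F1 computation; the threshold is an assumption, as the abstract does not specify one.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

def composite_score(y_true, y_score, threshold=0.5):
    """Mean of (i) F1 on the deepfake class, (ii) ROC-AUC, and (iii) 1 - EER.
    y_true: 1 = deepfake, 0 = bona fide; y_score: predicted deepfake score.
    The 0.5 threshold for binarizing scores is an assumed default."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    f1 = f1_score(y_true, y_pred, pos_label=1)
    auc = roc_auc_score(y_true, y_score)
    # EER: operating point where false-positive rate equals false-negative rate.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]
    return (f1 + auc + (1 - eer)) / 3.0
```

Under this definition, a detector that is perfect on all three components scores 1.0, so the reported 43% drop corresponds to losing nearly half of the attainable composite score.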