🤖 AI Summary
This paper addresses the lack of robustness of audio deepfake detection models against unknown attacks by proposing DeePen, the first black-box penetration testing framework that requires no access to the target model. Methodologically, it employs lightweight time-domain signal transformations, including time-stretching, echo injection, and pitch shifting, to systematically probe vulnerabilities without knowledge of model architecture or parameters. Key contributions include: (1) establishing the first black-box penetration testing paradigm specifically for audio deepfake detectors; (2) empirically demonstrating that generic signal transformations transfer across models as adversarial attacks; and (3) identifying certain attacks that remain effective even after detectors are retrained with knowledge of the specific attack. Experiments show DeePen achieves up to 98.3% evasion success across multiple industrial and academic detection systems, confirming that simple, imperceptible perturbations can bypass state-of-the-art detectors. The complete toolkit, including attack implementations and a standardized benchmark, is publicly released.
📝 Abstract
Deepfakes - manipulated or forged audio and video media - pose significant security risks to individuals, organizations, and society at large. To address these challenges, machine learning-based classifiers are commonly employed to detect deepfake content. In this paper, we assess the robustness of such classifiers through a systematic penetration testing methodology, which we introduce as DeePen. Our approach operates without prior knowledge of or access to the target deepfake detection models. Instead, it leverages a set of carefully selected signal processing modifications - referred to as attacks - to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, demonstrating that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, our findings reveal that while some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, others remain persistently effective. We release all associated code.
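The abstract names time-stretching and echo addition as example manipulations but does not specify their implementation. As a minimal illustrative sketch (not the paper's actual attack code), two such time-domain transformations could be written in NumPy as follows; all parameter names and default values here are assumptions:

```python
import numpy as np

def add_echo(x: np.ndarray, sr: int, delay_s: float = 0.15, decay: float = 0.4) -> np.ndarray:
    """Mix a delayed, attenuated copy of the signal back into itself.

    delay_s and decay are hypothetical defaults, not values from the paper.
    """
    d = int(delay_s * sr)
    y = np.concatenate([x.astype(np.float64), np.zeros(d)])
    y[d:] += decay * x  # delayed echo tap
    # Renormalize so the perturbed signal stays in [-1, 1]
    return y / max(1.0, np.max(np.abs(y)))

def time_stretch(x: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Naive time-stretch via linear-interpolation resampling.

    Note: this also shifts pitch; a phase vocoder would preserve it.
    """
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)
```

Such transformations require no gradient or score access to the detector, which is what makes the evaluation black-box: the modified audio is simply fed to the classifier and the verdict observed.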