AI Summary
AI-based voice synthesis poses significant risks of voiceprint theft and spoofing, yet existing time-frequency-domain adversarial defenses are vulnerable to speech enhancement techniques. Method: We propose a novel embedding-space robust adversarial defense paradigm: gradient-guided adversarial perturbations are injected into the semantic embedding layer of pre-trained speech encoders (e.g., Wav2Vec 2.0), followed by differentiable reconstruction to generate high-fidelity protected speech. Contribution/Results: This is the first work to design adversarial perturbations directly at the embedding level, overcoming the fragility bottleneck of conventional signal-domain defenses while preserving both robustness and speech naturalness. Our method achieves over 70% improvement in defense success rate across four state-of-the-art TTS models and attains a 99.5% defense rate against commercial voiceprint APIs, even under strong speech enhancement attacks. User studies confirm its perceptual naturalness and practical utility.
Abstract
With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others' voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals and then reconstructs them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attacks. Moreover, RoVo's perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios.
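As a rough illustration of the embedding-space idea described in the abstract, the following toy sketch perturbs an embedding rather than the waveform and then decodes it back to a signal. It is not the RoVo implementation: the linear maps `enc`, `dec`, and `spk` are hypothetical stand-ins for a pre-trained speech encoder, a differentiable reconstruction module, and a speaker-verification model, and the values of `eps`, `steps`, and `lr` are arbitrary assumptions. The objective used here (lowering cosine similarity to the original speaker embedding with signed-gradient steps) is likewise an assumed choice, not one stated in the source.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, with a small constant to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def protect(x, enc, dec, spk, eps=0.5, steps=20, lr=0.05):
    """Toy embedding-space perturbation (all components are hypothetical).

    x   : signal frame, shape (d,)
    enc : (k, d) linear stand-in for a speech encoder
    dec : (d, k) linear stand-in for the reconstruction module
    spk : (m, d) linear stand-in for a speaker-embedding extractor
    Returns the reconstructed "protected" signal and the perturbation used.
    """
    z = enc @ x                      # embedding of the clean signal
    s0 = spk @ x                     # speaker embedding of the clean signal
    delta = np.zeros_like(z)         # perturbation lives in embedding space
    best_delta = delta.copy()
    best_sim = cosine(spk @ (dec @ z), s0)

    for _ in range(steps):
        s = spk @ (dec @ (z + delta))
        ns = np.linalg.norm(s) + 1e-12
        n0 = np.linalg.norm(s0) + 1e-12
        # Gradient of cosine(s, s0) with respect to s.
        g_s = s0 / (ns * n0) - (s @ s0) * s / (ns ** 3 * n0)
        # Chain rule through the two linear maps back to the embedding.
        g_delta = dec.T @ (spk.T @ g_s)
        # Signed-gradient step that lowers similarity, kept within [-eps, eps].
        delta = np.clip(delta - lr * np.sign(g_delta), -eps, eps)
        sim = cosine(spk @ (dec @ (z + delta)), s0)
        if sim < best_sim:           # keep the best perturbation seen so far
            best_delta, best_sim = delta.copy(), sim

    return dec @ (z + best_delta), best_delta
```

In this sketch the perturbation is bounded per embedding coordinate rather than per audio sample, which mirrors the abstract's point that the defense operates on embedding vectors instead of the raw signal. Whether such a perturbation survives speech enhancement is an empirical question that this toy example does not address.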