🤖 AI Summary
Speaker verification systems are vulnerable to adversarial perturbations, posing serious security risks in real-world deployment. To address this, we propose the first text-conditioned masked diffusion model for adversarial detection and purification, which requires neither adversarial training nor large-scale pretraining. Our method models the degradation-reconstruction process directly on mel-spectrograms: in the forward process, localized regions are masked and progressively corrupted with noise; in the reverse process, denoising and reconstruction are guided by the text of the transcription. This design targets both detection robustness and speech fidelity. Extensive experiments demonstrate that our approach significantly outperforms existing diffusion-based and neural codec-based methods across multiple benchmarks. After purification, speaker verification accuracy recovers to near-clean levels (average improvement >25%), achieving, for the first time, text-guided, lightweight, end-to-end trainable adversarial speech purification.
📝 Abstract
Speaker verification systems are increasingly deployed in security-sensitive applications but remain highly vulnerable to adversarial perturbations. In this work, we propose the Mask Diffusion Detector (MDD), a novel adversarial detection and purification framework based on a text-conditioned masked diffusion model. During training, MDD applies partial masking to Mel-spectrograms and progressively adds noise through a forward diffusion process, simulating the degradation of clean speech features. A reverse process then reconstructs the clean representation conditioned on the input transcription. Unlike prior approaches, MDD does not require adversarial examples or large-scale pretraining. Experimental results show that MDD achieves strong adversarial detection performance and outperforms prior state-of-the-art methods, including both diffusion-based and neural codec-based approaches. Furthermore, MDD effectively purifies adversarially manipulated speech, restoring speaker verification performance to levels close to those observed under clean conditions. These findings demonstrate the potential of diffusion-based masking strategies for secure and reliable speaker verification systems.
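To make the forward process described above more concrete, the following is a minimal sketch of partial masking followed by noising on a Mel-spectrogram. It is not the authors' implementation. The abstract does not specify the masking pattern, the noise schedule, or how masking and noise interact, so everything here is an assumption made for illustration: masking whole time frames, a standard DDPM-style linear beta schedule, noise applied only inside the masked region, and the hypothetical helper names `linear_alpha_bar`, `random_time_mask`, and `masked_forward_diffusion`. The text-conditioned reverse process is not shown.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule.

    Assumption: a DDPM-style schedule; the source does not state which one is used.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def random_time_mask(n_mels, n_frames, mask_ratio, rng):
    """Boolean mask that selects a random subset of time frames (all Mel bins).

    Assumption: frame-level masking; the source only says "partial masking".
    """
    n_masked = int(round(mask_ratio * n_frames))
    frames = rng.choice(n_frames, size=n_masked, replace=False)
    mask = np.zeros((n_mels, n_frames), dtype=bool)
    mask[:, frames] = True
    return mask


def masked_forward_diffusion(mel, mask, t, alpha_bar, rng):
    """Noise the masked region to diffusion step t and keep the rest unchanged.

    Uses the closed form x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps,
    where a_t is the cumulative product stored in alpha_bar[t].
    """
    eps = rng.standard_normal(mel.shape)
    noisy = np.sqrt(alpha_bar[t]) * mel + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.where(mask, noisy, mel), eps


# Usage with a synthetic stand-in for a log-Mel spectrogram (80 bins, 100 frames).
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))
alpha_bar = linear_alpha_bar(num_steps=1000)
mask = random_time_mask(80, 100, mask_ratio=0.3, rng=rng)
x_t, eps = masked_forward_diffusion(mel, mask, t=500, alpha_bar=alpha_bar, rng=rng)
print(x_t.shape, int(mask[0].sum()))
```

In a training loop of this kind, a denoising network would typically receive `x_t`, the step `t`, the mask, and an embedding of the transcription, and be trained to recover the clean features in the masked region. That pairing is a common pattern for conditional diffusion models and is stated here as an assumption, not as a detail confirmed by the abstract.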