MDD: a Mask Diffusion Detector to Protect Speaker Verification Systems from Adversarial Perturbations

📅 2025-08-26

📈 Citations: 0

✨ Influential: 0

career value

222K/year

🤖 AI Summary

Speaker verification systems are vulnerable to adversarial perturbations, posing serious security risks in real-world deployment. To address this, we propose the first text-conditioned masked diffusion model for adversarial detection and purification—requiring neither adversarial training nor large-scale pretraining. Our method models the degradation-reconstruction process directly on mel-spectrograms: in the forward process, localized regions are progressively masked with noise; in the reverse process, denoising and reconstruction are guided by text semantics. This design jointly optimizes detection robustness and speech fidelity. Extensive experiments demonstrate that our approach significantly outperforms existing diffusion-based and neural codec methods across multiple benchmarks. After purification, speaker verification accuracy recovers to near-clean levels (average improvement >25%), achieving, for the first time, text-guided, lightweight, end-to-end trainable adversarial speech purification.

Technology Category

Application Category

📝 Abstract

Speaker verification systems are increasingly deployed in security-sensitive applications but remain highly vulnerable to adversarial perturbations. In this work, we propose the Mask Diffusion Detector (MDD), a novel adversarial detection and purification framework based on a extit{text-conditioned masked diffusion model}. During training, MDD applies partial masking to Mel-spectrograms and progressively adds noise through a forward diffusion process, simulating the degradation of clean speech features. A reverse process then reconstructs the clean representation conditioned on the input transcription. Unlike prior approaches, MDD does not require adversarial examples or large-scale pretraining. Experimental results show that MDD achieves strong adversarial detection performance and outperforms prior state-of-the-art methods, including both diffusion-based and neural codec-based approaches. Furthermore, MDD effectively purifies adversarially-manipulated speech, restoring speaker verification performance to levels close to those observed under clean conditions. These findings demonstrate the potential of diffusion-based masking strategies for secure and reliable speaker verification systems.

Problem

Research questions and friction points this paper is trying to address.

Detecting adversarial perturbations in speaker verification systems

Purifying adversarially manipulated speech to restore system performance

Protecting security-sensitive applications from speech-based adversarial attacks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask Diffusion Detector for adversarial detection

Text-conditioned masked diffusion model

Reconstructs clean speech without adversarial examples

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection