DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification

📅 2025-01-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Adversarial attacks pose a critical security threat to speaker identification (SID) systems. To probe this threat, the paper proposes DiffAttack, a timbre-reserved adversarial attack built on a diffusion-based voice conversion model. The core innovation is the integration of adversarial constraints directly into the reverse diffusion process: inspired by the Gaussian noise shared by conventional adversarial attacks and diffusion sampling, the constraints steer the generated sample toward the target speaker distribution while preserving speaker-wise timbre characteristics. Evaluated on LibriTTS, the approach achieves a significantly higher attack success rate than existing baselines. Subjective evaluation yields a Mean Opinion Score (MOS) above 4.2, and automatic speech recognition (ASR) accuracy remains at 92.3%, demonstrating both high naturalness and strong spoofing capability. The work establishes a novel paradigm for robustness assessment of voice-based biometric authentication systems.

📝 Abstract
As a form of biometric identification, the speaker identification (SID) system demands strong security. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for both humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the generative process of the diffusion-based voice conversion model, we craft fake samples that effectively mislead target models while preserving speaker-wise characteristics. Specifically, inspired by the use of randomly sampled Gaussian noise in both conventional adversarial attacks and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. These constraints subtly guide the reverse diffusion process toward aligning with the target speaker distribution. Our experiments on the LibriTTS dataset indicate that DiffAttack significantly improves the attack success rate compared to vanilla DiffVC and other methods. Moreover, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the speech quality generated by the DiffVC model.
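The abstract's key idea, adding an adversarial constraint term to the reverse diffusion update so that sampling drifts toward the target speaker distribution, can be sketched in a few lines. Everything below (the Langevin-style update rule, `adv_weight`, and the toy score and gradient) is a hypothetical illustration under simplified assumptions, not the authors' implementation:

```python
import numpy as np

def guided_reverse_step(x_t, score, adv_grad, step_size=0.1,
                        adv_weight=0.1, noise=None):
    """One annealed-Langevin-style reverse update with adversarial guidance.

    The model score is combined with an adversarial gradient that nudges
    the sample toward the target speaker's decision region. The update
    rule, `adv_weight`, and `adv_grad` are illustrative assumptions, not
    the paper's exact formulation.
    """
    if noise is None:
        noise = np.zeros_like(x_t)  # deterministic demo: no injected noise
    guided_score = score + adv_weight * adv_grad
    return x_t + step_size * guided_score + np.sqrt(2.0 * step_size) * noise

# Toy latent: the "denoising" score pulls toward 0, while the adversarial
# gradient (a stand-in for d log p(target speaker | x) / dx) pulls toward +1.
x = np.full(4, 0.5)
for _ in range(100):
    score = -x                  # score of a standard normal prior
    adv_grad = np.ones_like(x)  # constant pull toward the target class
    x = guided_reverse_step(x, score, adv_grad)
# The guided fixed point shifts from 0 to adv_weight = 0.1.
```

The point of the toy example is the interaction of the two terms: with `adv_weight = 0` the sampler converges to the prior mode, while the adversarial term shifts the stationary point toward the target-speaker direction without replacing the generative score, mirroring how the paper keeps DiffVC's speech quality while changing the SID attribution.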
Problem

Research questions and friction points this paper is trying to address.

Speaker Recognition Security
Adversarial Attack
Voice Naturalness Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiffAttack
Speaker Recognition Evasion
High-fidelity Synthetic Speech
Qing Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xian, China
Jixun Yao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xian, China
Zhaokai Sun
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xian, China
Pengcheng Guo
Northwestern Polytechnical University
Speech Recognition · Machine Learning · Deep Learning
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xian, China
John H.L. Hansen
Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, USA