An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems

📅 2026-01-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work uses adversarial evasion attacks on speaker recognition systems as a countermeasure against the indiscriminate use of voice data, including deepfakes, where legal protections are still insufficient. The authors propose masked energy perturbation (MEP), which uses the power spectrum of the original audio to mask low-energy regions in the frequency domain before generating adversarial perturbations, targeting areas that are less noticeable under the human auditory model. The method is evaluated mainly on the speaker recognition models ECAPA-TDNN and ResNet34 and is reported to perform strongly in both audio quality and evasion effectiveness. In the perceptual evaluation of speech quality (PESQ), the abstract reports a relative performance of 26.68% for MEP compared with the fast gradient sign method (FGSM) and iterative FGSM, which the authors present as evidence that the energy masking keeps perceptual distortion low.
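For reference, the FGSM baseline named above is usually written as shown below, together with its iterative variant. The notation is the standard textbook form and is an assumption here, not taken from the paper: x is the input waveform, y the true speaker label, L the model loss, ε the perturbation budget, and α the per-iteration step size.

```latex
% FGSM: a single signed-gradient step of size \epsilon
x^{\mathrm{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(x, y)\big)

% Iterative FGSM: repeated smaller steps, clipped back into the \epsilon-ball around x
x_{t+1} = \operatorname{clip}_{x,\epsilon}\big(x_t + \alpha \cdot \operatorname{sign}(\nabla_x \mathcal{L}(x_t, y))\big), \qquad x_0 = x
```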

📝 Abstract
Evasion attacks pose significant threats to AI systems, exploiting vulnerabilities in machine learning models to bypass detection mechanisms. The widespread use of voice data, including deepfakes, in promising future industries is currently hindered by insufficient legal frameworks. Adversarial attack methods have emerged as the most effective countermeasure against the indiscriminate use of such data. This research introduces masked energy perturbation (MEP), a novel approach using power spectrum for energy masking of original voice data. MEP applies masking to small energy regions in the frequency domain before generating adversarial perturbations, targeting areas less noticeable to the human auditory model. The study primarily employs advanced speaker recognition models, including ECAPA-TDNN and ResNet34, which have shown remarkable performance in speaker verification tasks. The proposed MEP method demonstrated strong performance in both audio quality and evasion effectiveness. The energy masking approach effectively minimizes the perceptual evaluation of speech quality (PESQ) degradation, indicating that minimal perceptual distortion occurs to the human listener despite the adversarial perturbations. Specifically, in the PESQ evaluation, the relative performance of the MEP method was 26.68% when compared to the fast gradient sign method (FGSM) and iterative FGSM.
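The energy-masking idea in the abstract can be illustrated with a minimal NumPy sketch. This is one possible reading, not the authors' implementation: it assumes the mask selects the lowest-power FFT bins of the whole utterance, that an FGSM-style signed step is then confined to those bins, and that the loss gradient is supplied by the caller. The function names `low_energy_mask` and `masked_perturbation` and the parameters `keep_fraction` and `epsilon` are hypothetical.

```python
import numpy as np


def low_energy_mask(signal, keep_fraction=0.5):
    """Return a boolean mask over rFFT bins, True for low-power bins.

    Assumption: "small energy regions" means the bins whose power falls at
    or below the `keep_fraction` quantile of the utterance's power spectrum.
    """
    power = np.abs(np.fft.rfft(signal)) ** 2
    threshold = np.quantile(power, keep_fraction)
    return power <= threshold


def masked_perturbation(signal, grad, epsilon=0.002, keep_fraction=0.5):
    """Add an FGSM-style step restricted to the low-energy bins.

    `grad` stands in for the gradient of a speaker model's loss with respect
    to the waveform; obtaining it from a real model is outside this sketch.
    """
    mask = low_energy_mask(signal, keep_fraction)
    step = epsilon * np.sign(grad)            # plain FGSM step in the time domain
    step_spec = np.fft.rfft(step)
    step_spec[~mask] = 0.0                    # drop components in high-energy bins
    masked_step = np.fft.irfft(step_spec, n=len(signal))
    # Note: after filtering, individual samples may exceed epsilon in magnitude.
    return signal + masked_step
```

In this sketch the high-energy bins of the original audio are left untouched, so the perturbation lands only where the signal carries little power. A real pipeline would more likely work frame by frame on a short-time spectrum and take the gradient from a model such as ECAPA-TDNN; both are left out here.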
Problem

Research questions and friction points this paper is trying to address.

adversarial evasion attacks
speaker recognition
misclassification
energy masking
voice data
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked energy perturbation
adversarial evasion attack
speaker recognition
energy masking
PESQ