Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the adversarial robustness of modern speech enhancement systems and reveals a critical security vulnerability: state-of-the-art models are highly susceptible to semantic-level adversarial attacks crafted under psychoacoustic masking constraints. The authors propose a novel adversarial example generation method that jointly applies psychoacoustic modeling and gradient-based optimization, targeting mainstream predictive models (e.g., DCCRN, SEGAN). For comparison, they analyze the intrinsic robustness of diffusion-based models (e.g., DiffWave, VoiceFixer), which rely on stochastic sampling. Experiments show that conventional models suffer severe semantic degradation under attack: even though SNR improves, ASR error rates increase by 47.3% on average. Diffusion models, in contrast, are markedly more resilient (only +6.1% ASR error), which the authors attribute to the implicit regularization of denoising and multi-step stochastic reconstruction. This work provides the first empirical evidence of an inherent security advantage of diffusion architectures in speech enhancement, establishing a new paradigm for designing robust speech processing systems.
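The attack ingredients named in the summary, gradient-based optimization under a psychoacoustic masking budget, can be sketched on a toy linear model. This is purely illustrative: the linear map, the step size, and the per-sample masking proxy are assumptions made here for a self-contained demo, not the paper's implementation (a real attack would backpropagate through a model such as DCCRN or SEGAN).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a differentiable "enhancement" model: a fixed linear
# map, so the gradient is available in closed form.
n = 16
W = rng.standard_normal((n, n)) / np.sqrt(n)

def enhance(x):
    return W @ x

x_clean = rng.standard_normal(n)   # original input signal
y_target = rng.standard_normal(n)  # attacker's desired "enhanced" output

# Crude masking proxy: louder samples tolerate larger perturbations.
# A real psychoacoustic model would derive this budget per frequency band.
mask = 0.2 * np.abs(x_clean) + 0.01

# Projected gradient descent: push the model output toward the target
# while keeping the perturbation under the masking budget.
delta = np.zeros(n)
for _ in range(500):
    residual = enhance(x_clean + delta) - y_target
    grad = W.T @ residual                # gradient of 0.5*||W(x+d) - y||^2
    delta -= 0.05 * grad                 # step toward the attacker's target
    delta = np.clip(delta, -mask, mask)  # project back under the mask

loss_clean = np.linalg.norm(enhance(x_clean) - y_target)
loss_adv = np.linalg.norm(enhance(x_clean + delta) - y_target)
```

The projection step is what keeps the perturbation "masked": every sample of `delta` stays within its budget, yet the enhanced output moves measurably toward the attacker's target.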

📝 Abstract
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
Problem

Research questions and friction points this paper is trying to address.

- Modern speech enhancement systems are vulnerable to adversarial attacks
- Adversarial noise can manipulate the semantic meaning of enhanced speech
- Diffusion models show inherent robustness against these attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Adversarial noise injection into speech enhancement inputs
- Psychoacoustically masked attacks that alter semantic meaning
- Demonstration that diffusion models are inherently robust to such attacks
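Why stochastic sampling blunts such attacks can be seen in a minimal toy, again an illustration rather than the paper's models: a perturbation crafted to flip a hard deterministic decision changes a noise-averaged (diffusion-style) decision only gradually.

```python
import numpy as np

def deterministic_model(x):
    # Stand-in for a deterministic predictive model: a hard decision at 0.
    return float(np.sign(x))

def stochastic_model(x, sigma=0.5, n_samples=100_000, seed=0):
    # Stand-in for a diffusion-style sampler: average the same hard
    # decision over many runs, each with fresh sampling noise.
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return float(np.mean(np.sign(x + sigma * z)))

x = 0.1       # clean input, just above the decision boundary
delta = -0.2  # perturbation crafted to push it across the boundary

det_change = abs(deterministic_model(x + delta) - deterministic_model(x))
sto_change = abs(stochastic_model(x + delta) - stochastic_model(x))
# det_change: the deterministic decision flips outright (a change of 2.0);
# sto_change: the noise-averaged output shifts only by a fraction of that.
```

The averaging over injected noise acts like a smoothing of the model's decision surface, which is one intuition behind the resilience attributed to multi-step stochastic reconstruction.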