Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the adversarial robustness of modern speech enhancement systems and reveals a critical security vulnerability: state-of-the-art models are highly susceptible to semantic-level adversarial attacks crafted under psychoacoustic masking constraints. The authors propose a novel adversarial example generation method that jointly applies psychoacoustic modeling and gradient-based optimization, targeting mainstream predictive models (e.g., DCCRN, SEGAN). For comparison, they analyze the intrinsic robustness of diffusion-based models (e.g., DiffWave, VoiceFixer), which rely on stochastic sampling. Experiments show that conventional models suffer severe semantic degradation under attack: even though SNR improves, ASR error rates increase by 47.3% on average. Diffusion models, in contrast, are markedly more resilient (only +6.1% ASR error), which the authors attribute to the implicit regularization of denoising and multi-step stochastic reconstruction. This work provides the first empirical evidence of an inherent security advantage of diffusion architectures in speech enhancement, establishing a new paradigm for designing robust speech processing systems.
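The attack ingredients named in the summary, gradient-based optimization under a psychoacoustic masking budget, can be sketched on a toy linear model. This is purely illustrative: the linear map, the step size, and the per-sample masking proxy are assumptions made here for a self-contained demo, not the paper's implementation (a real attack would backpropagate through a model such as DCCRN or SEGAN).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a differentiable "enhancement" model: a fixed linear
# map, so the gradient is available in closed form.
n = 16
W = rng.standard_normal((n, n)) / np.sqrt(n)

def enhance(x):
    return W @ x

x_clean = rng.standard_normal(n)   # original input signal
y_target = rng.standard_normal(n)  # attacker's desired "enhanced" output

# Crude masking proxy: louder samples tolerate larger perturbations.
# A real psychoacoustic model would derive this budget per frequency band.
mask = 0.2 * np.abs(x_clean) + 0.01

# Projected gradient descent: push the model output toward the target
# while keeping the perturbation under the masking budget.
delta = np.zeros(n)
for _ in range(500):
    residual = enhance(x_clean + delta) - y_target
    grad = W.T @ residual                # gradient of 0.5*||W(x+d) - y||^2
    delta -= 0.05 * grad                 # step toward the attacker's target
    delta = np.clip(delta, -mask, mask)  # project back under the mask

loss_clean = np.linalg.norm(enhance(x_clean) - y_target)
loss_adv = np.linalg.norm(enhance(x_clean + delta) - y_target)
```

The projection step is what keeps the perturbation "masked": every sample of `delta` stays within its budget, yet the enhanced output moves measurably toward the attacker's target.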

📝 Abstract
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
Problem

Research questions and friction points this paper is trying to address.

- Modern speech enhancement systems are vulnerable to adversarial attacks
- Adversarial noise can manipulate the semantic meaning of enhanced speech
- Diffusion models show inherent robustness against these attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Adversarial noise injection into speech enhancement inputs
- Psychoacoustically masked attacks that alter semantic meaning
- Demonstration that diffusion models are inherently robust to such attacks
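Why stochastic sampling blunts such attacks can be seen in a minimal toy, again an illustration rather than the paper's models: a perturbation crafted to flip a hard deterministic decision changes a noise-averaged (diffusion-style) decision only gradually.

```python
import numpy as np

def deterministic_model(x):
    # Stand-in for a deterministic predictive model: a hard decision at 0.
    return float(np.sign(x))

def stochastic_model(x, sigma=0.5, n_samples=100_000, seed=0):
    # Stand-in for a diffusion-style sampler: average the same hard
    # decision over many runs, each with fresh sampling noise.
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return float(np.mean(np.sign(x + sigma * z)))

x = 0.1       # clean input, just above the decision boundary
delta = -0.2  # perturbation crafted to push it across the boundary

det_change = abs(deterministic_model(x + delta) - deterministic_model(x))
sto_change = abs(stochastic_model(x + delta) - stochastic_model(x))
# det_change: the deterministic decision flips outright (a change of 2.0);
# sto_change: the noise-averaged output shifts only by a fraction of that.
```

The averaging over injected noise acts like a smoothing of the model's decision surface, which is one intuition behind the resilience attributed to multi-step stochastic reconstruction.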