🤖 AI Summary
To address the limited robustness of automatic speech recognition (ASR) systems in two-speaker scenarios, this paper proposes the Selective Masking Adversarial (SMA) attack, described as the first targeted masking attack designed specifically for dual-source speech. SMA interferes selectively at the speaker level: in overlapped utterances, the ASR system still recognizes the target speaker's speech while the interfering speaker's speech is suppressed. The optimization algorithm starts from a Gaussian-initialized perturbation and refines it with iterative gradient updates, combining adaptation to the Conformer-CTC model with signal-to-noise ratio (SNR) constraints to balance attack success rate against audio fidelity. Evaluated on a Conformer-CTC ASR model, SMA achieves a 100% targeted attack success rate at an average SNR of 37.15 dB, outperforming the baseline methods. The work offers a new approach to security evaluation of multi-speaker ASR systems.
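To make the fidelity figure concrete, here is a minimal sketch of how an SNR in dB is commonly computed for an additive perturbation. It assumes the standard definition, 10·log10 of the signal-to-perturbation power ratio; the paper may measure SNR differently, and the function name `snr_db` is a hypothetical helper, not something taken from the paper.

```python
import numpy as np


def snr_db(clean: np.ndarray, perturbation: np.ndarray) -> float:
    """SNR in dB of `clean` relative to an additive `perturbation`.

    Assumes the standard definition 10 * log10(P_signal / P_noise);
    this is an illustrative helper, not the paper's code.
    """
    signal_power = np.sum(clean.astype(np.float64) ** 2)
    noise_power = np.sum(perturbation.astype(np.float64) ** 2)
    if noise_power == 0:
        return float("inf")  # no perturbation at all
    return float(10.0 * np.log10(signal_power / noise_power))


# A perturbation at one tenth of the signal amplitude has 1/100 of its power.
clean = np.ones(100)
perturbation = 0.1 * np.ones(100)
print(round(snr_db(clean, perturbation), 6))
```

Under this definition, an SNR of 37.15 dB corresponds to a power ratio of roughly 5,200 to 1, that is, a perturbation whose RMS amplitude is about 1.4% of the audio's.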
📝 Abstract
Extensive research has shown that Automatic Speech Recognition (ASR) systems are vulnerable to audio adversarial attacks. Current attacks mainly focus on single-source scenarios and ignore dual-source scenarios, in which two people speak simultaneously. To bridge this gap, we propose the Selective Masking Adversarial (SMA) attack, which ensures that, in a dual-source scenario, one audio source is selected for recognition while the other is muted. To better fit the dual-source scenario, the SMA attack constructs the normal dual-source audio from the muted audio and the selected audio. It initializes the adversarial perturbation with small Gaussian noise and iteratively optimizes it using a selective masking optimization algorithm. Extensive experiments demonstrate that the SMA attack generates effective and imperceptible audio adversarial examples in the dual-source scenario, achieving an average attack success rate of 100% and a signal-to-noise ratio of 37.15 dB on Conformer-CTC, outperforming the baselines.
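The overall shape of the procedure (build the dual-source mixture, start from small Gaussian noise, refine iteratively) can be illustrated with a toy sketch. Everything specific here is an assumption, not the paper's method: the matrix `W` is a hypothetical linear stand-in for the ASR model, the squared-error loss replaces the selective masking objective (which the abstract does not spell out), and the SNR floor is enforced by a simple norm projection. The function name `sma_toy_sketch` and all parameters are illustrative.

```python
import numpy as np


def sma_toy_sketch(selected, muted, W, steps=100, min_snr_db=30.0, seed=0):
    """Toy projected-gradient loop (NOT the paper's algorithm).

    W is a hypothetical linear stand-in for the ASR model. The loss pushes
    the model's view of (mixture + delta) toward its view of `selected` alone.
    """
    rng = np.random.default_rng(seed)
    mixture = selected + muted   # normal dual-source audio
    target = W @ selected        # features of the selected source alone

    # Largest perturbation norm that still satisfies the assumed SNR floor.
    radius = np.linalg.norm(mixture) * 10.0 ** (-min_snr_db / 20.0)
    # Step size 1/L, with L = 2 * ||W||_2^2 the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)

    def project(d):
        # Rescale the perturbation back onto the SNR-feasible ball if needed.
        norm = np.linalg.norm(d)
        return d if norm <= radius else d * (radius / norm)

    def loss(d):
        return float(np.sum((W @ (mixture + d) - target) ** 2))

    # Small Gaussian initialization, as described in the abstract.
    delta = project(rng.normal(0.0, 1e-4, size=mixture.shape))
    history = [loss(delta)]
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ (mixture + delta) - target)
        delta = project(delta - step * grad)
        history.append(loss(delta))
    return delta, history


rng = np.random.default_rng(1)
selected = rng.normal(0.0, 0.1, 256)   # synthetic stand-in for the selected speaker
muted = rng.normal(0.0, 0.1, 256)      # synthetic stand-in for the speaker to mute
W = rng.normal(0.0, 1.0, (16, 256))    # hypothetical linear "model"
delta, history = sma_toy_sketch(selected, muted, W)
print(history[0] > history[-1])
```

In this linear toy the SNR floor limits how far the loss can fall, because cancelling the muted source outright would require a large perturbation. The attack described in the abstract instead targets a Conformer-CTC recognizer, where a perturbation at 37.15 dB SNR is reported to be sufficient.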