🤖 AI Summary
This study investigates whether noise-augmented training can simultaneously improve the adversarial robustness and speech intelligibility of automatic speech recognition (ASR) systems. We conduct systematic experiments comparing multiple noise sources (white Gaussian noise, babble, music), injection strategies (time-domain vs. frequency-domain), and noise intensities under projected gradient descent (PGD) attacks guided by the CTC loss, evaluated on Conformer and fine-tuned Whisper models. We propose the first quantitative model linking noise-augmentation strategy to adversarial robustness and introduce a “robustness–intelligibility” trade-off evaluation framework that moves beyond conventional accuracy-only metrics. On LibriSpeech and VoxCeleb, our approach improves adversarial accuracy by 12.7% on average while degrading word error rate (WER) by less than 1.5%. The results show that frequency-domain noise injection significantly outperforms time-domain injection, and they reveal both the key mechanisms behind robustness gains and their practical limits.
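To make the distinction between the two injection strategies concrete, here is a minimal NumPy sketch (not the paper's implementation; function names and the SNR-matched white-noise source are illustrative assumptions): time-domain injection adds noise directly to the waveform, while frequency-domain injection perturbs the spectrum and inverts back.

```python
import numpy as np


def inject_noise_time(wave, snr_db, rng):
    """Time-domain injection: add white Gaussian noise to the raw
    waveform, scaled to hit a target signal-to-noise ratio in dB."""
    noise = rng.standard_normal(wave.shape)
    scale = np.sqrt(np.mean(wave ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return wave + scale * noise


def inject_noise_freq(wave, snr_db, rng):
    """Frequency-domain injection: perturb the real-FFT spectrum with
    complex Gaussian noise at the target spectral SNR, then invert
    back to a real waveform of the original length."""
    spec = np.fft.rfft(wave)
    noise = rng.standard_normal(spec.shape) + 1j * rng.standard_normal(spec.shape)
    scale = np.sqrt(
        np.mean(np.abs(spec) ** 2) / (np.mean(np.abs(noise) ** 2) * 10 ** (snr_db / 10))
    )
    return np.fft.irfft(spec + scale * noise, n=len(wave))


# Usage: a 440 Hz tone stands in for an utterance (16 kHz, 1 s).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy_t = inject_noise_time(clean, snr_db=20, rng=rng)
noisy_f = inject_noise_freq(clean, snr_db=20, rng=rng)
```

In a training pipeline, either function would be applied to each batch of clean audio before feature extraction; which one helps robustness more is exactly the comparison the study runs.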