Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation

📅 2025-05-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of enhancing the robustness and generalization of automatic speech recognition (ASR) models under low-resource, few-shot settings. We systematically investigate the factors driving transcription generalization and identify acoustic diversity, not linguistic diversity, as the primary determinant. Building on this insight, we propose the first acoustic-perception-oriented data augmentation paradigm, eliminating reliance on large-scale labeled datasets. Leveraging LibriSpeech (960 hours), we design controlled acoustic perturbations, including speech distortions, reverberation, and noise, to enrich acoustic variability. Evaluated on unseen test sets, our approach achieves up to a 19.24% relative reduction in word error rate (WER), significantly outperforming conventional augmentation methods. This work advances understanding of the acoustic basis of ASR generalization and empirically demonstrates that highly robust foundation models can be trained effectively using only medium-scale data.

📝 Abstract
Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods can significantly improve the generalization ability of ASR models, reducing word error rates by up to 19.24 percent on unseen datasets when training on the 960-hour LibriSpeech dataset. These findings highlight strategic, acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential path toward future foundation ASR models when massive human speech data is lacking.
Problem

Research questions and friction points this paper is trying to address.

Reducing ASR model reliance on massive training data
Improving robustness via acoustic-aware data augmentation
Enhancing generalization with targeted acoustic variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Acoustic-aware data augmentation enhances ASR robustness
Targeted acoustic augmentation reduces word-error rates
Strategic acoustic focus replaces massive datasets need