HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To improve the robustness of speech foundation models for automatic speech recognition (ASR) in noisy environments, this paper adds variance-invariance-covariance regularization (VICReg) objectives to HuBERT. The three terms jointly constrain the variance, invariance, and covariance of noisy speech representations, adjusting their statistics so that the model captures diverse acoustic characteristics and generalizes better across noise types. On LibriSpeech, the resulting model, HuBERT-VIC, achieves relative word error rate (WER) improvements of 23.3% on test-clean and 13.2% on test-other over a baseline pre-trained on noisy speech.
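
For readers unfamiliar with how relative figures like these are usually computed, the conventional definition is sketched below. It assumes the reported metric is WER, where lower is better. The symbols are illustrative and not taken from the paper.

```latex
% Relative WER improvement of a model over a baseline (conventional definition).
% WER_base and WER_model are illustrative symbols, not the paper's notation.
\[
  \Delta_{\mathrm{rel}}
  = \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{model}}}
         {\mathrm{WER}_{\mathrm{base}}} \times 100\%
\]
% Hypothetical numbers, not results from the paper:
% WER_base = 10.00%, WER_model = 7.67%  =>  (10.00 - 7.67) / 10.00 = 23.3%
```

Under this definition, a relative improvement says nothing about the absolute WER values themselves, which is why the same percentage can correspond to very different absolute gaps on test-clean and test-other.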

📝 Abstract
Noise robustness in speech foundation models (SFMs) remains a critical challenge: most models are trained primarily on clean data and degrade when exposed to noisy speech. To address this issue, we propose HuBERT-VIC, a noise-robust SFM trained with variance, invariance, and covariance regularization (VICReg) objectives. These objectives adjust the statistics of noisy speech representations, enabling the model to capture diverse acoustic characteristics and improving its ability to generalize across different types of noise. When applied to HuBERT, our model shows relative performance improvements of 23.3% on LibriSpeech test-clean and 13.2% on test-other, compared to the baseline model pre-trained on noisy speech.
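
The three objectives named in the abstract can be illustrated with a minimal sketch of the generic VICReg loss as it is commonly formulated for two batches of embeddings. This is not the authors' implementation: how HuBERT-VIC forms the two views, which layers it regularizes, and which weights it uses are not stated on this page. The function names and the default values for `lam`, `mu`, `nu`, `gamma`, and `eps` are assumptions chosen for illustration.

```python
import math

# Generic VICReg-style loss on two batches of embeddings (lists of rows).
# Illustrative sketch only; weights and thresholds are assumed defaults,
# not values reported for HuBERT-VIC.

def covariance_matrix(z):
    """Unbiased d x d covariance matrix of an n x d batch."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on each dimension whose std falls below gamma."""
    cov = covariance_matrix(z)
    d = len(cov)
    return sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d

def covariance_term(z):
    """Sum of squared off-diagonal covariances, scaled by 1/d."""
    cov = covariance_matrix(z)
    d = len(cov)
    return sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b) / d

def invariance_term(z_a, z_b):
    """Mean squared Euclidean distance between paired embeddings."""
    n = len(z_a)
    return sum(sum((x - y) ** 2 for x, y in zip(ra, rb))
               for ra, rb in zip(z_a, z_b)) / n

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the invariance, variance, and covariance terms."""
    inv = invariance_term(z_a, z_b)
    var = variance_term(z_a) + variance_term(z_b)
    cov = covariance_term(z_a) + covariance_term(z_b)
    return lam * inv + mu * var + nu * cov

if __name__ == "__main__":
    # Hypothetical 4 x 2 batches standing in for two views of the same frames.
    view_a = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
    view_b = [[1.9, 0.1], [-2.1, 0.0], [0.1, 2.0], [0.0, -1.9]]
    print(vicreg_loss(view_a, view_b))
```

In this formulation the invariance term pulls the paired embeddings together, the variance term keeps every dimension from collapsing to a constant, and the covariance term discourages redundant, correlated dimensions. A batch in which all embeddings are identical, for example, incurs no invariance or covariance penalty but a large variance penalty.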
Problem

Research questions and friction points this paper is trying to address.

Enhancing noise robustness in speech foundation models
Improving performance on noisy speech recognition
Generalizing across diverse acoustic noise types
Innovation

Methods, ideas, or system contributions that make the work stand out.

VICReg objectives for noise robustness
Adjusts noisy speech representation statistics
Improves generalization across noise types
Hyebin Ahn
School of Electrical Engineering, KAIST, Republic of Korea
Kangwook Jang
School of Electrical Engineering, KAIST, Republic of Korea
Hoirin Kim
Professor of Electrical Engineering, KAIST