🤖 AI Summary
To address weak generalization across diverse speech variations and unstable pretraining in automatic speech recognition (ASR), this paper proposes an improved BEST-RQ self-supervised ASR framework. The method integrates BERT-style speech pretraining, random projection quantization, and cross-entropy optimization. Its key contributions are: (1) a joint loss function incorporating KL-divergence regularization to mitigate codebook collapse and enhance representation discriminability; and (2) a per-cluster multi-codebook quantization mechanism based on low-level feature clustering, improving acoustic modeling robustness and convergence speed. Evaluated on LibriSpeech, the approach achieves relative WER reductions of 23.8% on test-clean and 30.6% on test-other compared to baseline methods. Moreover, both pretraining and fine-tuning converge faster, with significantly improved training stability.
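The joint loss described above combines the standard cross-entropy over masked-frame code targets with a KL-divergence regularizer. The summary does not spell out the exact formulation, so the sketch below assumes one common choice: a KL term between the batch-average predicted code distribution and a uniform prior, which penalizes the degenerate solution where only a few codes are ever predicted (codebook collapse). The function and weight names are illustrative, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(logits, targets, kl_weight=0.1):
    """Cross-entropy on masked-frame code targets plus a KL regularizer.

    Sketch only: the KL term here pushes the batch-average predicted code
    distribution toward uniform, discouraging codebook collapse. The paper's
    exact KL formulation and weighting are not given in the summary.
    logits:  (frames, codebook_size) predictions at masked positions
    targets: (frames,) quantizer-assigned code indices
    """
    probs = softmax(logits)                                   # (T, V)
    ce = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9))
    avg = probs.mean(axis=0)                                  # code usage
    uniform = np.full_like(avg, 1.0 / avg.shape[0])
    kl = np.sum(avg * np.log(avg / uniform + 1e-9))           # KL(avg || U)
    return ce + kl_weight * kl
```

With `kl_weight=0`, this reduces to the plain BEST-RQ cross-entropy objective, so the regularizer can be ablated directly.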
📝 Abstract
Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech yield a relative improvement of 11.2% on test-clean by using multiple codebooks; combining cross-entropy with Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on test-other using 6 codebooks. Furthermore, the proposed setup leads to faster convergence in pre-training and fine-tuning and additionally stabilizes the pre-training.
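The per-cluster multi-codebook extension can be pictured as follows: each low-level feature frame is first assigned to a cluster (e.g., by k-means on the same features), and is then quantized by that cluster's own frozen random projection and random codebook, as in BEST-RQ. The sketch below illustrates this under stated assumptions; the shapes, the nearest-neighbor rule over l2-normalized vectors, and all names are assumptions, since the abstract does not detail the implementation.

```python
import numpy as np

def rpq_labels(features, projections, codebooks, cluster_ids):
    """Per-cluster random-projection quantization (sketch).

    features:     (T, D) low-level feature frames
    projections:  (C, D, P) one frozen random projection per cluster
    codebooks:    (C, V, P) one frozen random codebook per cluster
    cluster_ids:  (T,) cluster assignment per frame (e.g. from k-means on
                  the low-level features)
    Returns (T,) discrete targets for BERT-style masked prediction.
    """
    labels = np.empty(len(features), dtype=np.int64)
    for t, (x, c) in enumerate(zip(features, cluster_ids)):
        z = x @ projections[c]                        # project frame
        z = z / (np.linalg.norm(z) + 1e-9)            # l2-normalize
        cb = codebooks[c]
        cb = cb / (np.linalg.norm(cb, axis=1, keepdims=True) + 1e-9)
        labels[t] = np.argmin(np.linalg.norm(cb - z, axis=1))  # nearest code
    return labels
```

Because projections and codebooks stay frozen, the targets are cheap to compute and stable across training; the clustering only decides which codebook handles each frame.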