Optimized Self-supervised Training with BEST-RQ for Speech Recognition

📅 2025-01-27
📈 Citations: 0
Influential Citations: 0
📄 PDF
🤖 AI Summary
To address weak generalization across diverse speech variations and unstable pretraining in automatic speech recognition (ASR), this paper proposes an improved BEST-RQ self-supervised ASR framework. The method integrates BERT-style speech pretraining, random projection quantization, and cross-entropy optimization. Its key contributions are: (1) a joint loss function incorporating KL-divergence regularization to mitigate codebook collapse and enhance representation discriminability; and (2) a per-cluster multi-codebook quantization mechanism based on low-level feature clustering, improving acoustic modeling robustness and convergence speed. Evaluated on LibriSpeech, the approach achieves relative WER reductions of up to 23.8% on test-clean and 30.6% on test-other over the BEST-RQ baseline when using 6 codebooks. Moreover, both pretraining and fine-tuning converge faster, with significantly improved training stability.
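The joint objective can be pictured as the standard BEST-RQ cross-entropy term plus a KL-divergence regularizer. The PyTorch sketch below is an illustrative reading of that idea, not the authors' implementation: it assumes the KL term pushes the batch-averaged distribution of predicted codes toward a uniform prior to discourage codebook collapse, and the weight `kl_weight` is a hypothetical hyperparameter.

```python
# Hypothetical sketch of the joint pre-training loss: cross-entropy against
# the random-projection codebook targets plus a KL-divergence regularizer
# that pushes average code usage toward a uniform distribution to discourage
# codebook collapse. The weighting factor and the uniform prior are
# assumptions, not details taken from the paper.
import torch
import torch.nn.functional as F


def best_rq_joint_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       kl_weight: float = 0.1) -> torch.Tensor:
    """logits: (num_masked_frames, codebook_size) predictions at masked positions.
    targets: (num_masked_frames,) integer codes from the random-projection quantizer.
    """
    # Standard BEST-RQ objective: predict the quantizer's code at masked frames.
    ce = F.cross_entropy(logits, targets)

    # Average predicted code distribution over the batch of masked frames.
    avg_usage = F.softmax(logits, dim=-1).mean(dim=0)

    # KL(avg_usage || uniform): small when codes are used evenly,
    # large when predictions collapse onto a few codes.
    codebook_size = logits.size(-1)
    uniform = torch.full_like(avg_usage, 1.0 / codebook_size)
    kl = torch.sum(avg_usage * (torch.log(avg_usage + 1e-8) - torch.log(uniform)))

    return ce + kl_weight * kl
```

In this reading, the regularizer only shapes the predicted code distribution; the random-projection quantizer itself stays frozen, as in standard BEST-RQ.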

📝 Abstract
Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a per-cluster multi-codebook extension derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech result in a relative improvement of 11.2% on test-clean by using multiple codebooks; utilizing a combination of cross-entropy and Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on test-other using 6 codebooks. Furthermore, the proposed setup leads to faster convergence in pre-training and fine-tuning and additionally stabilizes the pre-training.
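To make the multi-codebook extension concrete, the following sketch shows one plausible way to route each frame to a codebook via k-means clustering of low-level features and then quantize it with a frozen random projection, as in BEST-RQ. The class and parameter names (e.g. `MultiCodebookRandomProjectionQuantizer`, `feat_dim`, `num_codebooks=6`) and the use of scikit-learn's KMeans are illustrative assumptions, not details from the paper.

```python
# A minimal sketch (not the authors' code) of a random-projection quantizer
# extended to multiple codebooks, where each frame is routed to a codebook
# according to the k-means cluster of its low-level features.
import numpy as np
from sklearn.cluster import KMeans


class MultiCodebookRandomProjectionQuantizer:
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192,
                 num_codebooks=6, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen random projection and frozen random codebooks, as in BEST-RQ.
        self.projection = rng.standard_normal((feat_dim, code_dim))
        self.codebooks = rng.standard_normal((num_codebooks, codebook_size, code_dim))
        self.codebooks /= np.linalg.norm(self.codebooks, axis=-1, keepdims=True)
        self.kmeans = KMeans(n_clusters=num_codebooks, n_init=10, random_state=seed)

    def fit_clusters(self, features):
        # features: (num_frames, feat_dim) low-level features, e.g. log-mel frames.
        self.kmeans.fit(features)

    def quantize(self, features):
        # Route each frame to the codebook of its low-level feature cluster.
        cluster_ids = self.kmeans.predict(features)
        projected = features @ self.projection
        projected /= np.linalg.norm(projected, axis=-1, keepdims=True)
        # Nearest code (maximum cosine similarity) within the selected codebook.
        codes = np.array([
            np.argmax(self.codebooks[c] @ p) for c, p in zip(cluster_ids, projected)
        ])
        return cluster_ids, codes
```

Under these assumptions, the quantizer remains frozen throughout pre-training; only the routing of frames to codebooks depends on the low-level feature clusters, which is what allows each cluster to serve as a cross-entropy target space of its own.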
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Accuracy Improvement
Generalization Across Diverse Speech Variations
Pre-training Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Self-Supervised Training Method
Improved BEST-RQ with KL-Divergence Regularization and Per-Cluster Multi-Codebook Quantization
Self-Supervised Learning for Speech Recognition