🤖 AI Summary
To address weak generalization across diverse speech variations and unstable pretraining in automatic speech recognition (ASR), this paper proposes an improved BEST-RQ self-supervised ASR framework. The method integrates BERT-style speech pretraining, random projection quantization, and cross-entropy optimization. Its key contributions are: (1) a joint loss function incorporating KL-divergence regularization to mitigate codebook collapse and enhance representation discriminability; and (2) a per-cluster multi-codebook quantization mechanism based on low-level feature clustering, improving acoustic modeling robustness and convergence speed. Evaluated on LibriSpeech, the approach achieves relative WER reductions of 23.8% on test-clean and 30.6% on test-other compared to baseline methods. Moreover, both pretraining and fine-tuning converge faster, with significantly improved training stability.
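The joint loss described above combines the standard cross-entropy over masked-frame code targets with a KL-divergence regularizer. The summary does not spell out the exact formulation, so the sketch below assumes one common choice: a KL term between the batch-average predicted code distribution and a uniform prior, which penalizes the degenerate solution where only a few codes are ever predicted (codebook collapse). The function and weight names are illustrative, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(logits, targets, kl_weight=0.1):
    """Cross-entropy on masked-frame code targets plus a KL regularizer.

    Sketch only: the KL term here pushes the batch-average predicted code
    distribution toward uniform, discouraging codebook collapse. The paper's
    exact KL formulation and weighting are not given in the summary.
    logits:  (frames, codebook_size) predictions at masked positions
    targets: (frames,) quantizer-assigned code indices
    """
    probs = softmax(logits)                                   # (T, V)
    ce = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9))
    avg = probs.mean(axis=0)                                  # code usage
    uniform = np.full_like(avg, 1.0 / avg.shape[0])
    kl = np.sum(avg * np.log(avg / uniform + 1e-9))           # KL(avg || U)
    return ce + kl_weight * kl
```

With `kl_weight=0`, this reduces to the plain BEST-RQ cross-entropy objective, so the regularizer can be ablated directly.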
📝 Abstract
Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech yield a relative improvement of 11.2% on test-clean by using multiple codebooks; combining cross-entropy with Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on test-other using 6 codebooks. Furthermore, the proposed setup leads to faster convergence in pre-training and fine-tuning and additionally stabilizes the pre-training.
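The per-cluster multi-codebook extension can be pictured as follows: each low-level feature frame is first assigned to a cluster (e.g., by k-means on the same features), and is then quantized by that cluster's own frozen random projection and random codebook, as in BEST-RQ. The sketch below illustrates this under stated assumptions; the shapes, the nearest-neighbor rule over l2-normalized vectors, and all names are assumptions, since the abstract does not detail the implementation.

```python
import numpy as np

def rpq_labels(features, projections, codebooks, cluster_ids):
    """Per-cluster random-projection quantization (sketch).

    features:     (T, D) low-level feature frames
    projections:  (C, D, P) one frozen random projection per cluster
    codebooks:    (C, V, P) one frozen random codebook per cluster
    cluster_ids:  (T,) cluster assignment per frame (e.g. from k-means on
                  the low-level features)
    Returns (T,) discrete targets for BERT-style masked prediction.
    """
    labels = np.empty(len(features), dtype=np.int64)
    for t, (x, c) in enumerate(zip(features, cluster_ids)):
        z = x @ projections[c]                        # project frame
        z = z / (np.linalg.norm(z) + 1e-9)            # l2-normalize
        cb = codebooks[c]
        cb = cb / (np.linalg.norm(cb, axis=1, keepdims=True) + 1e-9)
        labels[t] = np.argmin(np.linalg.norm(cb - z, axis=1))  # nearest code
    return labels
```

Because projections and codebooks stay frozen, the targets are cheap to compute and stable across training; the clustering only decides which codebook handles each frame.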