BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant performance degradation of automatic speech recognition (ASR) in low-resource and out-of-domain scenarios, this paper proposes BEARD, a novel framework that introduces the BEST-RQ self-supervised learning objective into domain adaptation of the Whisper encoder for the first time, combined with knowledge distillation from a frozen teacher encoder so that the adapted encoder remains complementary to the unchanged pre-trained decoder. BEARD uses the frozen pre-trained Whisper encoder as the teacher during self-supervised pre-training on unlabeled speech, then fine-tunes using only 2 hours of labeled data. On the ATCO2 air traffic control corpus, leveraging about 5,000 hours of unlabeled speech yields a 12% relative word error rate reduction over standard fine-tuning. Key contributions: (1) the first self-supervised domain adaptation paradigm tailored to Whisper; and (2) improved cross-domain robustness without modifying the decoder, which is particularly beneficial for non-native speech, noisy environments, and terminology-heavy domains.

📝 Abstract
Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms the previous baseline and the fine-tuned model, achieving a relative improvement of 12% over the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
Problem

Research questions and friction points this paper is trying to address.

Adapting Whisper ASR to low-resource domains using unlabeled data
Addressing speech recognition challenges in noisy ATC communications
Improving ASR performance with self-supervised learning and distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BEARD framework adapts Whisper encoder using unlabeled data
Combines BEST-RQ objective with knowledge distillation from teacher
Ensures encoder complementarity with pre-trained decoder
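The combination described above can be illustrated with a minimal sketch. BEST-RQ derives discrete targets for masked prediction from a frozen random projection and a frozen random codebook; BEARD adds a distillation term pulling the student encoder toward the frozen teacher encoder's outputs. All dimensions, the loss weighting `alpha`, and the MSE distillation form below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random projection and codebook, as in BEST-RQ.
# Dimensions are illustrative, not the paper's settings.
FEAT_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 32
projection = rng.normal(size=(FEAT_DIM, PROJ_DIM))    # never trained
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM)) # never trained

def bestrq_targets(features: np.ndarray) -> np.ndarray:
    """Map each speech frame to the index of its nearest codebook entry."""
    z = features @ projection                                   # (T, PROJ_DIM)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)           # unit-normalize
    c = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    dists = np.linalg.norm(z[:, None, :] - c[None, :, :], axis=-1)
    return dists.argmin(axis=-1)                                # (T,) discrete targets

def combined_loss(logits, targets, mask, student_out, teacher_out, alpha=0.5):
    """Masked-prediction cross-entropy plus teacher distillation (hypothetical weighting).

    Cross-entropy is computed only on masked frames; the MSE term keeps the
    student encoder close to the frozen Whisper teacher encoder.
    """
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets][mask].mean()
    distill = ((student_out - teacher_out) ** 2).mean()
    return alpha * ce + (1 - alpha) * distill
```

Because the projection and codebook are frozen random tensors, target generation needs no training of its own, which is what makes BEST-RQ attractive for large unlabeled corpora.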
Raphael Bagat
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
I. Illina
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Emmanuel Vincent
Senior Research Scientist, Inria
speech & audio