Robust Unsupervised Adaptation of a Speech Recogniser Using Entropy Minimisation and Speaker Codes

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unsupervised speaker adaptation for automatic speech recognition (ASR) in low-resource settings—specifically, adapting a pre-trained ASR model to a new speaker using only one minute of unlabeled speech. The proposed method enhances robustness through three key innovations: (1) a conditional entropy minimization loss over a full hypothesis set to mitigate pseudo-labeling errors; (2) a lightweight, learnable speaker embedding vector that drastically reduces data dependency; and (3) integration of multi-hypothesis decoding with noise-augmented training. Evaluated on a noise-augmented Common Voice dataset, the approach achieves a 20% relative reduction in word error rate (WER) using just one minute of adaptation data, rising to 29% with ten minutes—substantially outperforming existing unsupervised adaptation methods. The framework demonstrates strong generalization with minimal speaker-specific supervision, making it particularly suitable for real-world low-data scenarios.
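The core of innovation (1) is replacing cross-entropy against a single pseudo-label with the entropy of the model's posterior over an N-best list of complete hypotheses. A minimal sketch of that loss, assuming the recogniser provides an unnormalised log-score per hypothesis (the normalisation and function name here are illustrative, not the paper's implementation):

```python
import numpy as np

def conditional_entropy(log_scores):
    """Entropy of the posterior over an N-best hypothesis list.

    log_scores: unnormalised log-likelihoods, one per complete hypothesis.
    A softmax over them approximates P(hypothesis | audio); the loss is
    H = -sum_h P(h | x) * log P(h | x). Minimising H sharpens the model's
    belief over whole hypotheses instead of committing to one
    error-prone pseudo-label.
    """
    log_scores = np.asarray(log_scores, dtype=float)
    log_p = log_scores - np.max(log_scores)   # stabilise the softmax
    p = np.exp(log_p)
    p /= p.sum()
    # 'where' guards against log(0) for numerically zero probabilities.
    return float(-np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p))))

# A confident recogniser (one dominant hypothesis) has low entropy;
# a confused one (near-flat scores) has high entropy.
sharp = conditional_entropy([0.0, -10.0, -10.0])
flat = conditional_entropy([0.0, 0.0, 0.0])
```

Adaptation then amounts to gradient descent on this loss with respect to the speaker-specific parameters only, which is what keeps it usable without labels.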

📝 Abstract
Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or "pseudo-label", this paper proposes a novel loss function, the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate on one minute of adaptation data, increasing to 29% on 10 minutes.
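The "speaker code" idea rests on a parameter-count argument: the pre-trained network stays frozen, and only a short per-speaker vector is estimated, so one minute of speech is enough data. A toy sketch of the injection (layer sizes, the additive injection point, and all names are assumptions for illustration; the paper's exact architecture may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained layer: these weights do NOT change during adaptation.
W = rng.standard_normal((256, 80))     # ~20k frozen parameters (toy scale)

# The speaker code is the only trainable quantity: a short vector added
# to the layer's pre-activation, so very little data suffices to fit it.
speaker_code = np.zeros(256)

def adapted_forward(features):
    """features: (n_frames, 80) filterbank frames from one speaker."""
    return np.tanh(features @ W.T + speaker_code)

frames = rng.standard_normal((100, 80))  # stand-in for the adaptation audio
out = adapted_forward(frames)
```

In the paper's setting, the code would be estimated by minimising the conditional-entropy loss on the unlabelled adaptation data; the tiny dimensionality relative to the frozen network is what makes that estimate robust.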
Problem

Research questions and friction points this paper is trying to address.

Adapt speech recognisers to new speakers with minimal data
Improve robustness using conditional entropy over multiple hypotheses
Utilize speaker codes for efficient adaptation with little data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional entropy loss for robust adaptation
Speaker codes for efficient parameter estimation
Multiple hypotheses make adaptation robust to initial recognition errors