🤖 AI Summary
This work addresses unsupervised speaker adaptation for automatic speech recognition (ASR) in low-resource settings—specifically, adapting a pre-trained ASR model to a new speaker using only one minute of unlabeled speech. The proposed method makes adaptation robust through two key ideas: (1) a conditional entropy loss computed over a full set of complete hypotheses, rather than cross-entropy on a single error-prone pseudo-label, which mitigates the effect of recognition errors; and (2) a compact, learnable "speaker code" vector that characterises a speaker with so few parameters that it can be estimated reliably from very little data. Evaluated on a far-field, noise-augmented version of Common Voice, the approach achieves a 20% relative reduction in word error rate (WER) using just one minute of adaptation data, rising to 29% with ten minutes. The framework requires minimal speaker-specific data, making it particularly suitable for real-world low-data scenarios.
📝 Abstract
Speech recognisers usually perform optimally only in a specific environment and need to be adapted to work well in another. For adaptation to a new speaker, there is often too little data for fine-tuning to be robust, and that data is usually unlabelled. This paper proposes a combination of approaches to make adaptation to a single minute of data robust. First, instead of estimating the adaptation parameters with cross-entropy on a single error-prone hypothesis or "pseudo-label", this paper proposes a novel loss function: the conditional entropy over complete hypotheses. Using multiple hypotheses makes adaptation more robust to errors in the initial recognition. Second, a "speaker code" characterises a speaker in a vector short enough that it requires little data to estimate. On a far-field noise-augmented version of Common Voice, the proposed scheme yields a 20% relative improvement in word error rate on one minute of adaptation data, increasing to 29% on 10 minutes.
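To make the first idea concrete, here is a minimal sketch of a conditional-entropy objective over an N-best hypothesis list. The function name and the example log-scores are illustrative, not the paper's implementation; the paper computes this over complete hypotheses from the recogniser, whereas this toy version just takes a list of hypothesis log-scores:

```python
import math

def conditional_entropy_loss(hypothesis_log_probs):
    """Entropy of the posterior over complete hypotheses (illustrative).

    Minimising this sharpens the model's distribution over the whole
    N-best list instead of committing to a single pseudo-label.
    """
    # Renormalise the N-best log-scores into a posterior over hypotheses.
    log_z = math.log(sum(math.exp(lp) for lp in hypothesis_log_probs))
    posteriors = [math.exp(lp - log_z) for lp in hypothesis_log_probs]
    # H = -sum p * log p over the hypothesis set.
    return -sum(p * math.log(p) for p in posteriors if p > 0)

# A confident recogniser (one dominant hypothesis) has low entropy;
# a near-uniform N-best list has entropy close to log(N).
confident = conditional_entropy_loss([-0.1, -5.0, -6.0])
uncertain = conditional_entropy_loss([-1.0, -1.1, -1.2])
```

Because the loss is low exactly when one hypothesis dominates, minimising it pushes the adapted model towards consistent, confident output without ever treating the top hypothesis as ground truth.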
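The second idea, the speaker code, can be illustrated with a toy forward pass. All dimensions, names, and the single-layer setup below are assumptions for illustration; the point is only that the per-speaker parameter count (the code) is tiny compared with the frozen network, so one minute of speech suffices to estimate it:

```python
import random

random.seed(0)
FEAT_DIM, CODE_DIM, HIDDEN_DIM = 40, 8, 16  # illustrative sizes

# Frozen pre-trained layer weights: never updated during adaptation.
W = [[random.gauss(0, 0.1) for _ in range(FEAT_DIM + CODE_DIM)]
     for _ in range(HIDDEN_DIM)]

# The speaker code is the only per-speaker adaptation parameter:
# CODE_DIM numbers, cheap to estimate from a minute of audio.
speaker_code = [0.0] * CODE_DIM

def adapted_layer(frame, code):
    """Append the speaker code to one acoustic frame, apply the frozen layer."""
    x = frame + code
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

frame = [random.gauss(0, 1) for _ in range(FEAT_DIM)]
hidden = adapted_layer(frame, speaker_code)

frozen_params = HIDDEN_DIM * (FEAT_DIM + CODE_DIM)  # weights left untouched
adapted_params = CODE_DIM                           # values tuned per speaker
```

Even in this toy setting the adapted parameters number 8 against 768 frozen weights; in a full recogniser the ratio is far more extreme, which is what makes one-minute adaptation feasible.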