🤖 AI Summary
This work addresses the performance limitations of automatic speech recognition (ASR) for children's speech, which stem from data scarcity and domain mismatch between pre-trained self-supervised learning (SSL) models and the target child speech domain. To overcome this, the authors propose Delta SSL embeddings, defined as the difference between embeddings from a fine-tuned model and those from the original pre-trained model, to capture task-specific information. They further integrate features from multiple SSL models, including HuBERT, WavLM, and wav2vec 2.0 (W2V2). Experimental results on the MyST children's corpus demonstrate that fusing WavLM with Delta W2V2 embeddings achieves a word error rate (WER) of 9.64%, establishing a new state of the art for SSL-based ASR on this task and yielding up to a 10% relative WER reduction over fine-tuned embedding fusion.
📝 Abstract
Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pre-training domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a fine-tuned model and those from its pre-trained counterpart, encode task-specific information that complements fine-tuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different SSL models. Results show that delta embedding fusion with WavLM yields up to a 10% relative WER reduction for HuBERT and a 4.4% reduction for W2V2, compared to fine-tuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64%, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.
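The core idea, delta embeddings fused with another model's features, can be sketched in a few lines. This is a minimal illustration with random arrays standing in for frame-level SSL features; in the paper these would come from actual pre-trained and fine-tuned W2V2/WavLM models, and the exact fusion operator (concatenation is assumed here) may differ from the authors' implementation.

```python
import numpy as np

# Hypothetical stand-ins for frame-level SSL embeddings (T frames x D dims).
# In practice these would be extracted from real SSL models.
rng = np.random.default_rng(0)
T, D = 5, 8
emb_pretrained = rng.normal(size=(T, D))                 # pre-trained W2V2 features
emb_finetuned = emb_pretrained + rng.normal(scale=0.1,   # same model after
                                            size=(T, D)) # fine-tuning on child speech

# Delta embeddings: fine-tuned minus pre-trained representations
delta = emb_finetuned - emb_pretrained

# Fusion (assumed here to be concatenation) with another SSL model's
# fine-tuned features, e.g. WavLM, before the downstream ASR head
emb_wavlm = rng.normal(size=(T, D))
fused = np.concatenate([emb_wavlm, delta], axis=-1)
print(fused.shape)  # (5, 16): per-frame WavLM + delta W2V2 features
```

The delta isolates what fine-tuning changed in the representation, so fusing it with a second model's features adds task-specific signal without duplicating the generic acoustic information both pre-trained models already share.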