🤖 AI Summary
Children’s speech recognition faces challenges including data scarcity—particularly for non-English languages—and high acoustic variability; phoneme recognition for French children’s speech remains especially underexplored. This work presents the first systematic evaluation of wav2vec 2.0, HuBERT, and WavLM on French children’s phoneme recognition. We propose a full-network fine-tuning strategy for WavLM base+, unfreezing all Transformer layers, integrated with Connectionist Temporal Classification (CTC) end-to-end modeling and a comprehensive noise-robustness analysis. Experiments demonstrate that the adapted WavLM significantly outperforms the baseline Transformer+CTC model, generalizing better and more stably across real-world children’s reading speech and multiple noise levels. Our approach establishes a transferable self-supervised learning paradigm for low-resource children’s speech recognition.
📝 Abstract
Child speech recognition is still an underdeveloped area of research due to the lack of data (especially for non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its Transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels.

Index Terms: speech recognition, child speech, self-supervised learning
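The adaptation strategy the abstract describes—keeping the convolutional feature extractor frozen while leaving every Transformer block and the CTC head trainable—can be illustrated with a minimal PyTorch sketch. This is an assumption-laden toy model, not the paper's code: the architecture dimensions, the `PhonemeCTCModel` class, and the phoneme inventory size are all hypothetical stand-ins for a real WavLM base+ checkpoint.

```python
# Hedged sketch (not the paper's implementation): a CTC phoneme recognizer
# with a frozen CNN frontend standing in for WavLM's feature extractor and
# fully trainable ("unfrozen") Transformer blocks on top.
import torch
import torch.nn as nn

NUM_PHONEMES = 39  # assumed French phoneme inventory; index 0 is the CTC blank


class PhonemeCTCModel(nn.Module):
    def __init__(self, dim=96, layers=2, heads=4):
        super().__init__()
        # CNN frontend downsampling raw waveform into frame-level features
        self.frontend = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.ctc_head = nn.Linear(dim, NUM_PHONEMES)

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))  # (batch, dim, frames)
        x = self.encoder(x.transpose(1, 2))  # (batch, frames, dim)
        return self.ctc_head(x).log_softmax(-1)


model = PhonemeCTCModel()
# "Unfreezing the Transformer blocks": only the frontend is frozen;
# every encoder layer and the CTC head keep requires_grad=True.
for p in model.frontend.parameters():
    p.requires_grad = False

logits = model(torch.randn(2, 16000))  # two dummy 1-second utterances
# nn.CTCLoss expects (frames, batch, classes) log-probabilities
loss = nn.CTCLoss(blank=0)(
    logits.transpose(0, 1),
    torch.randint(1, NUM_PHONEMES, (2, 12)),   # dummy phoneme targets
    input_lengths=torch.full((2,), logits.size(1)),
    target_lengths=torch.full((2,), 12),
)
```

In a real setup the frontend and encoder would be loaded from pretrained WavLM base+ weights; the point of the sketch is only the freezing pattern, which is what the paper's full-network fine-tuning changes relative to head-only fine-tuning.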