Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning

📅 2024-09-01
🏛️ Interspeech
📈 Citations: 1
Influential: 0
🤖 AI Summary
Children’s speech recognition faces challenges including data scarcity—particularly for non-English languages—and high acoustic variability; phoneme recognition for French children’s speech remains especially underexplored. This work presents the first systematic evaluation of wav2vec 2.0, HuBERT, and WavLM on French children’s phoneme recognition. We propose a full-network fine-tuning strategy for WavLM-base+, unfreezing all Transformer layers, integrated with Connectionist Temporal Classification (CTC) end-to-end modeling and comprehensive noise robustness analysis. Experiments demonstrate that the adapted WavLM significantly outperforms the baseline Transformer+CTC model, achieving superior generalization and stability on both real-world children’s reading speech and multi-level noisy conditions. Our approach establishes a transferable self-supervised learning paradigm for low-resource children’s speech recognition.
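The fine-tuning strategy described above (freeze the CNN feature extractor, unfreeze every Transformer block, train end-to-end with a CTC objective) can be sketched with a toy PyTorch model. The architecture, dimensions, and names below are illustrative stand-ins, not the paper's actual WavLM-base+ code:

```python
import torch
import torch.nn as nn

class PhonemeCTCModel(nn.Module):
    """Toy stand-in for an SSL encoder (CNN front-end + Transformer) with a CTC head."""
    def __init__(self, n_phonemes=40, d_model=64):
        super().__init__()
        # CNN feature extractor (typically kept frozen during SSL fine-tuning)
        self.feature_extractor = nn.Conv1d(1, d_model, kernel_size=10, stride=5)
        # Transformer blocks -- the paper's key move is unfreezing ALL of these
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Linear head over the phoneme inventory (+1 for the CTC blank symbol)
        self.ctc_head = nn.Linear(d_model, n_phonemes + 1)

    def forward(self, wav):                            # wav: (batch, samples)
        x = self.feature_extractor(wav.unsqueeze(1))   # (batch, d_model, frames)
        x = self.encoder(x.transpose(1, 2))            # (batch, frames, d_model)
        return self.ctc_head(x).log_softmax(-1)        # CTC log-probabilities

def set_finetuning_mode(model, full_network=True):
    """Freeze the CNN front-end; optionally unfreeze every Transformer block."""
    for p in model.feature_extractor.parameters():
        p.requires_grad = False
    for p in model.encoder.parameters():
        p.requires_grad = full_network

model = PhonemeCTCModel()
set_finetuning_mode(model, full_network=True)

# One CTC training step on random audio (shapes only; not real speech)
wav = torch.randn(2, 4000)
log_probs = model(wav)                              # (batch, frames, n_phonemes+1)
targets = torch.randint(1, 41, (2, 12))             # phoneme label sequences
input_lens = torch.full((2,), log_probs.size(1), dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)
# nn.CTCLoss expects (frames, batch, classes)
loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1), targets,
                           input_lens, target_lens)
loss.backward()
```

With `full_network=False`, the same helper reproduces the common baseline of training only the CTC head on top of frozen Transformer features, which is the configuration the paper's full-network variant is compared against.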

📝 Abstract
Child speech recognition is still an underdeveloped area of research due to the lack of data (especially for non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels.
Index Terms: speech recognition, child speech, self-supervised learning
Problem

Research questions and friction points this paper is trying to address.

Improving phoneme recognition in children's speech using self-supervised models.
Adapting WavLM base+ for French child speech recognition.
Enhancing robustness in reading tasks and noise conditions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted self-supervised models for phoneme recognition.
Unfrozen transformer blocks enhance WavLM base+ performance.
WavLM base+ is robust across varied reading tasks and noise conditions.
Lucas Block Medin
Lalilo by Renaissance Learning, IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3, Toulouse, France
Thomas Pellegrini
Lecturer, University of Toulouse, IRIT
Automatic speech and audio processing
Lucile Gelin
Lalilo by Renaissance Learning, IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3, Toulouse, France