Personalized Speech Recognition for Children with Test-Time Adaptation

📅 2024-09-19
🏛️ arXiv.org
🤖 AI Summary
Automatic speech recognition (ASR) models pre-trained on adult speech generalize poorly to children's speech because of the domain shift between the two. This paper introduces an ASR pipeline that applies unsupervised test-time adaptation (TTA) to child speech. Without manual annotations or supervised fine-tuning on children's data, the pipeline continuously adapts the model to each child speaker at test time, using the TTA methods SUTA and SGEM. The adapted models significantly outperform the unadapted off-the-shelf baselines in word error rate (WER), both on average and across individual child speakers. The analysis also finds significant domain shifts both between child speakers and within each child speaker, which motivates adapting at test time rather than only at training time.

📝 Abstract
Accurate automatic speech recognition (ASR) for children is crucial for effective real-time child-AI interaction, especially in educational applications. However, off-the-shelf ASR models primarily pre-trained on adult data tend to generalize poorly to children's speech due to the data domain shift from adults to children. Recent studies have found that supervised fine-tuning on children's speech data can help bridge this domain shift, but human annotations may be impractical to obtain for real-world applications and adaptation at training time can overlook additional domain shifts occurring at test time. We devised a novel ASR pipeline to apply unsupervised test-time adaptation (TTA) methods for child speech recognition, so that ASR models pre-trained on adult speech can be continuously adapted to each child speaker at test time without further human annotations. Our results show that ASR models adapted with TTA methods significantly outperform the unadapted off-the-shelf ASR baselines both on average and statistically across individual child speakers. Our analysis also discovered significant data domain shifts both between child speakers and within each child speaker, which further motivates the need for test-time adaptation.
Problem

Research questions and friction points this paper is trying to address.

Adapting ASR models for child speech domain shifts
Studying TTA for personalized child speech recognition
Evaluating TTA effectiveness on child speech variability
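Evaluating TTA across individual children comes down to computing word error rate (WER) per speaker and comparing adapted against unadapted models. The following is a minimal sketch of per-speaker WER using word-level edit distance. It is not the paper's evaluation code: the function names, the record layout, the lowercasing step, and the example transcripts are all assumptions made for illustration.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def per_speaker_wer(records):
    """Return {speaker: WER}, pooling errors and reference words per speaker.

    `records` is a hypothetical list of (speaker_id, reference, hypothesis).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for speaker, reference, hypothesis in records:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[speaker] += edit_distance(ref, hyp)
        words[speaker] += len(ref)
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


# Made-up transcripts, for illustration only.
records = [
    ("child_a", "the cat sat on the mat", "the cat sat on mat"),
    ("child_a", "i like red apples", "i like red apples"),
    ("child_b", "we went to the park", "we want to the bark"),
]
wer = per_speaker_wer(records)
```

Pooling errors over all of a speaker's utterances, rather than averaging per-utterance WER, keeps short utterances from dominating that speaker's score. Running the same function on baseline and adapted hypotheses gives paired per-speaker values that a significance test can then compare.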
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time adaptation for child speech recognition
Unsupervised adaptation using SUTA and SGEM methods
Improves ASR accuracy across individual child speakers despite high speaker variability
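SUTA and SGEM both build on entropy minimization: the model is updated at test time so that its own predictions on the incoming audio become more confident, with no labels involved. The toy sketch below illustrates only that core idea. It is not the paper's pipeline or either method's actual implementation: the real methods adapt parameters inside a neural ASR model and add further objectives, whereas here the only adapted parameter is a hypothetical per-class logit bias, and the "utterance" is a made-up list of frame logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(frames, bias):
    """Average per-frame entropy after adding a shared bias to the logits."""
    shifted = ([z + b for z, b in zip(frame, bias)] for frame in frames)
    return sum(entropy(softmax(f)) for f in shifted) / len(frames)


def adapt_bias(frames, steps=20, lr=0.1):
    """Gradient descent on mean frame entropy w.r.t. a per-class logit bias.

    No labels are used: the objective depends only on the model's outputs.
    The bias is a stand-in (assumption) for the parameters a real TTA
    method would update.
    """
    n_classes = len(frames[0])
    bias = [0.0] * n_classes
    for _ in range(steps):
        grad = [0.0] * n_classes
        for frame in frames:
            probs = softmax([z + b for z, b in zip(frame, bias)])
            h = entropy(probs)
            for k, p in enumerate(probs):
                if p > 0.0:
                    # For softmax outputs: dH/dz_k = -p_k * (log p_k + H)
                    grad[k] -= p * (math.log(p) + h)
        bias = [b - lr * g / len(frames) for b, g in zip(bias, grad)]
    return bias


# Toy "utterance": three frames of logits over three output classes.
utterance = [[2.0, 1.0, 0.1], [1.5, 1.2, 0.3], [0.2, 0.1, 2.5]]
adapted = adapt_bias(utterance)
before = mean_entropy(utterance, [0.0, 0.0, 0.0])
after = mean_entropy(utterance, adapted)
```

For per-speaker adaptation in the spirit of the abstract, one plausible design is to carry the adapted parameters forward from one utterance to the next for the same child, and to reset them to the pre-trained values when the speaker changes.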
Authors
Zhonghao Shi
University of Southern California, Los Angeles, USA
Harshvardhan Srivastava
Columbia University, New York, USA
Xuan Shi
University of Southern California, Los Angeles, USA
Shrikanth S. Narayanan
University of Southern California, Los Angeles, USA
Maja Matarić
University of Southern California, Los Angeles, USA