🤖 AI Summary
Children's speech recognition faces significant challenges: high acoustic and linguistic variability, scarce labeled data, and substantial acoustic-phonetic divergence from adult speech. Speech foundation models can adapt to this domain without fine-tuning via Speech In-Context Learning (SICL), but SICL's effectiveness hinges on how in-context examples are selected. The authors extend an existing retrieval-based selection method, Text-Embedding KNN for SICL (TICL), with a cross-modal re-ranking step to create TICL+: a coarse text-embedding k-nearest-neighbor (k-NN) retrieval is followed by re-ranking with wav2vec 2.0 acoustic features, so that the selected examples are both semantically and acoustically aligned with the test input. Evaluated on four benchmark children's speech datasets, TICL+ reduces word error rate by up to 53.3% relative to zero-shot decoding and by 37.6% relative to baseline TICL, demonstrating the value of jointly exploiting semantic and acoustic similarity.
📝 Abstract
Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
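The retrieve-then-rerank selection at the core of TICL+ can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding models, the cosine-similarity scoring, and the weighted blend with a hypothetical `alpha` parameter are all assumptions; the paper's exact re-ranking criterion may differ.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a matrix of candidate vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def retrieve_and_rerank(query_text_emb, query_audio_emb,
                        cand_text_embs, cand_audio_embs,
                        k_coarse=50, k_final=8, alpha=0.5):
    """Two-stage in-context example selection (illustrative sketch).

    Stage 1: coarse k-NN over text embeddings (as in TICL).
    Stage 2: re-rank the survivors with acoustic embeddings (the TICL+ step).
    `alpha` (assumed here) blends text and acoustic similarity.
    """
    # Stage 1: keep the k_coarse candidates closest in text-embedding space.
    text_scores = cosine_sim(query_text_emb, cand_text_embs)
    coarse = np.argsort(-text_scores)[:k_coarse]

    # Stage 2: score survivors acoustically (e.g. wav2vec 2.0 features)
    # and re-rank by a weighted blend of the two similarities.
    audio_scores = cosine_sim(query_audio_emb, cand_audio_embs[coarse])
    blended = alpha * text_scores[coarse] + (1 - alpha) * audio_scores
    order = np.argsort(-blended)[:k_final]
    return coarse[order]  # indices of the selected in-context examples
```

With `alpha=1.0` the second stage is a no-op and the procedure reduces to plain TICL ranking; lowering `alpha` shifts weight toward acoustic alignment, which is what prioritizes examples matching a child speaker's voice characteristics.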