TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition

📅 2025-12-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Children's speech recognition faces significant challenges: high acoustic and linguistic variability, scarcity of labeled data, and substantial acoustic-phonetic divergence from adult speech. To address these, the paper proposes TICL+, a fine-tuning-free Speech In-Context Learning (SICL) method that introduces acoustic alignment into retrieval-based exemplar selection: it first performs coarse text-embedding k-nearest-neighbor (k-NN) retrieval, then applies cross-modal re-ranking using wav2vec 2.0 acoustic features to jointly optimize semantic and acoustic similarity, thereby enhancing few-shot generalization. The authors present TICL+ as the first in-context learning approach for children's speech to leverage acoustic re-ranking, and it is compatible with LLM-based speech decoders. Evaluated on four benchmark children's speech datasets, TICL+ reduces word error rate by up to 53.3% relative to zero-shot baselines and by 37.6% relative to the prior Text-Embedding KNN for SICL (TICL), demonstrating the effectiveness of its multimodal alignment mechanism.

📝 Abstract
Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
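The two-stage selection the abstract describes (coarse text-embedding k-NN retrieval, then acoustic re-ranking) can be sketched as below. This is a minimal illustration, not the paper's implementation: the embedding models are assumed to be computed elsewhere (e.g. sentence embeddings of first-pass transcripts, and wav2vec 2.0 features for audio), and `k`, `n`, and the interpolation weight `alpha` are illustrative placeholders rather than the paper's hyperparameter values.

```python
import numpy as np

def cosine_sim(q, M):
    """Cosine similarity between query vector q and each row of matrix M."""
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ q

def select_examples(q_text, q_audio, pool_text, pool_audio,
                    k=20, n=4, alpha=0.5):
    """Pick n in-context exemplars for a test utterance.

    Stage 1 (TICL): retrieve the k pool items closest to the query in
    text-embedding space. Stage 2 (TICL+): re-rank those candidates by a
    combined semantic + acoustic score. alpha trades off the two terms
    (a hypothetical knob; the paper's weighting may differ).
    """
    # Stage 1: coarse k-NN retrieval on text embeddings.
    text_sim = cosine_sim(q_text, pool_text)
    cand = np.argsort(text_sim)[::-1][:k]
    # Stage 2: cross-modal re-ranking with acoustic similarity.
    audio_sim = cosine_sim(q_audio, pool_audio[cand])
    score = alpha * text_sim[cand] + (1 - alpha) * audio_sim
    return cand[np.argsort(score)[::-1][:n]]
```

The selected exemplars (audio-transcript pairs) would then be prepended to the test utterance as in-context examples for the speech foundation model, with no parameter updates.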
Problem

Research questions and friction points this paper is trying to address.

Improves children's speech recognition by addressing acoustic and linguistic variability
Enhances in-context example selection using acoustic reranking for better adaptation
Reduces word error rates by combining semantic and acoustic information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Acoustic reranking enhances example selection
Combines semantic and acoustic alignment for ASR
Improves children's speech recognition without fine-tuning
Haolong Zheng
University of Illinois at Urbana-Champaign
Yekaterina Yegorova
University of Illinois at Urbana-Champaign
Mark Hasegawa-Johnson
Professor of Electrical and Computer Engineering, University of Illinois
Speech · Audio · Natural Language Processing