🤖 AI Summary
Children's speech recognition faces significant challenges: high acoustic and linguistic variability, scarce labeled data, and substantial acoustic-phonetic divergence from adult speech. Speech foundation models can adapt to this domain without fine-tuning via Speech In-Context Learning (SICL), but SICL's effectiveness hinges on how in-context examples are selected. The authors extend an existing retrieval-based selection method, Text-Embedding KNN for SICL (TICL), with a cross-modal re-ranking step to create TICL+: a coarse text-embedding k-nearest-neighbor (k-NN) retrieval is followed by re-ranking with wav2vec 2.0 acoustic features, so that the selected examples are both semantically and acoustically aligned with the test input. Evaluated on four benchmark children's speech datasets, TICL+ reduces word error rate by up to 53.3% relative to zero-shot decoding and by 37.6% relative to baseline TICL, demonstrating the value of jointly exploiting semantic and acoustic similarity.
📝 Abstract
Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
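The retrieve-then-rerank selection at the core of TICL+ can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding models, the cosine-similarity scoring, and the weighted blend with a hypothetical `alpha` parameter are all assumptions; the paper's exact re-ranking criterion may differ.

```python
import numpy as np

def cosine_sim(query, candidates):
    # Cosine similarity between one query vector and a matrix of candidate vectors.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def retrieve_and_rerank(query_text_emb, query_audio_emb,
                        cand_text_embs, cand_audio_embs,
                        k_coarse=50, k_final=8, alpha=0.5):
    """Two-stage in-context example selection (illustrative sketch).

    Stage 1: coarse k-NN over text embeddings (as in TICL).
    Stage 2: re-rank the survivors with acoustic embeddings (the TICL+ step).
    `alpha` (assumed here) blends text and acoustic similarity.
    """
    # Stage 1: keep the k_coarse candidates closest in text-embedding space.
    text_scores = cosine_sim(query_text_emb, cand_text_embs)
    coarse = np.argsort(-text_scores)[:k_coarse]

    # Stage 2: score survivors acoustically (e.g. wav2vec 2.0 features)
    # and re-rank by a weighted blend of the two similarities.
    audio_scores = cosine_sim(query_audio_emb, cand_audio_embs[coarse])
    blended = alpha * text_scores[coarse] + (1 - alpha) * audio_scores
    order = np.argsort(-blended)[:k_final]
    return coarse[order]  # indices of the selected in-context examples
```

With `alpha=1.0` the second stage is a no-op and the procedure reduces to plain TICL ranking; lowering `alpha` shifts weight toward acoustic alignment, which is what prioritizes examples matching a child speaker's voice characteristics.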