TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In speech in-context learning (SICL), the absence of systematic methods for selecting context examples limits performance in complex speech scenarios. To address this, we propose a fine-tuning-free, semantics-driven example retrieval framework: for the first time, pretrained text embeddings are combined with k-nearest neighbors (KNN) retrieval to enable cross-modal semantic alignment for context example selection in SICL. The method requires no training; only off-the-shelf text encoders are used to retrieve semantically relevant context examples for a given speech input. Evaluated on challenging tasks, including accented English, multilingual speech, and children's speech, it achieves up to an 84.7% relative word error rate reduction over zero-shot baselines. The core contribution is an efficient, general-purpose, plug-and-play paradigm for speech context selection, establishing a new pathway for speech understanding in multimodal large language models.
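The retrieval step described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: a toy character-trigram hashing embedder stands in for the pretrained text encoder, and the query-side text (e.g., a first-pass transcript of the input speech) is an assumption about how the speech input is represented on the text side.

```python
import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a pretrained text encoder: hashes character
    trigrams into a fixed-size L2-normalized vector. The paper uses
    real off-the-shelf text embeddings; this keeps the sketch runnable."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        v[hash(t[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def knn_select(query_text: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k pool transcripts most semantically similar to the
    query, by cosine similarity (dot product of normalized embeddings).
    The selected transcripts' (audio, text) pairs would then be prepended
    as in-context examples for the multimodal model."""
    q = toy_embed(query_text)
    E = np.stack([toy_embed(p) for p in pool])   # (len(pool), dim)
    sims = E @ q                                 # cosine similarities
    top = np.argsort(-sims)[:k]                  # indices of k nearest
    return [pool[i] for i in top]
```

In use, `pool` would hold the transcripts of a candidate example set, and the retrieved neighbors supply the in-context demonstrations; swapping `toy_embed` for a real sentence encoder recovers the intended training-free pipeline.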

📝 Abstract
Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children's speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech recognition without fine-tuning models
Selecting effective examples for speech in-context learning
Improving recognition on accented and multilingual speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-Embedding KNN for speech context learning
Semantic context enhances speech recognition without fine-tuning
Works across accented, multilingual and children's speech
Haolong Zheng
University of Illinois at Urbana-Champaign
Yekaterina Yegorova
University of Illinois at Urbana-Champaign
Mark Hasegawa-Johnson
Professor of Electrical and Computer Engineering, University of Illinois
Speech · Audio · Natural Language Processing