🤖 AI Summary
In speech in-context learning (SICL), the lack of systematic methods for selecting in-context examples limits performance in complex speech scenarios. To address this, we propose a fine-tuning-free, semantics-driven example retrieval framework: for the first time, pretrained text embeddings are combined with k-nearest neighbors (KNN) retrieval to achieve cross-modal semantic alignment when selecting in-context examples for SICL. The method requires no training; off-the-shelf text encoders retrieve semantically relevant examples for a given speech input. Evaluated on challenging tasks, including accented English, multilingual speech, and children's speech, it achieves up to 84.7% relative word error rate (WER) reduction over zero-shot baselines. The core contribution is an efficient, general-purpose, plug-and-play paradigm for speech context selection, opening a new pathway for speech understanding in multimodal large language models.
📝 Abstract
Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children's speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
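The retrieval step at the heart of TICL can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: embeddings are shown as plain vectors (in practice they would come from an off-the-shelf text encoder), and the function names and data are hypothetical.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_select_examples(query_embedding, example_pool, k=2):
    """Pick the k in-context examples whose transcript embeddings are
    most semantically similar to the query embedding.

    example_pool: list of (embedding, example) pairs, where embeddings
    would be produced by a pretrained text encoder (assumption: they
    are precomputed offline for the candidate pool).
    """
    ranked = sorted(
        example_pool,
        key=lambda pair: cosine_similarity(query_embedding, pair[0]),
        reverse=True,
    )
    return [example for _, example in ranked[:k]]

# Toy usage with 2-d stand-in embeddings (hypothetical data).
pool = [
    ([0.9, 0.1], "accented-English example A"),
    ([0.0, 1.0], "unrelated example B"),
    ([0.8, 0.2], "accented-English example C"),
]
selected = knn_select_examples([1.0, 0.0], pool, k=2)
print(selected)  # the two examples nearest the query in embedding space
```

The selected examples would then be placed in the model's context window as speech–transcript demonstrations, which is what makes the pipeline plug-and-play: no gradient updates are needed, only embedding lookups.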