🤖 AI Summary
In speech in-context learning (SICL), the lack of systematic methods for selecting in-context examples limits performance in complex speech scenarios. To address this, we propose a fine-tuning-free, semantics-driven example retrieval framework: for the first time, pretrained text embeddings are combined with k-nearest neighbors (KNN) retrieval to achieve cross-modal semantic alignment when selecting in-context examples for SICL. The method requires no training; off-the-shelf text encoders retrieve semantically relevant examples for a given speech input. Evaluated on challenging tasks, including accented English, multilingual speech, and children's speech, it achieves up to 84.7% relative word error rate (WER) reduction over zero-shot baselines. The core contribution is an efficient, general-purpose, plug-and-play paradigm for speech context selection, opening a new pathway for speech understanding in multimodal large language models.
📝 Abstract
Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children's speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
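The retrieval step at the heart of TICL can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: embeddings are shown as plain vectors (in practice they would come from an off-the-shelf text encoder), and the function names and data are hypothetical.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_select_examples(query_embedding, example_pool, k=2):
    """Pick the k in-context examples whose transcript embeddings are
    most semantically similar to the query embedding.

    example_pool: list of (embedding, example) pairs, where embeddings
    would be produced by a pretrained text encoder (assumption: they
    are precomputed offline for the candidate pool).
    """
    ranked = sorted(
        example_pool,
        key=lambda pair: cosine_similarity(query_embedding, pair[0]),
        reverse=True,
    )
    return [example for _, example in ranked[:k]]

# Toy usage with 2-d stand-in embeddings (hypothetical data).
pool = [
    ([0.9, 0.1], "accented-English example A"),
    ([0.0, 1.0], "unrelated example B"),
    ([0.8, 0.2], "accented-English example C"),
]
selected = knn_select_examples([1.0, 0.0], pool, k=2)
print(selected)  # the two examples nearest the query in embedding space
```

The selected examples would then be placed in the model's context window as speech–transcript demonstrations, which is what makes the pipeline plug-and-play: no gradient updates are needed, only embedding lookups.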