🤖 AI Summary
To address the degradation in automatic speech recognition (ASR) accuracy for rare or out-of-vocabulary words, this paper proposes a retrieval-augmented generation (RAG)-based method for automatic context discovery. Unlike computationally expensive large language model (LLM)-driven generation or post-hoc correction paradigms, our approach employs lightweight embedding retrieval to rapidly identify task-relevant contextual information and seamlessly integrate it into the ASR decoding process. Our key contributions include: (i) a speech recognition–oriented context retrieval framework; (ii) joint optimization of semantic vector matching, LLM-guided prompt engineering, and context post-processing; and (iii) efficient, high-precision context injection with minimal computational overhead. Experiments on TED-LIUMv3, Earnings21, and SPGISpeech demonstrate up to a 17% relative word error rate (WER) reduction over the no-context baseline, approaching the performance of oracle context (24.1% WER reduction) and significantly outperforming existing generative context methods.
📝 Abstract
This work investigates retrieval-augmented generation (RAG) as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, with the goal of improving transcription accuracy in the presence of rare or out-of-vocabulary terms. Since identifying the right context automatically remains an open challenge, this work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21, and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (relative) compared to using no context, while the oracle context yields a reduction of up to 24.1%.
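The core idea of embedding-based context discovery can be sketched in a few lines: embed a query (e.g., a first-pass transcript or document metadata) and a pool of candidate context strings, rank candidates by cosine similarity, and pass the top-k hits to the ASR decoder as biasing context. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it uses a toy deterministic hashed bag-of-words embedder where a real system would use a trained sentence-embedding model, and `retrieve_context` is a hypothetical helper name.

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    # Toy embedding: sum a deterministic random vector per token
    # (hashed bag-of-words). A real system would use a trained
    # sentence-embedding model; this is an illustrative stand-in.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest(), 16) % (2**32)
        vec += np.random.default_rng(seed).standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_context(query, candidates, top_k=3):
    """Rank candidate context strings by cosine similarity to the query
    and return the top_k matches for injection into ASR decoding."""
    q = embed(query)
    scored = sorted(((float(q @ embed(c)), c) for c in candidates),
                    reverse=True)
    return [c for _, c in scored[:top_k]]
```

For example, given a first-pass transcript mentioning earnings terminology, candidates that share rare terms with the query rank above unrelated ones, and only those few strings are injected, which keeps the added decoding cost small.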