🤖 AI Summary
To address the degradation in automatic speech recognition (ASR) accuracy for rare or out-of-vocabulary words, this paper proposes a retrieval-augmented generation (RAG)-based method for automatic context discovery. Unlike computationally expensive large language model (LLM)-driven generation or post-hoc correction paradigms, our approach employs lightweight embedding retrieval to rapidly identify task-relevant contextual information and seamlessly integrate it into the ASR decoding process. Our key contributions include: (i) a speech recognition–oriented context retrieval framework; (ii) joint optimization of semantic vector matching, LLM-guided prompt engineering, and context post-processing; and (iii) efficient, high-precision context injection with minimal computational overhead. Experiments on TED-LIUMv3, Earnings21, and SPGISpeech demonstrate up to a 17% relative word error rate (WER) reduction over the no-context baseline, approaching the performance of oracle context (24.1% WER reduction) and significantly outperforming existing generative context methods.
📝 Abstract
This work investigates retrieval-augmented generation (RAG) as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, with the goal of improving transcription accuracy in the presence of rare or out-of-vocabulary terms. Since identifying the right context automatically remains an open challenge, this work proposes an efficient embedding-based retrieval approach for automatic context discovery in ASR. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on the TED-LIUMv3, Earnings21, and SPGISpeech datasets demonstrate that the proposed approach reduces WER by up to 17% (relative) compared to using no context, while the oracle context yields a reduction of up to 24.1%.
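The core idea of embedding-based context discovery can be sketched in a few lines: embed a query (e.g., a first-pass transcript or document metadata) and a pool of candidate context strings, rank candidates by cosine similarity, and pass the top-k hits to the ASR decoder as biasing context. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it uses a toy deterministic hashed bag-of-words embedder where a real system would use a trained sentence-embedding model, and `retrieve_context` is a hypothetical helper name.

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    # Toy embedding: sum a deterministic random vector per token
    # (hashed bag-of-words). A real system would use a trained
    # sentence-embedding model; this is an illustrative stand-in.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest(), 16) % (2**32)
        vec += np.random.default_rng(seed).standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_context(query, candidates, top_k=3):
    """Rank candidate context strings by cosine similarity to the query
    and return the top_k matches for injection into ASR decoding."""
    q = embed(query)
    scored = sorted(((float(q @ embed(c)), c) for c in candidates),
                    reverse=True)
    return [c for _, c in scored[:top_k]]
```

For example, given a first-pass transcript mentioning earnings terminology, candidates that share rare terms with the query rank above unrelated ones, and only those few strings are injected, which keeps the added decoding cost small.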