Retrieval Augmented Generation based context discovery for ASR

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation in automatic speech recognition (ASR) accuracy for rare or out-of-vocabulary words, this paper proposes a retrieval-augmented generation (RAG)-based method for automatic context discovery. Unlike computationally expensive large language model (LLM)-driven generation or post-hoc correction paradigms, our approach employs lightweight embedding retrieval to rapidly identify task-relevant contextual information and seamlessly integrate it into the ASR decoding process. Our key contributions include: (i) a speech recognition–oriented context retrieval framework; (ii) joint optimization of semantic vector matching, LLM-guided prompt engineering, and context post-processing; and (iii) efficient, high-precision context injection with minimal computational overhead. Experiments on TED-LIUMv3, Earnings21, and SPGISpeech demonstrate up to a 17% relative word error rate (WER) reduction over the no-context baseline—approaching the performance of oracle context (24.1% WER reduction) and significantly outperforming existing generative context methods.

📝 Abstract
This work investigates retrieval augmented generation (RAG) as an efficient strategy for automatic context discovery in context-aware Automatic Speech Recognition (ASR) systems, aiming to improve transcription accuracy in the presence of rare or out-of-vocabulary terms. Identifying the right context automatically remains an open challenge, and this work proposes an efficient embedding-based retrieval approach to address it. To contextualize its effectiveness, two alternatives based on large language models (LLMs) are also evaluated: (1) LLM-based context generation via prompting, and (2) post-recognition transcript correction using LLMs. Experiments on TED-LIUMv3, Earnings21, and SPGISpeech demonstrate that the proposed approach reduces WER by up to 17% (relative) compared to using no context, while oracle context yields a reduction of up to 24.1%.
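The embedding-based retrieval idea described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the toy hashing embedder stands in for whatever sentence-embedding model the authors use, and the corpus entries and `retrieve_context` helper are assumptions made for the example.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy character-trigram hashing embedding (a stand-in for a real
    sentence-embedding model; only meant to make the sketch runnable)."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve_context(first_pass: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus entries most similar (cosine) to a first-pass
    transcript; these would then be injected into ASR decoding as context."""
    q = embed(first_pass)
    scores = [float(q @ embed(doc)) for doc in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

# Hypothetical context corpus of domain entries (invented for illustration).
corpus = [
    "TED-LIUM corpus of TED talk transcriptions",
    "Earnings21 earnings-call speech benchmark",
    "SPGISpeech financial-domain speech dataset",
    "quarterly revenue guidance for fiscal year",
]
print(retrieve_context("the company raised its revenue guidance", corpus, k=1))
```

The key design point, per the summary, is that this retrieval step is a lightweight vector match rather than an LLM generation call, which is what keeps the computational overhead of context injection small.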
Problem

Research questions and friction points this paper is trying to address.

Improving ASR transcription accuracy for rare terms
Automatically discovering relevant context for speech recognition
Evaluating retrieval-based context discovery against LLM alternatives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval augmented generation for automatic context discovery
Embedding-based retrieval approach for ASR contextualization
LLM-based context generation and transcript correction alternatives
Dimitrios Siskos
Information Technologies Institute, Center for Research and Technology Hellas, Thessaloniki, Greece
Stavros Papadopoulos
Information Technologies Institute, Center for Research and Technology Hellas, Thessaloniki, Greece
Pablo Peso Parada
AI Researcher - Samsung Research UK
signal processing · machine learning · open source hardware · audio · speech
Jisi Zhang
Samsung Research UK
speech separation · speech recognition
Karthikeyan Saravanan
Samsung Electronics R&D Institute UK (SRUK), London, United Kingdom
Anastasios Drosou
CERTH-ITI