🤖 AI Summary
To address domain mismatch in automatic speech recognition (ASR) caused by the unavailability of in-domain training data, this paper proposes Retrieval-Augmented ASR (RAG-ASR): a zero-shot inference-time framework that dynamically retrieves domain-relevant textual knowledge and integrates it into the decoding process, without requiring any target-domain labeled data or model fine-tuning. This work is the first to adapt the Retrieval-Augmented Generation (RAG) paradigm to ASR, establishing a joint modeling architecture comprising a speech encoder and an LLM-based decoder, where domain knowledge is injected at inference time via vector-based retrieval. Evaluated on the CSJ benchmark, RAG-ASR achieves state-of-the-art performance and significantly improves cross-domain recognition accuracy. Its core contribution lies in establishing a "zero-shot domain adaptation" paradigm, overcoming the conventional reliance on domain-specific data for fine-tuning or annotation.
📝 Abstract
Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts to the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.
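The retrieve-then-prompt flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words embedding, the use of a first-pass hypothesis as the retrieval query, and the prompt format are all assumptions (a real system would use a neural text embedder and the paper's actual speech-encoder/LLM-decoder interface).

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; stands in for a neural text embedder.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    # Rank locally stored domain documents by similarity to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(first_pass_hypothesis, documents, k=2):
    # Inject the retrieved domain text into the prompt given to the LLM
    # decoder, alongside the acoustic features (not shown here).
    retrieved = retrieve(first_pass_hypothesis, documents, k)
    context = "\n".join(f"- {d}" for d in retrieved)
    return ("Domain context:\n" + context +
            "\nTranscribe the speech, using the context for domain terms.\n")

# Hypothetical local domain documents, e.g. slides for an academic talk.
domain_docs = [
    "The corpus covers academic presentation speech in Japanese.",
    "Morpheme-based evaluation is standard for the CSJ benchmark.",
    "Cooking recipes often mention simmering and seasoning.",
]
print(build_prompt("academic presentation speech corpus", domain_docs))
```

Because retrieval runs only at inference time against locally stored documents, swapping domains is a matter of swapping the document store; no labels or fine-tuning are involved, which is the property the abstract emphasizes.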