VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high latency introduced by retrieval-augmented generation (RAG) in real-time spoken dialogue systems by proposing the first dual-agent RAG architecture tailored for this setting. The approach decouples retrieval and generation: a background “slow-thinking” agent proactively prefetches relevant documents based on dialogue context prediction and stores them in a semantic cache, while a foreground “fast-speaking” agent generates responses exclusively from this cache. Leveraging predictive prefetching and an efficient caching mechanism, the system bypasses vector database queries entirely upon cache hits, achieving sub-millisecond response times. Experimental results demonstrate that this method substantially alleviates the latency bottleneck of conventional RAG pipelines in real-time voice applications.
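The cache-hit fast path described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `SemanticCache`, the toy bag-of-words `embed` function, the `0.4` similarity threshold, and the `vector_db_search` fallback are all assumptions standing in for the paper's FAISS index and real sentence embeddings. The key behavior it demonstrates is that a sufficiently similar query is answered from the cache and never touches the vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Illustrative semantic cache; threshold 0.4 is an arbitrary choice."""
    def __init__(self, threshold: float = 0.4):
        self.entries = []  # list of (query embedding, document chunk)
        self.threshold = threshold

    def put(self, query: str, chunk: str) -> None:
        self.entries.append((embed(query), chunk))

    def get(self, query: str):
        q = embed(query)
        best, best_sim = None, 0.0
        for emb, chunk in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = chunk, sim
        return best if best_sim >= self.threshold else None

def retrieve(query: str, cache: SemanticCache, vector_db_search) -> str:
    hit = cache.get(query)
    if hit is not None:
        return hit                    # cache hit: vector DB bypassed entirely
    chunk = vector_db_search(query)   # slow path, only on a miss
    cache.put(query, chunk)
    return chunk
```

On a hit, `retrieve` returns in a single in-memory pass over cached entries; only a miss pays the cost of a vector-database query, which is the latency asymmetry the paper exploits.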

📝 Abstract
We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
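A minimal sketch of the Slow Thinker / Fast Talker decoupling described in the abstract. The class names mirror the paper's agents, but everything else here is an assumption for illustration: topic prediction is a keyword heuristic in place of the paper's LLM-based follow-up prediction, and a plain dict with a hypothetical `DOC_STORE` stands in for the FAISS-backed cache and vector database.

```python
# Stand-in for the document store behind the vector database (hypothetical data).
DOC_STORE = {
    "pricing": ["Plan A costs $10/month.", "Plan B costs $25/month."],
    "support": ["Support is available 24/7 via chat."],
}

class SlowThinker:
    """Background agent: predicts likely follow-up topics and prefetches chunks."""

    def predict_topics(self, dialogue: list) -> list:
        # Keyword heuristic standing in for LLM-based topic prediction.
        text = " ".join(dialogue).lower()
        return [topic for topic in DOC_STORE if topic in text]

    def prefetch(self, dialogue: list, cache: dict) -> None:
        for topic in self.predict_topics(dialogue):
            cache[topic] = DOC_STORE[topic]  # simulated FAISS-backed semantic cache

class FastTalker:
    """Foreground agent: reads only from the cache, never the vector database."""

    def respond(self, topic: str, cache: dict) -> str:
        chunks = cache.get(topic)
        if chunks is None:
            # Cache miss: defer rather than issue a blocking retrieval.
            return "Let me get back to you on that."
        return chunks[0]
```

In the real system the Slow Thinker would run continuously alongside the conversation; here `prefetch` is called synchronously only to keep the sketch self-contained.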
Problem

Research questions and friction points this paper is trying to address.

RAG
latency
real-time voice agents
retrieval bottleneck
dual-agent architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-agent architecture
RAG latency optimization
semantic cache
prefetching
real-time voice agents