Predictive Prefetching for Retrieval-Augmented Generation

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing asynchronous retrieval-augmented generation (RAG) systems coordinate retrieval and generation with heuristics, which adapt poorly when information needs shift during decoding or across domains, limiting both efficiency and flexibility. To address this, this work proposes an asynchronous retrieval framework that uses semantic precursors, signals that emerge early in the generation process, to explicitly predict both when to retrieve and what to retrieve, so that documents can be prefetched before they are needed. The framework combines a retrieval predictor, a context monitor, and a query generator, which jointly model these precursors and the evolving information requirements. Across multiple benchmarks, the approach reduces end-to-end latency by up to 43.5% and improves time-to-first-token by up to 62.4%, while maintaining answer quality comparable to that of synchronous RAG systems.

📝 Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume that information demands remain stable during decoding, an assumption that often breaks in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components: a retrieval predictor, a context monitor, and a query generator. It does so by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.
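
To make the division of labor among the three components concrete, the following is a minimal Python sketch of predictive prefetching. It is not the paper's implementation: the abstract does not specify how the predictor, monitor, or query generator work, so every name here (`ContextMonitor`, `should_prefetch`, `make_query`, `retrieve`), the rolling-mean uncertainty signal, and the 0.5 threshold are assumptions chosen purely for illustration. The only idea it demonstrates is that retrieval is launched as a background task before its results are needed, while decoding continues.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ContextMonitor:
    """Hypothetical context monitor: keeps tokens and per-token uncertainty."""
    window: int = 4
    tokens: list = field(default_factory=list)
    scores: list = field(default_factory=list)

    def update(self, token, uncertainty):
        self.tokens.append(token)
        self.scores.append(uncertainty)

    def recent_mean(self):
        # Mean uncertainty over the last `window` tokens (0.0 if empty).
        recent = self.scores[-self.window:]
        return sum(recent) / len(recent) if recent else 0.0


def should_prefetch(monitor, threshold=0.5):
    """Stand-in for the retrieval predictor (assumed rule, not the paper's)."""
    return monitor.recent_mean() >= threshold


def make_query(monitor, k=3):
    """Stand-in for the query generator: the last k tokens as a query."""
    return " ".join(monitor.tokens[-k:])


async def retrieve(query, delay=0.01):
    """Hypothetical retriever; the sleep simulates retrieval latency."""
    await asyncio.sleep(delay)
    return f"docs for: {query}"


async def generate(stream, threshold=0.5):
    """Consume (token, uncertainty) pairs, prefetching in the background."""
    monitor = ContextMonitor()
    pending = None
    fetched = []
    for token, uncertainty in stream:
        monitor.update(token, uncertainty)
        if pending is None and should_prefetch(monitor, threshold):
            # Start retrieval without blocking the decoding loop.
            pending = asyncio.create_task(retrieve(make_query(monitor)))
        await asyncio.sleep(0)  # yield so the retrieval task can progress
        if pending is not None and pending.done():
            fetched.append(pending.result())
            pending = None
            monitor.scores.clear()  # reset the signal after a retrieval
    if pending is not None:
        fetched.append(await pending)  # collect a still-running prefetch
    return monitor.tokens, fetched
```

In this sketch, a stream such as `[("The", 0.1), ("capital", 0.2), ("of", 0.9), ("Atlantis", 0.9), ("is", 0.9)]` triggers one prefetch once the rolling uncertainty crosses the threshold, and decoding of the next token proceeds while that retrieval is in flight. A real system would replace the threshold rule with a learned predictor and the toy uncertainty scores with signals taken from the model's generation dynamics.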

Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation

latency

asynchronous retrieval

information demand

predictive prefetching

Innovation

Methods, ideas, or system contributions that make the work stand out.

predictive prefetching

asynchronous retrieval

retrieval-augmented generation