Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

📅 2025-03-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
RAG systems suffer from high vector retrieval overhead, significantly degrading end-to-end inference latency. To address this, we propose Proximity, a novel approximate key-value caching mechanism that introduces cross-query retrieval result reuse driven by query similarity, breaking the conventional assumption of query independence. Methodologically, Proximity employs cosine-similarity-based clustering, dynamic threshold matching, and a lightweight key hashing index for efficient cache lookup. We evaluate it on MMLU and MedRAG benchmarks. Experiments show that Proximity reduces retrieval latency by up to 59% while preserving answer accuracy; concurrently, vector database load is substantially alleviated. Furthermore, we quantitatively characterize the Pareto frontier between recall and speedup, revealing the inherent trade-off governed by the similarity threshold. This work establishes the first principled framework for similarity-aware, cache-accelerated RAG inference.

πŸ“ Abstract
Retrieval-augmented generation (RAG) enhances the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, reducing reliance on expensive vector database lookups. We evaluate Proximity on the MMLU and MedRAG benchmarks, demonstrating that it significantly improves retrieval efficiency while maintaining response accuracy. Proximity reduces retrieval latency by up to 59% while maintaining accuracy and lowers the computational burden on the vector database. We also experiment with different similarity thresholds and quantify the trade-off between speed and recall. Our work shows that approximate caching is a viable and effective strategy for optimizing RAG-based systems.
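The core idea from the abstract — reuse previously retrieved documents when a new query is similar enough to a cached one — can be sketched as below. This is a minimal illustration, not the paper's implementation; the class name `ProximityCache`, the FIFO eviction, and the linear scan over cached embeddings are all assumptions made for clarity.

```python
import numpy as np

class ProximityCache:
    """Minimal sketch of an approximate key-value cache for RAG retrieval.

    Keys are query embeddings; values are previously retrieved document
    lists. A lookup succeeds when the cosine similarity between the new
    query and a cached query meets `threshold`. Illustrative only, not
    the paper's actual data structure.
    """

    def __init__(self, threshold=0.9, capacity=1000):
        self.threshold = threshold
        self.capacity = capacity
        self.keys = []    # cached query embeddings
        self.values = []  # retrieved document lists

    @staticmethod
    def _cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query_emb):
        """Return the documents cached for the most similar query, or None."""
        best, best_sim = None, self.threshold
        for k, v in zip(self.keys, self.values):
            sim = self._cosine(query_emb, k)
            if sim >= best_sim:
                best, best_sim = v, sim
        return best

    def put(self, query_emb, docs):
        if len(self.keys) >= self.capacity:  # simple FIFO eviction (assumed)
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(np.asarray(query_emb, dtype=float))
        self.values.append(docs)

def retrieve_with_cache(query_emb, cache, vector_db_lookup):
    """Check the cache first; fall back to the vector database on a miss."""
    hit = cache.get(query_emb)
    if hit is not None:
        return hit
    docs = vector_db_lookup(query_emb)  # expensive vector DB retrieval
    cache.put(query_emb, docs)
    return docs
```

On a cache hit the vector database is never touched, which is the source of the latency reduction reported above; the `threshold` parameter controls how aggressively results are reused.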
Problem

Research questions and friction points this paper is trying to address.

Vector database lookups are computationally expensive and dominate end-to-end RAG inference latency.
Queries are conventionally treated as independent, so retrieval work is never reused across similar queries.
Approximate caching must balance retrieval speed against answer accuracy and recall.
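The speed/recall trade-off governed by the similarity threshold can be illustrated with synthetic embeddings: a lower threshold serves more queries from the cache (more speedup) but accepts looser matches. The data, dimensions, and noise scale below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Synthetic setup: 200 cached queries plus noisy near-duplicates,
# standing in for paraphrased user queries.
base = rng.normal(size=(200, 64))
paraphrases = base + rng.normal(scale=0.3, size=base.shape)

def hit_rate(threshold):
    """Fraction of paraphrased queries that clear the similarity
    threshold against their cached counterpart (served from cache)."""
    hits = sum(cosine(p, b) >= threshold for p, b in zip(paraphrases, base))
    return hits / len(base)

for t in (0.80, 0.90, 0.95, 0.99):
    print(f"threshold={t:.2f}  cache hit rate={hit_rate(t):.2f}")
```

Raising the threshold monotonically lowers the hit rate, trading cache speedup for stricter (higher-fidelity) matches — the same trade-off the paper quantifies on real benchmarks.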
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Proximity, an approximate key-value cache.
Optimizes the RAG workflow by reusing results of similar queries.
Reduces retrieval latency by up to 59%.
Shai Bergman
Huawei Research, Zurich, Switzerland

Zhang Ji
Huawei Research, Zurich, Switzerland

Anne-Marie Kermarrec
Professor, EPFL
Distributed systems, social networks, gossip protocols

Diana Petrescu
EPFL, Lausanne, Switzerland

Rafael Pires
EPFL
Computer Systems, Distributed Systems, Data Security and Privacy, Distributed ML

Mathis Randl
EPFL, Lausanne, Switzerland

Martijn de Vos
Postdoctoral Researcher, EPFL
Distributed systems, decentralized systems, decentralized learning