Cache Mechanism for Agent RAG Systems

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high storage overhead and retrieval latency incurred by large-scale knowledge bases in Retrieval-Augmented Generation (RAG) for LLM-based agents, this paper proposes ARC—the first unsupervised, dynamic caching mechanism tailored for agent-level RAG. ARC jointly models the geometric structure of the embedding space and historical query distributions to automatically identify highly relevant passages without supervision, enabling online cache construction, compact maintenance, and adaptive updates. Its core components include a similarity- and frequency-aware cache selection strategy, a lightweight indexing scheme, and a low-latency retrieval design. Experiments across three benchmark datasets demonstrate that ARC compresses cache size to just 0.015% of the original corpus while achieving up to 79.8% answer coverage and reducing average retrieval latency by 80%. This work establishes the first systematic solution to efficient cache management in agent-centric RAG scenarios.
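The paper itself includes no code, but the "similarity- and frequency-aware cache selection strategy" can be sketched. Below is a minimal illustration that blends each passage's best cosine similarity to recent queries with its historical hit frequency; the function names, the linear blend, and the weight `alpha` are assumptions for illustration, not ARC's actual formulation:

```python
import numpy as np

def cache_score(passage_embs, query_embs, hit_counts, alpha=0.7):
    """Score passages for cache admission by blending embedding
    similarity to recent queries with historical hit frequency.
    (Illustrative only; not ARC's actual scoring rule.)"""
    # Cosine similarity of every passage to every recent query.
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sim = p @ q.T                      # shape: (n_passages, n_queries)
    max_sim = sim.max(axis=1)          # best-matching query per passage
    # Normalize hit counts to [0, 1] so the two terms are comparable.
    freq = hit_counts / max(hit_counts.max(), 1)
    return alpha * max_sim + (1 - alpha) * freq

def select_cache(passage_embs, query_embs, hit_counts, k):
    """Keep the top-k passages by blended score."""
    scores = cache_score(passage_embs, query_embs, hit_counts)
    return np.argsort(scores)[::-1][:k]
```

In this sketch, a small `k` relative to the corpus size would produce the kind of aggressive compression the paper reports (cache at 0.015% of the original corpus).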

📝 Abstract
Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance, agent-level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent's needs, remains underexplored. Therefore, we introduce ARC (Agent RAG Cache Mechanism), a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high-relevance cache. Comprehensive experiments on three retrieval datasets demonstrate that ARC reduces storage requirements to 0.015% of the original corpus while achieving up to a 79.8% has-answer rate and reducing average retrieval latency by 80%. Our results demonstrate that ARC can drastically enhance efficiency and effectiveness in RAG-powered LLM agents.
Problem

Research questions and friction points this paper is trying to address.

Managing agent-level cache for RAG systems dynamically
Constructing compact relevant corpora for individual agent needs
Reducing storage requirements while maintaining retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic cache management for agent-specific corpora
Annotation-free framework using query patterns and embedding geometry
Reduces storage needs while improving retrieval speed and accuracy
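The online construction/maintenance loop described above (serve from a small cache when possible, fall back to the full corpus otherwise, and admit newly retrieved passages) can be sketched as follows. The class, the similarity threshold, and the least-used eviction policy are illustrative assumptions, not ARC's published design:

```python
import numpy as np

class RagCache:
    """Toy dynamic RAG cache: answer from a small cached corpus when a
    query is similar enough to a cached item, otherwise fall back to
    full-corpus retrieval and admit the result. (Illustrative sketch;
    threshold and eviction policy are assumptions, not ARC's.)"""

    def __init__(self, capacity, threshold=0.8):
        self.capacity = capacity
        self.threshold = threshold
        self.embs = []      # cached passage embeddings
        self.passages = []  # cached passage texts
        self.hits = []      # per-item hit counts

    def _best(self, q):
        """Return (index, cosine similarity) of the closest cached item."""
        if not self.embs:
            return None, -1.0
        E = np.stack(self.embs)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
        i = int(np.argmax(sims))
        return i, float(sims[i])

    def retrieve(self, q_emb, fallback):
        """fallback(q_emb) -> (text, emb) retrieved from the full corpus."""
        i, sim = self._best(q_emb)
        if sim >= self.threshold:
            self.hits[i] += 1
            return self.passages[i]      # cache hit: low-latency path
        text, emb = fallback(q_emb)      # cache miss: full retrieval
        self._admit(text, emb)
        return text

    def _admit(self, text, emb):
        if len(self.passages) >= self.capacity:
            victim = int(np.argmin(self.hits))   # evict least-used item
            for lst in (self.passages, self.embs, self.hits):
                lst.pop(victim)
        self.passages.append(text)
        self.embs.append(emb)
        self.hits.append(0)
```

With repeated similar queries, most lookups bypass the full index entirely, which is the mechanism behind the reported latency reduction.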
👥 Authors
Shuhang Lin
Rutgers University, PhD student (NLP)
Zhencan Peng
Rutgers University
Lingyao Li
Assistant Professor, School of Information, University of South Florida (Generative AI, Social Computing, Urban Computing, Health Informatics)
Xiao Lin
University of Illinois Urbana–Champaign
Xi Zhu
Rutgers University
Yongfeng Zhang
Rutgers University