Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high time-to-first-token (TTFT) and low throughput in retrieval-augmented generation (RAG)-enhanced large language model (LLM) inference caused by long-context processing, this paper proposes a multi-instance collaborative inference architecture leveraging a shared-disk key-value (KV) cache. The method introduces two core innovations: (1) the first formal modeling of document locality in RAG queries, enabling proactive KV cache pre-generation and cross-instance sharing; and (2) a queuing-delay-aware cache scheduling and prewarming strategy to optimize performance under resource constraints. Evaluated on a single-machine setup with 2 GPUs and 1 CPU, the approach achieves 15–71% higher throughput and reduces TTFT by 12–65% compared to baseline methods, significantly enhancing the efficiency and scalability of multi-instance RAG serving.

Technology Category

Application Category

📝 Abstract
Recent large language models (LLMs) face increasing inference latency as input context length and model size continue to grow. In particular, the retrieval-augmented generation (RAG) technique, which enhances LLM responses by incorporating external knowledge, exacerbates this issue by significantly increasing the number of input tokens. This expansion in token length leads to a substantial rise in computational overhead, particularly during the prefill stage, resulting in prolonged time-to-first-token (TTFT). To address this issue, this paper proposes a method to reduce TTFT by leveraging a disk-based key-value (KV) cache to lessen the computational burden during the prefill stage. We also introduce a disk-based shared KV cache management system, called Shared RAG-DCache, for multi-instance LLM RAG service environments. This system, together with an optimal system configuration, improves both throughput and latency under given resource constraints. Shared RAG-DCache exploits the locality of documents related to user queries in RAG, as well as the queueing delay in LLM inference services. It proactively generates and stores disk KV caches for query-related documents and shares them across multiple LLM instances to enhance inference performance. In experiments on a single host equipped with 2 GPUs and 1 CPU, Shared RAG-DCache achieved a 15~71% increase in throughput and up to a 12~65% reduction in latency, depending on the resource configuration.
Problem

Research questions and friction points this paper is trying to address.

Reduces inference latency in RAG-powered LLMs
Manages shared disk KV cache for multi-instance environments
Improves throughput and latency under resource constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disk-based KV cache reduces prefill computational overhead
Shared RAG-DCache manages multi-instance KV cache efficiently
Proactive disk KV generation enhances inference performance
Hyungwoo Lee
Hyungwoo Lee
Ajou University
Complex OxidesNano-electronicsLow-frequency Noise
K
Kihyun Kim
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
J
Jinwoo Kim
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
J
Jungmin So
Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea
M
Myung-Hoon Cha
ETRI, Daejeon, Republic of Korea
H
Hong-Yeon Kim
ETRI, Daejeon, Republic of Korea
J
James J. Kim
Soteria Inc.
Youngjae Kim
Youngjae Kim
Professor, Department of Computer Science and Engineering, Sogang University
Operating SystemFile and Storage SystemDistributed System