ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation

📅 2026-01-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the computational bottleneck in the prefill phase of long-context retrieval-augmented generation (RAG), where existing methods under limited recomputation budgets prioritize globally salient but query-irrelevant tokens, leading to critical information loss. To overcome this, the authors propose a query-driven KV cache reuse mechanism that dynamically evaluates token relevance to the user query and integrates a two-stage recomputation pipeline with cross-layer attention fusion to selectively recompute high-value tokens. Their approach introduces, for the first time, a query-aware token prioritization scheme that effectively mitigates the “crowding-out” effect. With only 20% recomputation, it preserves 96%–101% of full prefill accuracy and achieves substantial gains—8.8%–24.9% on RULER and 18.6%–50.9% on LongBench—outperforming state-of-the-art methods such as CacheBlend, EPIC, and KVShare.

📝 Abstract
The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-computed KV caches of the RAG documents retrieved for a user query and reprocess selected tokens to recover cross-attention between these pre-computed KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline that fuses layer-wise attention metrics into a high-utility token set. By dedicating the recomputation budget to bridging the informational gap between the retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation shows that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).
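The core idea of query-driven prioritization can be sketched in a simplified form: rank the retrieved-context tokens by their similarity to the user query and spend the limited recomputation budget only on the top-ranked ones. The function name, the use of cosine similarity over token embeddings, and the parameters below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def select_recompute_tokens(token_embs, query_emb, budget_ratio=0.2):
    """Illustrative query-aware token selection (assumed, not ProphetKV's code).

    Scores each context token by cosine similarity to the user-query
    embedding and returns the indices of the top `budget_ratio` fraction,
    i.e., the tokens whose KV entries would be recomputed.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    scores = t @ q  # relevance of each token to the query

    # Spend the budget on the most query-relevant tokens (at least one).
    k = max(1, int(budget_ratio * len(scores)))
    return np.argsort(scores)[::-1][:k]
```

Under a 20% budget, such a selector would keep query-relevant tokens in the recomputation set even when other tokens have higher global attention mass, which is the "crowding-out" failure mode the paper attributes to prior selection criteria.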
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
KV Cache Reuse
Crowding-out Effect
Selective Recomputation
Long-context Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
KV Cache Reuse
Selective Recomputation
Query-Driven Attention
Efficient Inference
Shihao Wang
Harbin Institute of Technology, Shenzhen
Jiahao Chen
Harbin Institute of Technology, Shenzhen
Yanqi Pan
Harbin Institute of Technology, Shenzhen
Hao Huang
Department of Social Sciences, Illinois Institute of Technology
Foreign direct investment and trade; housing and built environment; GIS
Yichen Hao
Harbin Institute of Technology, Shenzhen
Xiangyu Zou
Harbin Institute of Technology, Shenzhen
Wen Xia
Harbin Institute of Technology, Shenzhen
Storage systems; data deduplication; data compression; cloud storage
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
Photoemission; superconductivity; cuprate; HTSC; time-resolved
Chongyang Qiu
Beijing Yanrong Technology Co., Ltd.
Pengfei Wang
Beijing Yanrong Technology Co., Ltd.