ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation

📅 2026-01-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the computational bottleneck in the prefill phase of long-context retrieval-augmented generation (RAG). Under a limited recomputation budget, existing KV cache reuse methods prioritize globally salient but query-irrelevant tokens, which loses information critical to answering the query. To overcome this, the authors propose ProphetKV, a query-driven KV cache reuse mechanism that dynamically scores each token's relevance to the user query and uses a two-stage recomputation pipeline that fuses layer-wise attention metrics to pick high-value tokens for recomputation. This query-aware prioritization mitigates the “crowding-out” effect. With a 20% recomputation ratio, ProphetKV retains 96%–101% of full-prefill accuracy and improves accuracy by 8.8%–24.9% on RULER and 18.6%–50.9% on LongBench over state-of-the-art methods such as CacheBlend, EPIC, and KVShare.

📝 Abstract
The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble the pre-calculated KV caches of the RAG documents retrieved for a user query and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental "crowding-out effect" in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over the state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare).
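
The crowding-out effect and the query-driven alternative can be illustrated with a toy example. The sketch below is not the paper's implementation: the function names, the use of a plain sum to fuse layer-wise attention, and all numbers are assumptions chosen only to show how a fixed 20% budget is spent differently under a query-independent score and under a query-driven score.

```python
from typing import List


def top_k_indices(scores: List[float], budget_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring tokens within the budget."""
    budget = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


def select_by_query(
    attn_per_layer: List[List[List[float]]],  # [layer][query_token][context_token]
    budget_ratio: float,
) -> List[int]:
    """Fuse query-to-context attention across layers, then take the top tokens.

    Assumption: fusion is a simple sum over layers and query tokens.
    """
    num_context = len(attn_per_layer[0][0])
    fused = [0.0] * num_context
    for layer in attn_per_layer:
        for query_row in layer:
            for j, weight in enumerate(query_row):
                fused[j] += weight
    return top_k_indices(fused, budget_ratio)


# Hypothetical query-independent salience for 10 context tokens:
# tokens 0 and 1 look important regardless of the query.
global_salience = [0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.1, 0.1]

# Hypothetical attention from 2 query tokens to the same 10 context tokens,
# for 2 layers. The query mostly attends to tokens 6 and 7.
attn = [
    [[0.05, 0.05, 0, 0, 0, 0, 0.5, 0.3, 0.05, 0.05],
     [0.05, 0.05, 0, 0, 0, 0, 0.4, 0.4, 0.05, 0.05]],
    [[0.1, 0.0, 0, 0, 0, 0, 0.3, 0.5, 0.1, 0.0],
     [0.0, 0.1, 0, 0, 0, 0, 0.6, 0.2, 0.0, 0.1]],
]

print(top_k_indices(global_salience, 0.2))  # [0, 1]
print(select_by_query(attn, 0.2))           # [6, 7]
```

In this toy setting the query-independent score spends the whole budget on tokens 0 and 1, while the query-driven score spends it on tokens 6 and 7, the ones the query actually attends to.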
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
KV Cache Reuse
Crowding-out Effect
Selective Recomputation
Long-context Inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
KV Cache Reuse
Selective Recomputation
Query-Driven Attention
Efficient Inference
🔎 Similar Papers
2024-05-26
Proceedings of the Twentieth European Conference on Computer Systems
Citations: 7