CacheClip: Accelerating RAG with Effective KV Cache Reuse

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
RAG systems suffer from sharply increased time-to-first-token (TTFT) due to long inputs, and existing KV cache reuse methods fail to balance efficiency and generation quality: prefix caching relies on identical prefixes, which are rare in RAG, while precomputation-based approaches degrade quality because they ignore cross-block attention and duplicate attention sinks. This paper proposes CacheClip, a framework that introduces a lightweight auxiliary model to predict the tokens critical for restoring cross-block attention, a shared-prefix mechanism to eliminate redundant sinks, and a locality-aware grouped update strategy. Evaluated on NIAH and LongBench, CacheClip retains 94.8% and 85.0% of full-attention accuracy, respectively (outperforming APE and CacheBlend by 25.2% and 35.1%), and achieves up to a 1.92× speedup in the prefill phase.

📝 Abstract
Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit last-layer attention distributions similar to those of primary LLMs (the target models for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention and thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is finetuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92× in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.
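The selection step in technique (1) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of NumPy, and the choice to score each context token by summing the attention it receives over query positions are assumptions, and the finetuning of the auxiliary model is omitted.

```python
import numpy as np

def select_recompute_tokens(aux_attention, recomp_ratio=0.2):
    """Pick context tokens whose KV entries should be recomputed.

    aux_attention: (num_queries, num_context_tokens) last-layer attention
    weights taken from a small auxiliary model run over the concatenated
    retrieved chunks. Highly attended tokens are assumed to carry the
    inter-chunk dependencies that independently precomputed caches miss.
    """
    # Aggregate the attention each context token receives across queries
    # (an illustrative scoring choice, not necessarily the paper's).
    token_scores = aux_attention.sum(axis=0)
    k = max(1, int(recomp_ratio * token_scores.shape[0]))
    # Take the k highest-scoring tokens; sort by position so the primary
    # model can update its KV cache in order.
    top_k = np.argsort(token_scores)[-k:]
    return np.sort(top_k)
```

With recomp_ratio=0.2 over 50 context tokens, this returns the 10 token positions the primary model would selectively recompute.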
Problem

Research questions and friction points this paper is trying to address.

Accelerating RAG systems by reducing time-to-first-token bottlenecks
Improving KV cache reuse quality while maintaining generation accuracy
Addressing inter-chunk attention loss in cross-chunk reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a lightweight auxiliary LLM to guide selection of tokens for KV cache recomputation
Shares prefixes across chunks to remove redundant attention sinks
Employs a grouping strategy to maintain local coherence during partial KV cache updates
🔎 Similar Papers
2024-10-04 · arXiv.org · Citations: 1
Authors: Bin Yang (Intel Corporation, Shanghai, China), Qiuyu Leng (Intel Corporation, Shanghai, China), Jun Zeng (University of California, Berkeley, Robotics), Zhenhua Wu (Intel Corporation, Shanghai, China)