CTKVR: KV Cache Retrieval for Long-Context LLMs via Centroid-then-Token Indexing

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead and access latency of KV caches in long-context LLM inference, this paper proposes a centroid-guided two-stage KV retrieval paradigm. In the first stage, leveraging the high similarity among adjacent query vectors after RoPE transformation, a lightweight block-level centroid index is constructed. In the second stage, precise token-level retrieval is performed within candidate blocks. The method introduces a novel CPU-GPU collaborative indexing and search mechanism with minimal overhead, enabling dynamic KV cache compression and selective retrieval. Evaluated on 96K-context benchmarks, it achieves 3× and 4× throughput improvements for Llama-3-8B and Yi-9B, respectively, with <1% accuracy degradation—significantly outperforming existing block-level and token-level approaches.
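The two-stage idea described above can be sketched in a few lines. The following is an illustrative toy sketch, not the paper's implementation: keys are grouped into fixed-size blocks, each block is summarized by its mean key vector (a stand-in for the paper's centroid index), the query first shortlists candidate blocks, and token-level scoring then runs only inside those blocks. All names, block sizes, and the use of a simple mean centroid are assumptions for illustration.

```python
# Hedged sketch of centroid-then-token KV retrieval (assumed details:
# mean-of-keys centroids, dot-product scoring, fixed block size).
import numpy as np

def centroid_then_token_retrieval(query, keys, block_size=4, n_blocks=2, top_k=3):
    """query: (d,), keys: (n, d). Returns indices of the selected tokens."""
    n, _ = keys.shape
    num_blocks = (n + block_size - 1) // block_size
    # Stage 1: block-level centroid index -> shortlist candidate blocks.
    centroids = np.stack([
        keys[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(num_blocks)
    ])
    block_scores = centroids @ query
    candidate_blocks = np.argsort(block_scores)[::-1][:n_blocks]
    # Stage 2: token-level refinement, scoring only tokens in those blocks.
    candidate_tokens = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in candidate_blocks
    ])
    token_scores = keys[candidate_tokens] @ query
    order = np.argsort(token_scores)[::-1][:top_k]
    return candidate_tokens[order]

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 8))
query = rng.standard_normal(8)
print(centroid_then_token_retrieval(query, keys))
```

The point of the two stages is cost: stage 1 touches only one centroid per block, so the full token-level scan is replaced by a scan over a small candidate set, trading a little recall for far fewer memory accesses.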

📝 Abstract
Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection struggle with trade-offs: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors adjacent in position exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized system for indexing construction and search using CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation. Meanwhile, CTKVR delivers 3 times and 4 times throughput speedups on Llama-3-8B and Yi-9B at 96K context length across diverse GPU hardware.
Problem

Research questions and friction points this paper is trying to address.

Improves KV cache retrieval efficiency for long-context LLMs
Reduces memory overhead and latency in long-context inference
Balances accuracy and speed in dynamic KV selection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage centroid-then-token KV retrieval strategy
Lightweight centroid precomputation during prefilling phase
CPU-GPU co-execution for optimized indexing and search
Kuan Lu (Zhejiang University, China)
Shuhang Lin (Rutgers, PhD student, NLP)
Sai Wu (Professor, Zhejiang University; Distributed Database, AI for DB)
Yichen Yao (ShanghaiTech, computer vision)
Junhan Yang (INFLY Tech, China)
Huan Li (Zhejiang University, China)
Wei Chu (INFLY Tech, China)
Xu Yinghui (INFLY Tech, China)
Yuan Qi (INFLY Tech, China)
Gang Chen (Zhejiang University, China)