🤖 AI Summary
To address the high memory overhead and access latency of KV caches in long-context LLM inference, this paper proposes a centroid-guided two-stage KV retrieval paradigm. In the first stage, exploiting the high similarity among position-adjacent query vectors after the RoPE transformation, a lightweight block-level centroid index is constructed; in the second stage, precise token-level retrieval is performed within the candidate blocks. The method introduces a CPU-GPU collaborative indexing and search mechanism with minimal overhead, enabling dynamic KV cache compression and selective retrieval. On 96K-context benchmarks, it achieves 3× and 4× throughput improvements for Llama-3-8B and Yi-9B, respectively, with under 1% accuracy degradation, significantly outperforming existing block-level and token-level approaches.
📝 Abstract
Large language models (LLMs) are increasingly applied in long-context scenarios such as multi-turn conversations. However, long contexts pose significant challenges for inference efficiency, including high memory overhead from the Key-Value (KV) cache and increased latency due to excessive memory accesses. Recent methods for dynamic KV selection face a trade-off: block-level indexing degrades accuracy by retrieving irrelevant KV entries, while token-level indexing incurs high latency from inefficient retrieval mechanisms. In this paper, we propose CTKVR, a novel centroid-then-token KV retrieval scheme that addresses these limitations. CTKVR leverages a key observation: query vectors at adjacent positions exhibit high similarity after Rotary Position Embedding (RoPE) and share most of their top-k KV cache entries. Based on this insight, CTKVR employs a two-stage retrieval strategy: lightweight centroids are precomputed during prefilling for centroid-grained indexing, followed by token-level refinement for precise KV retrieval. This approach balances retrieval efficiency and accuracy. To further enhance performance, we implement an optimized index construction and search system based on CPU-GPU co-execution. Experimentally, CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation, while delivering 3× and 4× throughput speedups on Llama-3-8B and Yi-9B, respectively, at 96K context length across diverse GPU hardware.
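The centroid-then-token idea described above can be illustrated with a minimal sketch: score block centroids (mean key vector per block) against the query first, keep only the top-scoring blocks, then run exact token-level top-k inside those candidates. This is an assumption-laden toy on NumPy arrays, not the paper's CPU-GPU implementation; the function name, block size, and mean-pooled centroids are illustrative choices.

```python
import numpy as np

def centroid_then_token_topk(q, keys, block_size=16, n_blocks_keep=4, k=8):
    """Toy two-stage KV retrieval: centroid-grained filtering, then
    token-level refinement within the surviving blocks (illustrative only)."""
    n = keys.shape[0]
    n_blocks = (n + block_size - 1) // block_size
    # Stage 1: lightweight block-level index — one centroid (mean key) per block.
    centroids = np.stack([
        keys[b * block_size : min((b + 1) * block_size, n)].mean(axis=0)
        for b in range(n_blocks)
    ])
    block_scores = centroids @ q
    top_blocks = np.argsort(block_scores)[::-1][:n_blocks_keep]
    # Stage 2: exact token-level scores, but only inside the candidate blocks.
    cand = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in top_blocks
    ])
    token_scores = keys[cand] @ q
    topk_local = np.argsort(token_scores)[::-1][:k]
    return cand[topk_local]  # global indices of the selected KV entries
```

Because adjacent queries share most of their top-k entries, the centroid stage can amortize its cost over several decoding steps while the cheap token-level stage preserves accuracy.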