FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

📅 2025-05-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context LLM inference, KV cache read latency increases significantly with context length, while existing page-level retrieval methods suffer from low precision due to sparse distribution of critical tokens. To address this, we propose a fine-grained KV cache retrieval method: (1) introducing 1-bit quantized keys for token-level importance estimation; (2) integrating dynamic query-relevance matching; and (3) employing importance-score-driven cache management. This approach transcends the coarse granularity of page-level methods, enabling precise identification and retention of sparse, high-value tokens. Experiments demonstrate that our method restores full KV cache performance using only 11% of the original cache budget, reduces decoding latency by 1.2–1.5×, and substantially improves inference efficiency for long-context workloads.
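The retrieval idea in steps (1) and (2) can be illustrated with a small sketch: quantize each key vector to its sign bits plus a per-token scale, approximate the attention logit q·k with these 1-bit keys, and keep only the top-scoring tokens under a cache budget. This is a hypothetical toy in NumPy, not the paper's implementation; the function names and the per-token mean-absolute scale are assumptions for illustration.

```python
import numpy as np

def quantize_keys_1bit(K):
    """1-bit quantization: sign bits plus a per-token scale.
    K: (num_tokens, head_dim) array of key vectors."""
    signs = np.sign(K)                               # +1/-1 per element
    scales = np.abs(K).mean(axis=1, keepdims=True)   # per-token scale factor
    return signs, scales

def estimate_token_importance(q, signs, scales):
    """Approximate the attention logits q . k using only the 1-bit keys."""
    return (signs * scales) @ q                      # shape: (num_tokens,)

def retrieve_top_tokens(q, K, budget):
    """Return indices of the most query-relevant tokens under a cache budget."""
    signs, scales = quantize_keys_1bit(K)
    scores = estimate_token_importance(q, signs, scales)
    k = max(1, int(budget * len(K)))                 # e.g. 11% of the tokens
    return np.argsort(scores)[-k:][::-1]             # top-k, highest first

rng = np.random.default_rng(0)
K = rng.standard_normal((1000, 64))
q = rng.standard_normal(64)
idx = retrieve_top_tokens(q, K, budget=0.11)
print(len(idx))  # 110 tokens retained out of 1000
```

Because the estimate needs only sign bits and a scalar per token, scoring every token stays cheap even for very long contexts, which is what makes token-level (rather than page-level) selection practical.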

📝 Abstract
The Key-Value (KV) cache reading latency increases significantly with context length, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a Fine-Grained and Efficient KV cache Retrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments show that Fier matches full KV performance using only 11% of the cache budget across various long-context tasks, reducing decoding latency by 1.2× to 1.5×.
Problem

Research questions and friction points this paper is trying to address.

KV cache reading latency increases with context length
Existing page-level retrieval misses sparse important tokens
Need efficient fine-grained KV cache retrieval method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained KV cache retrieval method
Uses 1-bit quantized keys
Reduces decoding latency despite sparse distribution of important tokens
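The summary also mentions importance-score-driven cache management as a third component. One simple way to picture it: keep the cache within a fixed token budget by evicting the token with the lowest importance score whenever a new entry would exceed it. The class below is a hypothetical toy sketch under that assumption, not the paper's actual eviction policy.

```python
import numpy as np

class ScoreDrivenKVCache:
    """Toy KV cache that evicts the lowest-importance token when over budget.
    A hypothetical sketch of importance-score-driven cache management."""

    def __init__(self, budget):
        self.budget = budget   # maximum number of tokens retained
        self.keys = []         # per-token key vectors
        self.values = []       # per-token value vectors
        self.scores = []       # per-token importance scores

    def append(self, k, v, score):
        self.keys.append(k)
        self.values.append(v)
        self.scores.append(score)
        if len(self.keys) > self.budget:
            self._evict()

    def _evict(self):
        # Drop the token with the lowest importance score.
        worst = int(np.argmin(self.scores))
        for buf in (self.keys, self.values, self.scores):
            buf.pop(worst)

cache = ScoreDrivenKVCache(budget=4)
for i, s in enumerate([0.9, 0.1, 0.8, 0.3, 0.7]):
    cache.append(np.full(2, float(i)), np.full(2, float(i)), s)
print(len(cache.keys), sorted(cache.scores))  # 4 [0.3, 0.7, 0.8, 0.9]
```

Here the fifth append pushes the cache over budget, so the token with score 0.1 is evicted; in a real system the scores would come from the quantized-key relevance estimate rather than being supplied by hand.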