RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory and bandwidth bottlenecks that the KV cache imposes on large language models during long-context inference. It formulates cache eviction and quantization jointly as a rate-distortion optimization problem, in which the two are end-points of a single bit allocation scheme and distortion is measured by the effect of compression on the attention computation. Within this framework, the authors propose RDKV, which uses reverse water-filling to assign each token or channel a bit-width between full precision and zero bits, applied once after the prefilling stage. Across LongBench, RULER, and InfiniteBench, RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench, it recovers 97.81% of full-cache accuracy while retaining only 2.48% of the cache. At a 128K context length, it achieves a 4.5× decode speedup and a 1.9× peak memory reduction relative to full-cache FlashAttention-2 decoding.

📝 Abstract

Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
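To make the bit allocation idea concrete, here is a minimal sketch of textbook reverse water-filling applied to per-token weights. It is not the authors' implementation: the function name `reverse_water_filling`, the `avg_bits` and `max_bits` parameters, the example weights, and the distortion model (weighted error shrinking as `w * 2**(-2*b)`) are all assumptions made for illustration. The abstract does not specify how RDKV computes its weights or maps the result onto supported bit-widths.

```python
import math


def reverse_water_filling(weights, avg_bits, max_bits=16.0, iters=100):
    """Hypothetical helper: continuous bit allocation by reverse water-filling.

    Assumes the weighted distortion of item i at b_i bits is w_i * 2**(-2*b_i).
    Minimizing the total under a bit budget gives
        b_i = clip(0.5 * log2(w_i / theta), 0, max_bits),
    where the "water level" theta is chosen so the budget is met.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and strictly positive")
    if not 0 <= avg_bits <= max_bits:
        raise ValueError("avg_bits must lie in [0, max_bits]")

    budget = avg_bits * len(weights)
    log_w = [math.log2(w) for w in weights]

    def bits_at(log_theta):
        # Items whose weight falls below the water level get 0 bits (eviction).
        return [min(max_bits, max(0.0, 0.5 * (lw - log_theta))) for lw in log_w]

    # Total bits is non-increasing in theta, so bisect on log2(theta).
    lo = min(log_w) - 2 * max_bits  # every item saturates at max_bits
    hi = max(log_w)                 # every item gets 0 bits
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(bits_at(mid)) > budget:
            lo = mid  # over budget: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    return bits_at(hi)


# Illustrative weights only; a real system would derive them from attention.
bits = reverse_water_filling([8.0, 4.0, 1.0, 0.01], avg_bits=1.0)
print([round(b, 3) for b in bits])
```

With these example weights, the three heavier items receive roughly 2, 1.5, and 0.5 bits, and the lightest item receives 0 bits, which corresponds to eviction. A practical scheme would additionally round the continuous allocation to bit-widths the hardware supports, a step this sketch leaves out.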

Problem

Research questions and friction points this paper is trying to address.

KV cache compression

rate-distortion optimization

memory bottleneck

large language models

long-context inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rate-Distortion Optimization

KV Cache Compression

Joint Eviction and Quantization