🤖 AI Summary
Chain-of-thought (CoT) reasoning in large language models often generates verbose outputs, causing explosive growth in KV cache size and severely degrading inference efficiency. Existing training-free compression methods struggle to balance cache reduction with reasoning accuracy, and often fail outright in CoT scenarios. This paper introduces a training-free, redundancy-aware KV cache compression method tailored to reasoning models. The authors propose a dynamic KV selection algorithm that jointly leverages attention similarity and token-level semantic redundancy, integrated into a plug-and-play cache pruning framework. Experiments demonstrate that retaining only 10% of the original KV cache preserves nearly 100% of full-cache accuracy, reduces memory usage by 90%, and improves throughput by 6.6×. On mathematical reasoning tasks, using merely 16% of the cache achieves 105% of full-cache accuracy, outperforming all prior baselines across metrics.
📝 Abstract
Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV cache reduction also yields a 90% memory saving and a 6.6× throughput improvement over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
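To make the core idea concrete, the sketch below shows one possible way to combine an attention-based importance signal with a semantic-redundancy signal when choosing which cached tokens to keep. This is a minimal illustrative toy, not the paper's actual R-KV algorithm: the function name, the specific scoring formula, and the `alpha` trade-off parameter are all assumptions introduced here for illustration.

```python
import numpy as np

def select_kv_tokens(keys, attn_received, budget, alpha=0.5):
    """Toy redundancy-aware KV selection (illustrative sketch only;
    not the paper's exact R-KV algorithm).

    keys:          (T, d) array of cached key vectors
    attn_received: (T,) accumulated attention each cached token received
    budget:        number of token positions to retain
    alpha:         assumed trade-off between importance and non-redundancy
    """
    # Importance signal: normalized accumulated attention per token.
    importance = attn_received / (attn_received.sum() + 1e-8)

    # Redundancy signal: each token's max cosine similarity to any
    # other cached token; near-duplicates score high.
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -1.0)  # ignore self-similarity
    redundancy = sim.max(axis=1)

    # Joint score: prefer tokens that are attended to AND not redundant.
    score = alpha * importance + (1 - alpha) * (1.0 - redundancy)

    # Keep the top-`budget` positions, preserving their original order.
    keep = np.argsort(score)[-budget:]
    return np.sort(keep)
```

Under this kind of scheme, pruning down to a fixed budget (e.g. 10% of cached tokens) removes near-duplicate reasoning steps first, which is consistent with the paper's observation that CoT traces contain substantial token-level redundancy.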