Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Chain-of-thought reasoning incurs substantial KV cache overhead, and existing compression methods often compromise reasoning integrity. Method: the paper first shows that attention heads are functionally heterogeneous during inference and proposes RLKV, a reinforcement learning based framework for identifying reasoning-critical attention heads. RLKV uses end-to-end reward feedback to pinpoint the heads essential for reasoning and preserves their KV caches losslessly, while applying constant-ratio compression to all other heads; this yields a decoupled, per-head KV cache allocation. Results: at 20%–50% KV cache compression, RLKV achieves near-lossless inference performance (<0.5% accuracy degradation), significantly outperforming baselines such as token dropping or uniform compression, and establishes a practical paradigm for efficient deployment of reasoning-intensive large language models.
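The identification step can be pictured with a toy sketch. The mechanics below are an assumption for illustration, not RLKV's actual algorithm: each head carries a probability of being kept lossless, binary keep/compress gates are sampled, the gated configuration is scored by a reward, and probabilities drift toward high-reward gates (a REINFORCE-style update). The reward function and all names here are hypothetical stand-ins for the paper's end-to-end reward on generated samples.

```python
import random

def identify_critical_heads(num_heads, reward_fn, steps=2000, lr=0.05):
    """Each head gets a keep-lossless probability; binary gates are sampled,
    the gated configuration is scored by reward_fn, and probabilities drift
    toward gates that earned high reward."""
    probs = [0.5] * num_heads
    for _ in range(steps):
        gates = [1 if random.random() < p else 0 for p in probs]
        r = reward_fn(gates)
        for i, g in enumerate(gates):
            # Move the probability toward the sampled gate, scaled by reward.
            probs[i] = min(max(probs[i] + lr * r * (g - probs[i]), 0.01), 0.99)
    return probs

# Toy stand-in reward: pretend heads 0 and 3 are reasoning-critical, with a
# small per-head penalty playing the role of the cache budget.
random.seed(0)
reward = lambda gates: gates[0] + gates[3] - 0.2 * sum(gates)
keep_probs = identify_critical_heads(8, reward)
print([round(p, 2) for p in keep_probs])
```

Under this toy reward, the keep probabilities of heads 0 and 3 rise while the rest fall, mirroring the paper's finding that only a small fraction of heads needs a full cache.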

📝 Abstract
Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20–50% cache reduction with near lossless performance compared to uncompressed results.
Problem

Research questions and friction points this paper is trying to address.

Identifies reasoning-critical attention heads in large language models
Reduces KV cache overhead while maintaining reasoning integrity
Uses reinforcement learning to optimize head compression for reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning identifies reasoning-critical attention heads
Allocates full cache to critical heads and compressed cache to others
Achieves 20-50% KV cache reduction with near-lossless performance
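The decoupled allocation in the bullets above can be sketched in a few lines. This is a minimal illustration under assumed names (the `PerHeadKVCache` class and `budget` parameter are hypothetical), with a fixed sliding window standing in for whatever constant-ratio compression is actually applied:

```python
from collections import deque

class PerHeadKVCache:
    """Reasoning-critical heads keep every (key, value) pair; all other
    heads are capped at a constant per-head budget."""

    def __init__(self, critical_heads, budget=4):
        self.critical = set(critical_heads)  # indices of lossless heads
        self.budget = budget                 # constant cache size for the rest
        self.store = {}                      # head index -> stored entries

    def append(self, head, kv):
        if head not in self.store:
            # Unbounded list for critical heads, fixed-length deque otherwise.
            self.store[head] = [] if head in self.critical else deque(maxlen=self.budget)
        self.store[head].append(kv)

    def size(self, head):
        return len(self.store.get(head, ()))

# Decode 10 tokens: head 0 is "critical", head 1 is compressible.
cache = PerHeadKVCache(critical_heads=[0], budget=4)
for t in range(10):
    cache.append(0, (f"k{t}", f"v{t}"))
    cache.append(1, (f"k{t}", f"v{t}"))
print(cache.size(0), cache.size(1))  # -> 10 4
```

The overall cache footprint then scales with the (small) number of critical heads plus a constant term for everything else, which is where the 20–50% reduction comes from.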
Wenjie Du (Westlake University)
Li Jiang (McGill University, Mila)
Keda Tao (Westlake University) · Generative Model, Computer Vision, MLLM
Xue Liu (McGill University, MBZUAI)
Huan Wang (Westlake University)