SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

๐Ÿ“… 2025-12-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large reasoning models (LRMs) suffer from memory bottlenecks and reduced throughput due to the linear growth of KV caches induced by long chain-of-thought (CoT) reasoning. Existing token-level KV eviction methods exhibit unstable scoring and padding-induced interference in multi-batch settings, often erroneously discarding semantically critical tokens, thereby lengthening generation and degrading accuracy. This paper proposes a training-free sentence-level KV compression method. It introduces a semantic-aware sentence-scoring mechanism, coupled with dynamic steering-vector guidance and hidden-state updates, enabling fine-grained KV deletion, coarse-grained token skipping, and controlled generation suppression. Evaluated across multiple reasoning benchmarks, the approach achieves up to 26.7% higher accuracy than state-of-the-art methods, shortens average generation length by up to 1.6×, and improves throughput by up to 1.7×.

๐Ÿ“ Abstract
Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, since the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Toward reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates via coarse-grained sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, enforcing concise responses from the LRM. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which maintains up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×.
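The sentence-scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the choice of cosine similarity over mean-pooled hidden states, the max-over-earlier-sentences score, and the eviction threshold are all assumptions made here for concreteness.

```python
import numpy as np

def sentence_scores(hidden_states, sentence_spans):
    """Score each sentence by its redundancy against earlier sentences.

    hidden_states: (num_tokens, d) array of per-token hidden states.
    sentence_spans: list of (start, end) token index ranges, one per sentence.

    Each sentence is mean-pooled into one embedding; its score is the
    maximum cosine similarity to any earlier sentence, so near-duplicate
    sentences (whose KV entries are eviction candidates) score high.
    """
    embs = []
    for start, end in sentence_spans:
        v = hidden_states[start:end].mean(axis=0)
        embs.append(v / (np.linalg.norm(v) + 1e-8))  # unit-normalize
    embs = np.stack(embs)

    scores = np.zeros(len(embs))
    for i in range(1, len(embs)):
        # Cosine similarity against all preceding sentence embeddings.
        scores[i] = float(np.max(embs[:i] @ embs[i]))
    return scores

def evict_spans(sentence_spans, scores, threshold=0.9):
    """Return token ranges whose KV cache entries can be dropped."""
    return [span for span, s in zip(sentence_spans, scores) if s > threshold]
```

For example, if the third sentence repeats the first almost verbatim, its score approaches 1.0 and its whole token span is evicted at once, avoiding the per-token scoring instability the abstract attributes to prior eviction methods.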
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache overhead in large reasoning models
Maintains accuracy in multi-batch inference settings
Suppresses redundant generation to improve throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free KV compression method for selective eviction
Sentence-level scoring to remove similar sentences
Dynamic steering vector to suppress redundant generation
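The steering-vector idea above can be sketched as follows. This is a hedged illustration built on the common difference-of-means activation-steering construction; the paper's actual vector construction, the `redundancy` signal, and the `alpha_max` scale are assumptions introduced here, not details from the source.

```python
import numpy as np

def conciseness_steering_vector(concise_acts, verbose_acts):
    """Difference-of-means steering vector: points from the mean hidden
    activation of 'verbose' traces toward that of 'concise' traces
    (a standard activation-steering recipe, assumed here)."""
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)

def steer_hidden(hidden, steer_vec, redundancy, alpha_max=2.0):
    """Add the steering vector to a hidden state at inference time.

    The strength scales with a redundancy score in [0, 1], so the push
    toward concise generation grows as the reasoning trace becomes
    more repetitive (the dynamic-adjustment aspect described above).
    """
    alpha = alpha_max * float(np.clip(redundancy, 0.0, 1.0))
    return hidden + alpha * steer_vec
```

Being training-free, the whole mechanism runs at inference time: the vector is computed once from a handful of example activations, and only a scalar gain is adjusted per decoding step.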