SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

๐Ÿ“… 2025-12-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Large reasoning models (LRMs) suffer from memory bottlenecks and reduced throughput due to the linear growth of KV caches induced by long chain-of-thought (CoT) reasoning. Existing token-level KV eviction methods exhibit unstable scoring and padding-induced interference in multi-batch settings, often erroneously discarding semantically critical tokens, thereby lengthening generation and degrading accuracy. This paper proposes a training-free sentence-level KV compression method. It introduces a semantic-aware sentence-scoring mechanism, coupled with dynamic steering-vector guidance and hidden-state updates, enabling fine-grained KV deletion, coarse-grained token skipping, and controlled generation suppression. Evaluated across multiple reasoning benchmarks, the approach achieves up to 26.7% higher accuracy than state-of-the-art methods, shortens average generation length by up to 1.6×, and improves throughput by up to 1.7×.

๐Ÿ“ Abstract
Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, since the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Toward reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates via coarse-grained sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, enforcing concise responses from the LRM. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which maintains up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×.
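The sentence-scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the choice of cosine similarity over mean-pooled hidden states, the max-over-earlier-sentences score, and the eviction threshold are all assumptions made here for concreteness.

```python
import numpy as np

def sentence_scores(hidden_states, sentence_spans):
    """Score each sentence by its redundancy against earlier sentences.

    hidden_states: (num_tokens, d) array of per-token hidden states.
    sentence_spans: list of (start, end) token index ranges, one per sentence.

    Each sentence is mean-pooled into one embedding; its score is the
    maximum cosine similarity to any earlier sentence, so near-duplicate
    sentences (whose KV entries are eviction candidates) score high.
    """
    embs = []
    for start, end in sentence_spans:
        v = hidden_states[start:end].mean(axis=0)
        embs.append(v / (np.linalg.norm(v) + 1e-8))  # unit-normalize
    embs = np.stack(embs)

    scores = np.zeros(len(embs))
    for i in range(1, len(embs)):
        # Cosine similarity against all preceding sentence embeddings.
        scores[i] = float(np.max(embs[:i] @ embs[i]))
    return scores

def evict_spans(sentence_spans, scores, threshold=0.9):
    """Return token ranges whose KV cache entries can be dropped."""
    return [span for span, s in zip(sentence_spans, scores) if s > threshold]
```

For example, if the third sentence repeats the first almost verbatim, its score approaches 1.0 and its whole token span is evicted at once, avoiding the per-token scoring instability the abstract attributes to prior eviction methods.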
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache overhead in large reasoning models
Maintains accuracy in multi-batch inference settings
Suppresses redundant generation to improve throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free KV compression method for selective eviction
Sentence-level scoring to remove similar sentences
Dynamic steering vector to suppress redundant generation
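The steering-vector idea above can be sketched as follows. This is a hedged illustration built on the common difference-of-means activation-steering construction; the paper's actual vector construction, the `redundancy` signal, and the `alpha_max` scale are assumptions introduced here, not details from the source.

```python
import numpy as np

def conciseness_steering_vector(concise_acts, verbose_acts):
    """Difference-of-means steering vector: points from the mean hidden
    activation of 'verbose' traces toward that of 'concise' traces
    (a standard activation-steering recipe, assumed here)."""
    return concise_acts.mean(axis=0) - verbose_acts.mean(axis=0)

def steer_hidden(hidden, steer_vec, redundancy, alpha_max=2.0):
    """Add the steering vector to a hidden state at inference time.

    The strength scales with a redundancy score in [0, 1], so the push
    toward concise generation grows as the reasoning trace becomes
    more repetitive (the dynamic-adjustment aspect described above).
    """
    alpha = alpha_max * float(np.clip(redundancy, 0.0, 1.0))
    return hidden + alpha * steer_vec
```

Being training-free, the whole mechanism runs at inference time: the vector is computed once from a handful of example activations, and only a scalar gain is adjusted per decoding step.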