MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) excel at mathematical and formal logical tasks, yet their long chain-of-thought reasoning incurs prohibitive KV cache memory overhead, severely limiting deployment efficiency. To address this, we propose MemShare, a collaborative-filtering-based KV cache sharing method. We first identify significant semantic similarity among intermediate reasoning steps across sequences, then design a lightweight cache block matching mechanism enabling zero-copy cross-sequence KV reuse. MemShare requires no model architecture modifications or retraining and is fully compatible with mainstream inference frameworks. Experiments demonstrate that, while preserving or even improving reasoning accuracy, MemShare achieves up to 84.79% higher throughput compared to baseline KV cache management schemes, alongside substantial memory reduction. This work establishes a novel paradigm for efficient LRM inference.

📝 Abstract
Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks, enabling zero-copy cache reuse that significantly reduces memory overhead and improves throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to an 84.79% improvement in throughput while maintaining better accuracy compared to existing KV cache management methods.
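The block-matching idea in the abstract can be sketched as a similarity search over fixed-size cache blocks: a new block is reused only if an existing pooled block is sufficiently similar. This is a minimal illustrative sketch, not the paper's actual algorithm; the function names, the cosine metric, and the 0.95 threshold are all assumptions.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length flattened KV blocks."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def find_reusable_block(new_block, block_pool, threshold=0.95):
    """Return the index of the most similar pooled block at or above the
    threshold, or None if no block is similar enough to reuse.
    (Illustrative: the paper's matching mechanism may differ.)"""
    best_idx, best_sim = None, threshold
    for i, blk in enumerate(block_pool):
        s = cosine_sim(new_block, blk)
        if s >= best_sim:
            best_idx, best_sim = i, s
    return best_idx
```

When a match is found, the new block's KV data need not be written at all; the sequence can instead reference the existing block, which is where the memory saving comes from.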
Problem

Research questions and friction points this paper is trying to address.

Reduce memory overhead in Large Reasoning Models
Identify reusable KV cache blocks efficiently
Improve throughput while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache reuse for memory efficiency
Collaborative filtering identifies reusable cache blocks
Zero-copy cache reuse maintains accuracy
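The zero-copy reuse above can be illustrated with a paged-cache-style block table: each sequence holds indices into a shared physical pool, so reusing a block means appending an existing index and bumping a reference count rather than copying KV data. This is a hedged sketch under that assumption; `BlockPool` and its methods are hypothetical names, not MemShare's API.

```python
class BlockPool:
    """Shared physical pool of KV blocks with reference counting.
    Sequences store only indices into this pool (their block tables)."""

    def __init__(self):
        self.blocks = []      # physical KV block payloads
        self.refcount = []    # one reference count per physical block

    def allocate(self, payload):
        """Store a new physical block and return its pool index."""
        self.blocks.append(payload)
        self.refcount.append(1)
        return len(self.blocks) - 1

    def share(self, idx):
        """Zero-copy reuse: no payload is duplicated, only the index
        is handed to another sequence's block table."""
        self.refcount[idx] += 1
        return idx

    def release(self, idx):
        """Drop one reference; a real pool would free at zero."""
        self.refcount[idx] -= 1
```

Because `share` touches only an integer index and a counter, cross-sequence reuse adds no copy cost, which is consistent with the throughput gains the paper reports.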