MEPIC: Memory Efficient Position Independent Caching for LLM Serving

📅 2025-12-18
🤖 AI Summary
In long-context LLM serving, KV-cache memory overhead is substantial, and existing prefix caching and position-independent caching (PIC) achieve low HBM sharing rates due to strict prefix-matching constraints or block-level KV inconsistency. Method: the paper proposes a page-aligned, block-level position-independent caching mechanism. It introduces a novel RoPE-fused positional-encoding kernel, block-level selective recomputation, and dynamic PE-offset elimination, which together make document/code-chunk KV caches fully consistent and shareable across requests, batches, and positions without model modification. Contribution/Results: the method removes conventional positional and memory-layout constraints on caching. It reduces HBM usage by up to 2× over state-of-the-art PIC, and by up to 5× in long-prompt scenarios, while preserving low latency and the original model's accuracy.

📝 Abstract
Modern LLM applications such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems repeatedly process long prompt histories containing shared document or code chunks, creating significant pressure on the Key-Value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching alleviates some of these costs by reusing the KV cache for previously processed tokens, but it is limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. Because these operations vary across queries, the KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, preventing page sharing. These issues result in only modest HBM savings even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from the token level to the block level so that only the first block is request-specific, removes positional encodings via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes the remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and up to 5x for long prompts, without any model changes.
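The RoPE-fusion idea in the abstract can be illustrated with a small sketch: a chunk's keys are cached once without any positional encoding, and the rotary rotation for the chunk's actual offset is applied only at attention time. The `rope` helper below is an illustrative stand-in for the paper's fused kernel, not MEPIC's actual implementation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate one head vector (even length) to absolute position pos
    using standard rotary position embedding (RoPE)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per dimension pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out

# A chunk's keys are cached once, WITHOUT positional encoding:
k_chunk = [[0.1 * (t * 8 + j) for j in range(8)] for t in range(4)]  # 4 tokens, dim 8

# Request A places the chunk at offset 10, request B at offset 100;
# a fused kernel would rotate the same shared cached bytes on the fly:
k_at_10 = [rope(v, 10 + t) for t, v in enumerate(k_chunk)]
k_at_100 = [rope(v, 100 + t) for t, v in enumerate(k_chunk)]
```

Because the rotation is a pure function of the position offset, the position-free cache entry is byte-identical across requests, which is what makes the pages shareable in HBM.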
Problem

Research questions and friction points this paper is trying to address.

Optimizes KV cache reuse across varying positions and requests
Reduces memory usage for long prompts in LLM serving
Enables chunk-level sharing without model modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns chunk KV to paged storage for memory sharing
Shifts recomputation to block-level for request-specific handling
Removes positional encodings via RoPE fusion in attention kernel
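The block-level scheme in these bullets can be sketched as a content-addressed page table: every block of a chunk except the first is looked up by content hash and shared across requests, while the first block is recomputed per request to absorb cross-chunk attention. All names here (`PagePool`, `map_chunk`, the block size) are illustrative assumptions, not MEPIC's API.

```python
import hashlib

BLOCK = 16  # tokens per KV page (illustrative choice)

class PagePool:
    """Toy page table: shared pages are deduplicated by content hash."""
    def __init__(self):
        self.pages = {}    # content hash -> page id
        self.next_id = 0

    def get_shared(self, chunk_id, block_idx):
        key = hashlib.sha256(f"{chunk_id}:{block_idx}".encode()).hexdigest()
        if key not in self.pages:           # compute the block's KV once,
            self.pages[key] = self.next_id  # then every request reuses it
            self.next_id += 1
        return self.pages[key]

    def alloc_private(self):
        pid = self.next_id                  # request-specific page
        self.next_id += 1
        return pid

def map_chunk(pool, chunk_id, n_tokens):
    n_blocks = (n_tokens + BLOCK - 1) // BLOCK
    # Block 0 is selectively recomputed per request; the rest are shared.
    return [pool.alloc_private()] + [
        pool.get_shared(chunk_id, b) for b in range(1, n_blocks)
    ]

pool = PagePool()
req_a = map_chunk(pool, "doc-42", 64)  # 4 blocks of 16 tokens
req_b = map_chunk(pool, "doc-42", 64)
# req_a and req_b share pages for blocks 1..3; only block 0 is duplicated.
```

Under this layout, N requests reusing the same chunk pay for one private page each instead of N full copies, which is the source of the reported HBM savings.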
Qian Wang
Huawei Technologies Canada Co., Ltd.
Zahra Yousefijamarani
Huawei Technologies Canada Co., Ltd.
Morgan Lindsay Heisler
Huawei Technologies Canada Co., Ltd.
Rongzhi Gu
Tencent AI Lab
Speech separation
Xiaolong Bai
Huawei Technologies Co., Ltd.
Shan Yizhou
Huawei Technologies Co., Ltd.
Wei Zhang
Huawei Technologies Canada Co., Ltd.
Wang Lan
Huawei Technologies Canada Co., Ltd.
Ying Xiong
Clausthal University of Technology
Petroleum geology, Sedimentology, Geochemistry
Yong Zhang
Huawei Technologies Canada Co., Ltd.
Zhenan Fan
Staff Researcher at Huawei Technologies Canada
Optimization, Large Language Model