MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

📅 2025-11-08
🏛️ IEEE Computer Architecture Letters
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the memory bottleneck and low GPU utilization caused by KV caching in long-context LLM inference, this paper proposes Mixture of Shared KV Attention (MoSKA). Methodologically, MoSKA explicitly distinguishes request-specific from shared context across concurrent requests to enable cross-request KV cache reuse; reformulates shared-KV attention from memory-bound GEMV operations into compute-bound batched GEMM; and integrates an MoE-inspired sparse attention mechanism with a software-hardware co-designed disaggregated architecture. Experimental results demonstrate that, under high-context-sharing workloads, MoSKA achieves up to 538.7× higher throughput than baseline systems, significantly improving scalability and hardware efficiency for long-sequence inference.

๐Ÿ“ Abstract
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7x over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
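The GEMV-to-GEMM reformulation at the core of Shared KV Attention can be sketched numerically: when many concurrent requests attend over the same shared KV prefix, their per-request matrix-vector products collapse into one matrix-matrix product. This is a minimal illustrative sketch, not the paper's implementation; the sizes and the per-step softmax are assumptions.

```python
import numpy as np

# Hypothetical sizes: B concurrent decode-step requests over one shared prefix.
B, d, n = 8, 64, 4096              # requests, head dim, shared-context length
rng = np.random.default_rng(0)
K_shared = rng.standard_normal((n, d))   # shared keys (one copy for all requests)
V_shared = rng.standard_normal((n, d))   # shared values
Q = rng.standard_normal((B, d))          # one query vector per request

# Naive path: B separate GEMVs against the same shared KV. Memory-bound,
# since K_shared and V_shared are streamed from memory once per request.
out_gemv = np.empty((B, d))
for b in range(B):
    scores = K_shared @ Q[b] / np.sqrt(d)        # (n,) GEMV
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out_gemv[b] = probs @ V_shared               # (d,) GEMV

# Batched path: one GEMM over all queries, amortizing the KV reads across
# requests. Compute-bound, as in the Shared KV Attention reformulation.
scores = Q @ K_shared.T / np.sqrt(d)             # (B, n) GEMM
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
out_gemm = probs @ V_shared                      # (B, d) GEMM

assert np.allclose(out_gemv, out_gemm)           # identical results, one KV read
```

Both paths produce the same attention outputs; the batched form simply reuses each loaded KV element across all B queries, which is what shifts the kernel's arithmetic intensity from GEMV-like to GEMM-like.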
Problem

Research questions and friction points this paper is trying to address.

Addresses KV cache bottleneck in long-sequence LLM inference
Exploits context heterogeneity between unique and shared sequences
Transforms memory-bound attention into compute-bound operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shared KV Attention converts GEMV to GEMM
MoE-inspired sparse attention prunes search space
Disaggregated Infrastructure specializes hardware for data
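The MoE-inspired pruning above can be sketched as routing each query to a small subset of shared-KV chunks, treating chunks like experts. The chunk-summary scoring and top-k rule below are illustrative assumptions, not the paper's actual router.

```python
import numpy as np

# Assumed setup: the shared KV cache is partitioned into chunks ("experts");
# each query attends only within its top-k highest-scoring chunks.
d, n_chunks, chunk_len, k = 64, 16, 256, 4
rng = np.random.default_rng(0)
K = rng.standard_normal((n_chunks, chunk_len, d))  # keys, chunked
V = rng.standard_normal((n_chunks, chunk_len, d))  # values, chunked
q = rng.standard_normal(d)                         # one query vector

# Route: score each chunk by a cheap summary (mean key), keep the top k.
chunk_scores = K.mean(axis=1) @ q                  # (n_chunks,)
topk = np.argsort(chunk_scores)[-k:]               # selected chunk indices

# Attend only within selected chunks: the search space shrinks by k/n_chunks.
K_sel = K[topk].reshape(-1, d)                     # (k*chunk_len, d)
V_sel = V[topk].reshape(-1, d)
scores = K_sel @ q / np.sqrt(d)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
out = probs @ V_sel                                # (d,) sparse attention output
```

The design point is that routing cost is O(n_chunks) while full attention is O(n_chunks * chunk_len), so a cheap router plus dense attention over a few chunks prunes most of the shared context.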
Myunghyun Rhee
SK hynix Inc., Icheon-si, Gyeonggi-do 17336, Republic of Korea
Sookyung Choi
SK hynix Inc., Icheon-si, Gyeonggi-do 17336, Republic of Korea
Euiseok Kim
SK hynix Inc., Icheon-si, Gyeonggi-do 17336, Republic of Korea
Joonseop Sim
SK Hynix
Computer architecture · Memory Hierarchy · Data analytics
Youngpyo Joo
SK hynix Inc., Icheon-si, Gyeonggi-do 17336, Republic of Korea
Hoshik Kim
SK hynix Inc., Icheon-si, Gyeonggi-do 17336, Republic of Korea