Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

📅 2026-04-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high cost of attention in long-context large language model inference and the difficulty of jointly optimizing sparse attention with GPU-CPU hierarchical memory. The authors propose SPIN, a framework that unifies diverse sparse attention algorithms on a shared paged KV cache. SPIN introduces a locality-aware, bucketed LRU cache replacement policy and a two-level metadata layout sized to the active working set, enabling end-to-end co-design of sparse computation and hierarchical storage. Implemented on top of vLLM, SPIN achieves 1.66–5.66× higher throughput and 7–9× lower time-to-first-token (TTFT) than vLLM, while preserving model accuracy. It also reduces time per output token (TPOT) by up to 58% relative to the original sparse-attention implementations.
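To make the idea of a bucketed LRU policy concrete, here is a minimal sketch in Python. It is not SPIN's implementation: the summary does not specify how buckets are formed, so this version assumes pages are grouped by a coarse access epoch (decode step divided by a hypothetical `steps_per_bucket`) and that eviction drains the oldest bucket first. The class name `BucketedLRU` and all parameters are illustrative. The point of bucketing is that a whole batch of pages can be touched per step without maintaining an exact per-page recency order.

```python
from collections import OrderedDict


class BucketedLRU:
    """Approximate LRU: pages are ordered by coarse access epoch, not exact recency.

    Hypothetical sketch; bucket granularity and eviction order are assumptions.
    """

    def __init__(self, capacity, steps_per_bucket=4):
        self.capacity = capacity
        self.steps_per_bucket = steps_per_bucket
        self.bucket_of = {}           # page id -> bucket id of its last access
        self.buckets = OrderedDict()  # bucket id -> set of page ids, oldest first

    def access(self, pages, step):
        """Touch a batch of pages at a decode step.

        Steps must be non-decreasing. Returns (misses, evicted): pages that
        were not resident, and pages dropped to get back under capacity.
        """
        pages = list(dict.fromkeys(pages))  # drop duplicates, keep order
        if len(pages) > self.capacity:
            raise ValueError("batch does not fit in the cache")
        bucket = step // self.steps_per_bucket
        needed = set(pages)
        misses = [p for p in pages if p not in self.bucket_of]

        for p in pages:
            old = self.bucket_of.get(p)
            if old == bucket:
                continue  # already in the newest bucket
            if old is not None:
                self.buckets[old].discard(p)
                if not self.buckets[old]:
                    del self.buckets[old]
            self.buckets.setdefault(bucket, set()).add(p)
            self.bucket_of[p] = bucket

        evicted = []
        while len(self.bucket_of) > self.capacity:
            oldest = next(iter(self.buckets))
            # Never evict a page that the current step needs.
            victim = next(p for p in self.buckets[oldest] if p not in needed)
            self.buckets[oldest].discard(victim)
            if not self.buckets[oldest]:
                del self.buckets[oldest]
            del self.bucket_of[victim]
            evicted.append(victim)
        return misses, evicted


cache = BucketedLRU(capacity=4, steps_per_bucket=2)
cache.access([1, 2], step=0)
cache.access([3, 4], step=2)
misses, evicted = cache.access([1, 5], step=4)
print(misses, evicted)  # [5] [2]
```

In this sketch, page 1 is re-touched at step 4 and moves to the newest bucket, so the only page left in the oldest bucket (page 2) is the one evicted. Within a bucket the victim is arbitrary, which is the approximation that trades exact LRU order for cheaper bookkeeping.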
📝 Abstract
Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
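As an illustration of the first technique, the sketch below shows one way different sparsity granularities could be mapped onto a shared page-based KV layout. The abstract does not describe the abstraction in detail, so this assumes each partition is a contiguous, fixed-size token range; the function name `partitions_to_pages` and its parameters are hypothetical. Token-level selection corresponds to a partition size of 1, and block-level selection to a partition size larger or smaller than the page size.

```python
def partitions_to_pages(selected, partition_size, page_size):
    """Return the sorted page ids that cover the selected partitions.

    Assumption: partition i covers tokens
    [i * partition_size, (i + 1) * partition_size), and page j covers tokens
    [j * page_size, (j + 1) * page_size).
    """
    pages = set()
    for i in selected:
        first_page = (i * partition_size) // page_size
        last_page = ((i + 1) * partition_size - 1) // page_size
        pages.update(range(first_page, last_page + 1))
    return sorted(pages)


# Token-level sparsity: each selected token is its own partition.
print(partitions_to_pages([0, 5, 17, 40], partition_size=1, page_size=16))  # [0, 1, 2]

# Block-level sparsity with 32-token blocks spanning two 16-token pages each.
print(partitions_to_pages([1], partition_size=32, page_size=16))  # [2, 3]
```

Under these assumptions, every algorithm's selection reduces to a set of page ids, so a single cache manager can serve all of them regardless of the granularity at which the algorithm chooses tokens.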
Problem

Research questions and friction points this paper is trying to address.

long-context LLM serving
sparse attention
hierarchical KV storage
system bottleneck
KV cache
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Hierarchical Memory
KV Cache Management
Long-Context LLM Serving
System Co-design