An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets long-context inference over CPU-resident KV caches, where GPU-only sparse attention designs are limited by PCIe bandwidth and metadata memory overhead, and existing CPU-GPU hybrid designs still suffer from GPU idle time and from bottlenecks in CPU-side top-k selection and sparse attention computation. The paper proposes Fluxion, a CPU-GPU hybrid sparse attention framework that co-optimizes accuracy and system efficiency through output-aware KV budget allocation, head-specific and granularity-aware sparsity configuration, and cross-device coordinated execution. Its key components are a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler, which jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. Experiments across 2 models, 3 benchmarks, and 40 tasks show a worst-case average quality change of only -0.26 relative to the FULL baseline, together with a 1.5–3.7× speedup over the strongest fixed sparse hybrid baseline, which itself uses a KV budget of only 0.05.

📝 Abstract

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well: the worst average degradation is only -0.26 relative to FULL, while delivering 1.5×–3.7× speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
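
To make the terms "block-sparse attention", "top-k selection", and "KV budget" concrete, the following is a minimal sketch of budgeted block selection for a single query and a single head. It is not Fluxion's algorithm: the function names (`select_blocks`, `sparse_attention`), the `budget` parameter as a fraction of blocks to keep, and the mean-key scoring heuristic are all assumptions made for illustration, and the abstract does not specify how blocks are scored.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, budget):
    """Return the sorted indices of the key blocks kept for one query.

    `budget` is the fraction of blocks to keep (hypothetical parameter).
    Each block is scored by the dot product of the query with the block's
    mean key. This scoring rule is an illustrative assumption only.
    """
    num_blocks = math.ceil(len(keys) / block_size)
    k = max(1, math.ceil(budget * num_blocks))  # always keep at least one block
    scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(key[d] for key in block) / len(block)
                    for d in range(len(query))]
        scores.append((dot(query, mean_key), b))
    top = sorted(scores, reverse=True)[:k]  # top-k blocks by score
    return sorted(b for _, b in top)


def sparse_attention(query, keys, values, block_size, budget):
    """Softmax attention restricted to the tokens of the selected blocks."""
    kept = select_blocks(query, keys, block_size, budget)
    idx = [i for b in kept
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(len(values[0]))]


# Toy cache: 8 tokens, 4 blocks of 2 tokens each, 2-dimensional keys.
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
values = [[float(i)] for i in range(8)]
query = [1.0, 0.0]

kept = select_blocks(query, keys, block_size=2, budget=0.25)
out = sparse_attention(query, keys, values, block_size=2, budget=0.25)
```

With a budget of 0.25 over four blocks, only the single block whose keys align with the query is kept, so attention touches 2 of the 8 cached tokens. In a hybrid setting of the kind the abstract describes, the selection step and the sparse attention over host-resident blocks are the CPU-side costs that the scheduler would need to overlap with GPU work; this sketch models none of that overlap.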

Problem

Research questions and friction points this paper is trying to address.

long-context inference

sparse attention

CPU-GPU parallelism

KV cache

hybrid execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid sparse attention

CPU-GPU parallelism

KV cache management