AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

📅 2026-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets two problems in long-context large language model inference: the bottleneck of loading the KV cache during attention computation, and the accuracy loss of existing block-sparse attention methods, which use one block size for every attention head even though heads differ in their sensitivity to block granularity. The paper proposes AB-Sparse, a training-free algorithm-system co-design that allocates block sizes adaptively across attention heads. Lossless block centroid quantization offsets the added memory overhead, and custom GPU kernels support efficient execution with variable block sizes. At comparable throughput, the method improves accuracy by up to 5.43% over existing block-sparse baselines.
📝 Abstract
As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block size allocation across attention heads to improve accuracy. To compensate for the additional memory overhead, it further employs lossless block centroid quantization. In addition, custom GPU kernels are developed to support efficient execution with variable block sizes. Evaluation results demonstrate that AB-Sparse achieves an accuracy improvement of up to 5.43% over existing block sparse attention baselines without throughput overhead.
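The block-sparse idea described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. This is not the AB-Sparse implementation: the function name `block_sparse_attention`, the `token_budget` parameter, and the scoring rule (query dotted with the mean key of each block) are assumptions made here for illustration, since the abstract does not state how block importance is computed or how block sizes are allocated to heads.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, token_budget):
    """Single-query, single-head block-sparse attention (illustrative only).

    Assumption: a block's importance is the dot product of the query with
    the block centroid (the mean of its keys). The scoring rule used by
    AB-Sparse is not specified in the abstract.
    """
    n, d = len(keys), len(q)
    # Partition the KV cache into contiguous blocks of `block_size` tokens.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # One centroid per block: the element-wise mean of the block's keys.
    centroids = [
        [sum(keys[i][j] for i in blk) / len(blk) for j in range(d)]
        for blk in blocks
    ]
    # Keep the highest-scoring blocks that fit in the token budget
    # (at least one block, even if it is larger than the budget).
    num_keep = max(1, token_budget // block_size)
    ranked = sorted(
        range(len(blocks)), key=lambda b: dot(q, centroids[b]), reverse=True
    )
    selected = sorted(i for b in ranked[:num_keep] for i in blocks[b])
    # Ordinary softmax attention, restricted to the selected tokens.
    scale = 1.0 / math.sqrt(d)
    logits = [dot(q, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, selected)) / total
        for j in range(len(values[0]))
    ]
    return out, selected


rng = random.Random(0)
n_tokens, head_dim = 64, 8
keys = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
values = [[rng.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_tokens)]
q = [rng.gauss(0, 1) for _ in range(head_dim)]

# Hypothetical per-head block sizes under the same 16-token budget:
# head 0 picks four fine-grained blocks, head 1 picks one coarse block.
for head, block_size in enumerate([4, 16]):
    out, selected = block_sparse_attention(
        q, keys, values, block_size, token_budget=16
    )
    print(f"head {head}: block_size={block_size}, tokens loaded={len(selected)}")
```

Both heads load the same number of tokens, but the smaller block size needs four times as many centroids to be stored and scored. That trade-off is consistent with the abstract's remark that adaptive block sizes add memory overhead, which AB-Sparse offsets with centroid quantization.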
Problem

Research questions and friction points this paper is trying to address.

sparse attention
block size
attention heads
long-context inference
KV cache
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive block size
sparse attention
KV cache optimization
training-free
GPU kernel