Sparser Block-Sparse Attention via Token Permutation

📅 2025-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing block-sparse attention methods suffer from insufficient sparsity and computational redundancy because their fixed, non-adaptive block partitioning cannot match the underlying attention patterns, while standard self-attention incurs prohibitive O(N²) complexity for long-context reasoning in large language models (LLMs). To address this, we propose Permuted Block-Sparse Attention (PBS-Attn), which permutes the token sequence so that salient query-key interactions are concentrated within fewer attention blocks, increasing block-level sparsity and allowing more blocks to be skipped. PBS-Attn integrates custom permuted-FlashAttention kernels to accelerate sparse attention computation. Experiments on long-context benchmarks demonstrate that PBS-Attn achieves near-full-attention accuracy while delivering up to a 2.75× end-to-end prefilling speedup, significantly outperforming state-of-the-art block-sparse approaches.

📝 Abstract
Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (\textbf{PBS-Attn}), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
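The core intuition is easy to see on a toy example. When the keys that matter for a block of queries are scattered across the sequence, nearly every attention block contains at least one important entry, so no block can be skipped; reordering tokens so that related ones sit together concentrates those entries into a few blocks. The sketch below illustrates this with a synthetic importance mask and an oracle permutation (the `groups` structure and the argsort-based permutation are illustrative assumptions; the actual PBS-Attn method derives its permutation from the model's attention statistics, not from known labels):

```python
import numpy as np

N, B = 16, 4                  # sequence length, block size
groups = np.arange(N) % 4     # hidden structure: 4 interleaved token groups

# Toy "importance" mask: assume a query-key pair matters iff the two
# tokens share a group. In the original order the groups are interleaved.
important = groups[:, None] == groups[None, :]

def active_blocks(mask, block):
    """Count B x B attention blocks containing at least one important entry."""
    nb = mask.shape[0] // block
    blocks = mask.reshape(nb, block, nb, block)
    return int(blocks.any(axis=(1, 3)).sum())

# Interleaved order: every block is touched, so nothing can be skipped.
before = active_blocks(important, B)        # all 16 blocks active

# Permute tokens so same-group tokens become adjacent (an oracle stand-in
# for the learned/estimated permutation used by the paper's method).
perm = np.argsort(groups, kind="stable")
permuted = important[perm][:, perm]
after = active_blocks(permuted, B)          # only the 4 diagonal blocks remain

print(before, after)
```

Under this construction the permutation shrinks the active set from 16 blocks to 4, i.e. 75% of the block computations become skippable without dropping any important query-key interaction.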
Problem

Research questions and friction points this paper is trying to address.

Optimizing computational efficiency of self-attention in LLMs
Enhancing block-sparse attention patterns to reduce redundancy
Accelerating long-context prefilling while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Permuted Block-Sparse Attention increases block-level sparsity
Method leverages attention permutation properties for efficiency
Custom kernels achieve speedup in long-context prefilling