Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing native sparse attention (NSA) kernels rely on large-group grouped-query attention (GQA), making them incompatible with the small-GQA configurations prevalent in modern LLMs—thus limiting sparse acceleration in mainstream models. This work proposes a hardware-aligned, GQA-size-agnostic sparse attention kernel that supports arbitrary GQA group sizes efficiently via query reorganization and fine-grained memory access optimization. Its core innovation lies in decoupling sparse attention computation from GQA group size constraints, thereby improving GPU utilization and compatibility with diverse sparsity patterns. Experiments demonstrate up to 3.5× latency reduction in the attention kernel, 1.09× end-to-end training speedup, and 1.36× prefill-phase acceleration—significantly enhancing both long-context inference and training efficiency.
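The GQA group-size constraint discussed above comes from how query heads share KV heads: with g query heads per KV head, a kernel that loads one group's queries per KV block is efficient only when g is large. The mechanics of GQA grouping can be sketched as below; this is a plain dense reference implementation for illustration only, not the NSA or FSA kernel, and all names here are this sketch's own.

```python
import numpy as np

def gqa_attention(q, k, v, num_kv_heads):
    """Dense grouped-query attention (illustrative reference, not a GPU kernel).

    q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
    Each contiguous group of num_q_heads // num_kv_heads query heads
    shares one KV head -- that quotient is the GQA group size that
    NSA-style kernels depend on and FSA aims to be agnostic to.
    """
    num_q_heads, seq, d = q.shape
    group = num_q_heads // num_kv_heads  # GQA group size
    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group  # KV head shared by this query head's group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[h] = p @ v[kv]
    return out
```

With 8 query heads and 2 KV heads the group size is 4; modern LLMs often use such small groups, which is the regime the paper targets.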

📝 Abstract
Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which includes an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied smaller GQA group sizes on modern GPUs. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5× and on average 1.6× kernel-level latency reduction, (ii) up to 1.25× and on average 1.09× end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36× and on average 1.11× end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
Problem

Research questions and friction points this paper is trying to address.

Improving sparse attention efficiency for small GQA sizes
Enabling native sparse attention on modern GPUs broadly
Overcoming limitations of query-grouping in NSA kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Alternative kernel design for efficient sparse attention
Enables computation across varied smaller GQA groups
Achieves significant latency reduction and speedup gains
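The sparse computation these contributions accelerate is, at its core, attention restricted to a per-query top-k selection of KV blocks. A simplified single-head sketch of that pattern is below; real NSA/FSA kernels fuse this on the GPU and score blocks with learned compressed representations, while this sketch's mean-pooled block scoring, lack of causal masking, and all function names are illustrative assumptions only.

```python
import numpy as np

def topk_block_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Block-sparse attention sketch: each query scores KV blocks,
    keeps the top_k highest-scoring blocks, and attends only to the
    tokens inside them.

    q, k, v: (seq, d) single-head tensors; seq is assumed divisible
    by block_size for simplicity.
    """
    seq, d = q.shape
    n_blocks = seq // block_size
    # Block summaries: mean-pool keys within each block (stand-in for
    # NSA's learned compression).
    k_blocks = k.reshape(n_blocks, block_size, d).mean(axis=1)
    out = np.zeros_like(q)
    for i in range(seq):
        block_scores = q[i] @ k_blocks.T              # (n_blocks,)
        sel = np.argsort(block_scores)[-top_k:]       # top-k block indices
        idx = np.concatenate(
            [np.arange(b * block_size, (b + 1) * block_size) for b in sel]
        )
        s = q[i] @ k[idx].T / np.sqrt(d)
        s -= s.max()                                  # stable softmax
        p = np.exp(s)
        p /= p.sum()
        out[i] = p @ v[idx]
    return out
```

When top_k covers all blocks this reduces to dense attention; the speedups reported above come from top_k being much smaller than the number of blocks at long context lengths.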
Ran Yan
University of California, Los Angeles
Youhe Jiang
The Hong Kong University of Science and Technology
Binhang Yuan
The Hong Kong University of Science and Technology