BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the computational and memory bottlenecks of standard attention in long-context reasoning for large language models (LLMs), this paper proposes a plug-and-play dynamic sparse attention method. Unlike prior approaches, it requires no precomputation or proxy scores; instead, it identifies and skips negligible attention terms *online* during Softmax computation using an adaptive threshold inversely proportional to context length, enabling zero-overhead block-level pruning. The method is fully compatible with MHA, GQA, MQA, and MLA architectures, supports both prefill and decode phases, and integrates seamlessly with FlashAttention kernels and sparse-aware training. Experiments show a 1.62× speedup during prefill (74.7% sparsity) and a 1.48× speedup during decode (73.2% sparsity), with negligible accuracy degradation. Furthermore, sparse-aware training extends the accuracy–sparsity Pareto frontier.

📝 Abstract
The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.
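The pruning rule described in the abstract — using information already available in online softmax to skip negligible Value blocks — can be sketched as a toy single-query loop. This is an illustrative NumPy sketch under assumed names and parameters (the function, block size, and threshold value are not from the paper, and the real method runs inside a FlashAttention kernel): a block is skipped when even its largest unnormalized weight `exp(s_max - m)` falls below the threshold, so its softmax computation, Value load, and matmul are all avoided.

```python
import numpy as np

def blocked_softmax_attention(q, K, V, block=64, tau=1e-3):
    """Online-softmax attention over key/value blocks for one query vector,
    skipping any block whose maximum unnormalized weight exp(s - m) < tau.
    Hypothetical sketch, not the paper's kernel."""
    d = q.shape[-1]
    m = -np.inf          # running max of attention scores
    l = 0.0              # running sum of exp(score - m)
    acc = np.zeros(d)    # running weighted sum of value rows
    skipped = 0
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)           # scores for this key block
        m_new = max(m, s.max())
        # prune: even the block's largest weight is negligible
        if np.exp(s.max() - m_new) < tau:
            skipped += 1                  # V load and P@V matmul avoided
            continue
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l, skipped
```

Note that a skipped block never raises the running max (its largest weight would then be 1, above any threshold below 1), so the rescaling arithmetic for retained blocks stays exact; with `tau=0` the loop reproduces dense attention.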
Problem

Research questions and friction points this paper is trying to address.

Standard attention creates computational and memory bottlenecks for long-context LLM inference
Prior sparse attention methods depend on pre-computation or proxy scores
Both prefill and decode stages must be accelerated without sacrificing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic pruning of attention matrix using softmax thresholding
Seamless integration into FlashAttention kernels with minimal overhead
Automated calibration for optimal threshold across varying context lengths
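The calibration finding above — that the optimal threshold is inversely proportional to context length — can be sketched as fitting a single constant `c` in `tau(L) = c / L` from a few calibrated points. The fitting helper and the calibration pairs below are hypothetical illustrations, not values from the paper.

```python
import math

def fit_inverse_threshold(calibration):
    """Fit tau(L) = c / L to (context_length, optimal_tau) pairs by taking
    the geometric mean of c = tau * L. Illustrative sketch; the constant
    and the pairs used below are assumptions, not the paper's numbers."""
    logs = [math.log(tau * L) for L, tau in calibration]
    c = math.exp(sum(logs) / len(logs))
    return lambda L: c / L

# Hypothetical calibration: longer contexts tolerate smaller thresholds.
tau_of = fit_inverse_threshold([(4096, 5e-4), (16384, 1.25e-4), (65536, 3.1e-5)])
```

Once fitted, a single constant generalizes across context lengths, which is what makes the threshold robust to deploy without per-scenario tuning.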
👥 Authors
Jiayi Yuan (Rice University)
Cameron Shinn (University of California, Davis, California, USA)
Kai Xu (NVIDIA, Santa Clara, California, USA)
Jingze Cui (NVIDIA, Santa Clara, California, USA)
George Klimiashvili (NVIDIA, Santa Clara, California, USA)
Guangxuan Xiao (Ph.D. candidate, MIT)
Perkz Zheng (NVIDIA, Santa Clara, California, USA)
Bo Li (NVIDIA, Santa Clara, California, USA)
Yuxin Zhou (University of California, Riverside)
Zhouhai Ye (NVIDIA, Santa Clara, California, USA)
Weijie You (NVIDIA, Santa Clara, California, USA)
Tian Zheng (NVIDIA, Santa Clara, California, USA)
Dominic Brown (NVIDIA, Santa Clara, California, USA)
Pengbo Wang (NVIDIA, Santa Clara, California, USA)
Richard Cai (NVIDIA, Santa Clara, California, USA)
Julien Demouth (NVIDIA, Santa Clara, California, USA)
John D. Owens (University of California, Davis, California, USA)
Xia Hu (Google DeepMind)
Song Han (NVIDIA, Santa Clara, California, USA)
Timmy Liu (NVIDIA, Santa Clara, California, USA)
Huizi Mao (OmniML, Inc.)