SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

📅 2026-02-13
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing sparse attention methods struggle to maintain generation quality and computational efficiency simultaneously at high sparsity, and training-free strategies face inherent performance limits. This work proposes a trainable sparse attention mechanism that integrates Top-k and Top-p masking strategies for more robust masking. Combined with an efficient implementation and a distillation-inspired fine-tuning objective, the method preserves generation fidelity. Evaluated on video diffusion models, it achieves 95% attention sparsity and a 16.2× attention speedup, substantially outperforming prior state-of-the-art approaches.

๐Ÿ“ Abstract
Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
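To make the hybrid masking rule from the abstract concrete, here is a minimal sketch of how a Top-k rule and a Top-p (nucleus-style) rule over attention-block scores can be combined by taking the union of the blocks each rule keeps. The function name, the softmax-over-block-scores formulation, and the union as the combination rule are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def hybrid_mask(scores, k, p):
    """Illustrative hybrid Top-k + Top-p block mask (assumed combination: union).

    A block is kept if it is among the k highest-scoring blocks (Top-k)
    OR lies in the smallest prefix of blocks, sorted by descending softmax
    probability, whose cumulative mass reaches p (Top-p).
    """
    scores = np.asarray(scores, dtype=np.float64)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # block indices, highest prob first
    topk = set(order[:k].tolist())           # Top-k rule

    cum = np.cumsum(probs[order])
    n_p = int(np.searchsorted(cum, p) + 1)   # smallest prefix with mass >= p
    topp = set(order[:n_p].tolist())         # Top-p rule

    mask = np.zeros(len(scores), dtype=bool)
    mask[list(topk | topp)] = True           # hybrid: union of both rules
    return mask
```

Under this sketch, Top-k guards against over-pruning when the score distribution is flat (Top-p alone would keep too few blocks), while Top-p guards against under- or over-selection when the distribution is peaked, which matches the robustness motivation given in the abstract.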
Problem

Research questions and friction points this paper is trying to address.

sparse attention
diffusion models
trainable sparsity
attention acceleration
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

trainable sparse attention
hybrid masking
distillation fine-tuning
diffusion models
attention sparsity