SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

📅 2026-02-13
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing sparse attention methods struggle to maintain generation quality and computational efficiency simultaneously at high sparsity, and training-free strategies face inherent performance limits. This work proposes a trainable sparse attention mechanism that integrates Top-k and Top-p masking strategies for more robust masking. Combined with an efficient implementation and a distillation-inspired fine-tuning objective, the method preserves generation fidelity. Evaluated on video diffusion models, it achieves 95% attention sparsity and a 16.2× attention speedup, substantially outperforming prior state-of-the-art approaches.

๐Ÿ“ Abstract
Many training-free sparse attention methods are effective for accelerating diffusion models. Recently, several works suggest that making sparse attention trainable can further increase sparsity while preserving generation quality. We study three key questions: (1) when do the two common masking rules, i.e., Top-k and Top-p, fail, and how can we avoid these failures? (2) why can trainable sparse attention reach higher sparsity than training-free methods? (3) what are the limitations of fine-tuning sparse attention using the diffusion loss, and how can we address them? Based on this analysis, we propose SpargeAttention2, a trainable sparse attention method that achieves high sparsity without degrading generation quality. SpargeAttention2 includes (i) a hybrid masking rule that combines Top-k and Top-p for more robust masking at high sparsity, (ii) an efficient trainable sparse attention implementation, and (iii) a distillation-inspired fine-tuning objective to better preserve generation quality during fine-tuning using sparse attention. Experiments on video diffusion models show that SpargeAttention2 reaches 95% attention sparsity and a 16.2x attention speedup while maintaining generation quality, consistently outperforming prior sparse attention methods.
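To make the hybrid masking rule from the abstract concrete, here is a minimal sketch of how a Top-k rule and a Top-p (nucleus-style) rule over attention-block scores can be combined by taking the union of the blocks each rule keeps. The function name, the softmax-over-block-scores formulation, and the union as the combination rule are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def hybrid_mask(scores, k, p):
    """Illustrative hybrid Top-k + Top-p block mask (assumed combination: union).

    A block is kept if it is among the k highest-scoring blocks (Top-k)
    OR lies in the smallest prefix of blocks, sorted by descending softmax
    probability, whose cumulative mass reaches p (Top-p).
    """
    scores = np.asarray(scores, dtype=np.float64)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # block indices, highest prob first
    topk = set(order[:k].tolist())           # Top-k rule

    cum = np.cumsum(probs[order])
    n_p = int(np.searchsorted(cum, p) + 1)   # smallest prefix with mass >= p
    topp = set(order[:n_p].tolist())         # Top-p rule

    mask = np.zeros(len(scores), dtype=bool)
    mask[list(topk | topp)] = True           # hybrid: union of both rules
    return mask
```

Under this sketch, Top-k guards against over-pruning when the score distribution is flat (Top-p alone would keep too few blocks), while Top-p guards against under- or over-selection when the distribution is peaked, which matches the robustness motivation given in the abstract.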
Problem

Research questions and friction points this paper is trying to address.

sparse attention
diffusion models
trainable sparsity
attention acceleration
generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

trainable sparse attention
hybrid masking
distillation fine-tuning
diffusion models
attention sparsity