🤖 AI Summary
To address the quadratic computational complexity and high latency of diffusion transformers (DiTs) in video generation, caused by long spatiotemporal sequences, this work proposes a fine-grained, tunable sparse-linear hybrid attention mechanism. For the first time, attention weights are dynamically partitioned into three categories—critical (retaining full O(N²) computation), marginal (approximated via O(N) linear attention), and negligible (skipped entirely)—with differentiable gating enabling end-to-end joint optimization. A unified GPU kernel efficiently fuses the sparse and linear (low-rank) attention computations in a single pass. The method requires only lightweight fine-tuning yet achieves a 95% reduction in attention FLOPs and a 2.2× speedup in end-to-end video generation, while preserving fidelity as measured by FVD and FID.
📝 Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to long sequence lengths and the quadratic complexity of attention. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
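The three-way split described above can be illustrated with a small dense sketch: classify each attention weight as critical (exact softmax weight kept), marginal (replaced by a cheap linear-attention surrogate), or negligible (dropped). This is a toy NumPy reference for intuition only, not the paper's fused GPU kernel; the function name, thresholds, and the elu+1 feature map are illustrative assumptions, and a real O(N) linear-attention path would never materialize the full N×N matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sla_sketch(Q, K, V, crit_frac=0.05, neg_thresh=1e-4):
    """Toy dense illustration of a sparse-linear attention split.

    Per row, the top crit_frac of softmax weights are "critical"
    (kept exactly), weights below neg_thresh are "negligible"
    (skipped), and the rest are "marginal" (approximated by a
    normalized linear-attention surrogate).
    """
    N, d = Q.shape
    P = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # full weights, for illustration

    # Critical mask: largest crit_frac weights in each row.
    k = max(1, int(crit_frac * N))
    crit_mask = np.zeros_like(P, dtype=bool)
    top_idx = np.argsort(-P, axis=-1)[:, :k]
    np.put_along_axis(crit_mask, top_idx, True, axis=-1)

    neg_mask = (~crit_mask) & (P < neg_thresh)   # skipped entirely
    marg_mask = ~(crit_mask | neg_mask)          # linear approximation

    # Critical part: exact softmax weights.
    out = (P * crit_mask) @ V

    # Marginal part: cheap surrogate phi(Q) phi(K)^T with phi = elu+1,
    # row-normalized; restricted to the marginal positions.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    L = phi(Q) @ phi(K).T
    L = L / L.sum(axis=-1, keepdims=True)
    out += (L * marg_mask) @ V
    return out
```

In the real method the critical blocks are computed with a sparse O(N^2)-style kernel, the marginal mass flows through an O(N) linear-attention path, and both are fused into one kernel with forward and backward support; the classification itself is made trainable so the model can be fine-tuned end to end.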