🤖 AI Summary
To address the quadratic computational complexity and high latency of diffusion transformers (DiTs) in video generation, caused by long spatiotemporal sequences, this work proposes a fine-grained, tunable sparse-linear hybrid attention mechanism. For the first time, attention weights are dynamically partitioned into three categories—critical (retaining full O(N²) computation), marginal (approximated via O(N) linear attention), and negligible (skipped entirely)—with differentiable gating enabling end-to-end joint optimization. A unified GPU kernel efficiently fuses the sparse and linear (low-rank) attention computations in a single pass. The method requires only lightweight fine-tuning yet achieves a 95% reduction in attention FLOPs and a 2.2× speedup in end-to-end video generation, while preserving fidelity as measured by FVD and FID.
📝 Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to long sequence lengths and the quadratic complexity of attention. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
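The three-way split described above can be illustrated with a small dense sketch: classify each attention weight as critical (exact softmax weight kept), marginal (replaced by a cheap linear-attention surrogate), or negligible (dropped). This is a toy NumPy reference for intuition only, not the paper's fused GPU kernel; the function name, thresholds, and the elu+1 feature map are illustrative assumptions, and a real O(N) linear-attention path would never materialize the full N×N matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sla_sketch(Q, K, V, crit_frac=0.05, neg_thresh=1e-4):
    """Toy dense illustration of a sparse-linear attention split.

    Per row, the top crit_frac of softmax weights are "critical"
    (kept exactly), weights below neg_thresh are "negligible"
    (skipped), and the rest are "marginal" (approximated by a
    normalized linear-attention surrogate).
    """
    N, d = Q.shape
    P = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # full weights, for illustration

    # Critical mask: largest crit_frac weights in each row.
    k = max(1, int(crit_frac * N))
    crit_mask = np.zeros_like(P, dtype=bool)
    top_idx = np.argsort(-P, axis=-1)[:, :k]
    np.put_along_axis(crit_mask, top_idx, True, axis=-1)

    neg_mask = (~crit_mask) & (P < neg_thresh)   # skipped entirely
    marg_mask = ~(crit_mask | neg_mask)          # linear approximation

    # Critical part: exact softmax weights.
    out = (P * crit_mask) @ V

    # Marginal part: cheap surrogate phi(Q) phi(K)^T with phi = elu+1,
    # row-normalized; restricted to the marginal positions.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    L = phi(Q) @ phi(K).T
    L = L / L.sum(axis=-1, keepdims=True)
    out += (L * marg_mask) @ V
    return out
```

In the real method the critical blocks are computed with a sparse O(N^2)-style kernel, the marginal mass flows through an O(N) linear-attention path, and both are fused into one kernel with forward and backward support; the classification itself is made trainable so the model can be fine-tuned end to end.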