🤖 AI Summary
Diffusion Transformers (DiTs) suffer from the quadratic O(n²) cost of self-attention, which hinders scalable long-sequence image generation. Existing Top-K sparsity methods remain limited by quadratic selection overhead on compressed tokens and by a K that inflates with sequence length, both rooted in single-level coarse-grained modeling that cannot simultaneously preserve global structure and local detail. To address this, we propose Log-Linear Sparse Attention (LLSA), a trainable sparse attention mechanism that is the first to integrate hierarchical Top-K selection with Hierarchical KV Enrichment, achieving O(n log n) attention complexity under end-to-end pixel-space training. The approach comprises: (i) hierarchical sparse indexing, (ii) trainable Top-K selection, (iii) hierarchical KV compression and reconstruction, and (iv) a dense-mask-free, fully sparse GPU kernel. On 256×256 images, LLSA delivers a 28.27× attention inference speedup and 6.09× training acceleration with no quality degradation, enabling, for the first time, efficient high-resolution, pixel-level DiT training on long sequences.
📝 Abstract
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computational cost of DiTs by compressing tokens into block-wise representations and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on the compressed tokens and (ii) a K that must grow to maintain model quality as sequences lengthen. We identify the single-level design as the source of this inefficiency: a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-Linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by exploiting a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively applying sparse Top-K selection guided by the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of varying granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices in both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without patchification or VAE encoding. LLSA accelerates attention inference by 28.27× and DiT training by 6.09× on 256×256 pixel token sequences while maintaining generation quality. These results demonstrate that LLSA offers a promising direction for efficiently training long-sequence DiTs. Code is available at: https://github.com/SingleZombie/LLSA
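To make the hierarchical Top-K idea concrete, below is a minimal NumPy sketch for a single query vector. It is an illustrative assumption, not the paper's implementation (which uses trainable selection and a fully sparse GPU kernel): keys are mean-pooled into a pyramid of coarser levels, Top-K is run at the coarsest level, and each finer level only scores the children of the blocks selected above it, so the total number of scored candidates grows log-linearly rather than quadratically.

```python
import numpy as np

def build_pyramid(keys, block=2, levels=4):
    """Mean-pool keys into progressively coarser levels.

    Returns [finest, ..., coarsest]; pooling scheme is an
    illustrative assumption, not the paper's learned compression.
    """
    pyr = [keys]
    for _ in range(levels - 1):
        k = pyr[-1]
        n = (len(k) // block) * block  # drop any ragged tail
        pyr.append(k[:n].reshape(-1, block, k.shape[-1]).mean(axis=1))
    return pyr

def hierarchical_topk(query, pyramid, k=2, block=2):
    """Select Top-K fine-level key indices for one query.

    At each level only the children of the previously selected
    blocks are scored, so per-level work stays O(k * block).
    """
    cand = np.arange(len(pyramid[-1]))  # all blocks at coarsest level
    for level in range(len(pyramid) - 1, 0, -1):
        scores = pyramid[level][cand] @ query
        top = cand[np.argsort(-scores)[:k]]
        # expand the selected parents to their children one level down
        children = (top[:, None] * block + np.arange(block)).ravel()
        cand = children[children < len(pyramid[level - 1])]
    scores = pyramid[0][cand] @ query
    return cand[np.argsort(-scores)[:k]]

# toy example: one key strongly matches the query
keys = np.zeros((16, 4))
keys[5] = [10.0, 0.0, 0.0, 0.0]
query = np.array([1.0, 0.0, 0.0, 0.0])
selected = hierarchical_topk(query, build_pyramid(keys))
```

Note that this hard Top-K is non-differentiable; making the selection trainable end-to-end, and enriching the selected fine keys with coarser-level KV context, are the parts LLSA adds on top of this basic recursion.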