🤖 AI Summary
Diffusion Transformers (DiTs) suffer from quadratic computational complexity in self-attention with respect to spatial resolution and temporal length, severely hindering efficient high-fidelity video/image generation. This work proposes a hyper-sparse visual generation framework that retains only 3.1% of tokens while preserving full-attention quality. Our method introduces two key innovations: (i) a novel attention score redistribution mechanism grounded in historical softmax distributions, leveraging temporal redundancy in the diffusion process to correct probability normalization bias induced by sparsification; and (ii) architecture-aware sparse attention design, tightly coupled with DiT backbones (e.g., CogVideoX and PixArt). Experiments demonstrate that our approach reduces end-to-end latency by 45% and self-attention latency by 92% on NVIDIA H100 GPUs, with negligible additional computational overhead.
📝 Abstract
Diffusion Transformers (DiTs) have become the de-facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. A natural way to lessen this burden is sparse attention, where only a subset of tokens or patches is included in the computation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads. To address this, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen, and MInference. Furthermore, latency measurements show that our method attains over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available at: [https://github.com/cccrrrccc/Re-ttention](https://github.com/cccrrrccc/Re-ttention)
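To make the "probabilistic normalization shift" concrete, here is a minimal NumPy sketch for a single query. It shows how a naive sparse softmax over only the kept tokens inflates their weights (they renormalize to sum to 1), and how rescaling with a cached full-softmax denominator restores the full-attention scale of the kept weights. The `denom_cached` variable is a stand-in assumption: the actual Re-ttention mechanism reuses softmax statistics from earlier diffusion steps, whose exact form is not specified in this abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 64, 256
q = rng.standard_normal((1, d))   # one query token
k = rng.standard_normal((n, d))   # n key tokens

scores = (q @ k.T) / np.sqrt(d)   # (1, n) attention logits
full = softmax(scores)            # full quadratic attention weights

# keep only ~3.1% of tokens (highest-scoring ones, for illustration)
m = max(1, int(0.031 * n))
idx = np.argsort(scores[0])[-m:]

# naive sparse softmax renormalizes over the kept subset only, so the
# kept weights are inflated relative to full attention -- this is the
# normalization shift induced by sparsification
naive = softmax(scores[0, idx])

# hypothetical correction: divide by a cached full-softmax denominator
# (here computed from the current step's scores as a stand-in for the
# historical statistics a previous diffusion step would provide)
denom_cached = np.exp(scores - scores.max()).sum()
corrected = np.exp(scores[0, idx] - scores.max()) / denom_cached

# corrected weights match the full-attention weights on the kept tokens
assert np.allclose(corrected, full[0, idx])
```

In this toy setting the correction is exact because the "cached" denominator comes from the same step; the premise of the paper, as stated in the abstract, is that temporal redundancy across diffusion steps makes historical statistics a good enough proxy.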