Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from quadratic computational complexity in self-attention with respect to spatial resolution and temporal length, severely hindering efficient high-fidelity video and image generation. This work proposes Re-ttention, an ultra-sparse visual generation framework that retains as few as 3.1% of tokens while preserving full-attention quality. The method introduces two key innovations: (i) an attention score redistribution mechanism grounded in historical softmax distributions, which leverages the temporal redundancy of the diffusion process to correct the probability normalization bias induced by sparsification; and (ii) an architecture-aware sparse attention design, tightly coupled with DiT backbones (e.g., CogVideoX and PixArt). Experiments demonstrate that the approach reduces end-to-end latency by over 45% and self-attention latency by over 92% on NVIDIA H100 GPUs, with negligible additional computational overhead.
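The normalization bias in (i) can be seen in a toy NumPy example: renormalizing the softmax over only the kept tokens inflates their probabilities, and rescaling by the mass those tokens held under a cached full softmax (here `p_full` stands in for the history from a previous diffusion step; this is an illustration of the idea, not the paper's exact rule) recovers the full-attention values on the kept set.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=64)          # toy attention logits for one query

p_full = softmax(scores)              # full-attention distribution
keep = np.argsort(scores)[-8:]        # keep only the top ~12% of tokens

# Naive sparse softmax renormalizes over the kept subset alone,
# so each kept probability is inflated (the normalization shift).
p_sparse = softmax(scores[keep])

# Correction: rescale by the probability mass the kept tokens carried
# under the cached full softmax (the "history" in this sketch).
kept_mass_history = p_full[keep].sum()
p_corrected = p_sparse * kept_mass_history

# p_corrected now matches the full-attention probabilities on the kept set.
assert np.allclose(p_corrected, p_full[keep])
```

The rescaling is exact here because the ratio of exponentials is unchanged by restricting the softmax to a subset; in the actual method the full denominator is not available at the current step, which is why a historical estimate is used.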

📝 Abstract
Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One natural way to lessen this burden is sparse attention, where only a subset of tokens or patches is included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparsity for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen and MInference. Further, latency measurements show that our method attains over 45% end-to-end and over 92% self-attention latency reduction on an H100 GPU at negligible overhead cost. Code is available at https://github.com/cccrrrccc/Re-ttention
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of attention in visual generation
Maintains quality at high sparsity levels in sparse attention
Overcomes normalization shift in diffusion model attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal redundancy for sparse attention
Reshapes attention scores using prior softmax history
Achieves high sparsity with negligible overhead cost
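A back-of-envelope check relates the reported numbers (assuming a simple cost model where per-query attention FLOPs scale with the number of kept key tokens; the exact kernel costs are not given in this summary):

```python
# At 3.1% token retention, the FLOP-level bound on attention savings
# is 1 - 0.031 = 96.9%; the measured 92% self-attention latency
# reduction sits below this bound, consistent with memory traffic
# and kernel overheads that do not shrink with FLOPs.
keep_ratio = 0.031
flops_saved = 1 - keep_ratio
print(f"{flops_saved:.1%}")   # → 96.9%
```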