🤖 AI Summary
This work identifies a temporal division of labor between cross-attention and self-attention in text-conditioned diffusion models: cross-attention converges rapidly in early denoising steps, while self-attention dominates fine-grained detail modeling in later steps. Leveraging this observation, we propose Temporal Gated Attention (TGATE), a training-free, plug-and-play method that caches and reuses cross-attention outputs from critical early steps to eliminate redundant computation. TGATE requires no architectural modification or model retraining and is compatible with mainstream text-to-image diffusion models. Experiments demonstrate 10–50% inference speedup across multiple state-of-the-art models—including Stable Diffusion v1.5, SDXL, and PixArt-α—without compromising generation quality, as validated by both quantitative metrics (FID, CLIP-Score) and human evaluation. The implementation is publicly available.
📝 Abstract
We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter, whereas self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method, temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates them by 10%–50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
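The gating idea described above can be sketched in a few lines: recompute cross-attention only during the early semantics-planning steps, cache the output at the gate step, and reuse that cache for all later fidelity-improving steps. The class and function names below are hypothetical illustrations, not the actual API of the T-GATE repository, which patches attention modules inside existing diffusion pipelines.

```python
class GatedCrossAttention:
    """Minimal sketch of TGATE-style temporal gating (hypothetical names).

    Wraps a cross-attention callable; before `gate_step` denoising steps have
    elapsed the attention is computed normally, and afterwards the cached
    output from the last planning step is reused, skipping the computation.
    """

    def __init__(self, attn_fn, gate_step):
        self.attn_fn = attn_fn      # underlying cross-attention callable
        self.gate_step = gate_step  # number of steps in the planning phase
        self.cache = None           # cached cross-attention output

    def __call__(self, hidden_states, text_embeds, step):
        if step < self.gate_step or self.cache is None:
            out = self.attn_fn(hidden_states, text_embeds)
            if step == self.gate_step - 1:
                # Last planning step: freeze the cross-attention output.
                self.cache = out
            return out
        # Fidelity-improving phase: reuse the cached output,
        # eliminating the redundant cross-attention computation.
        return self.cache
```

Because the cache is filled once and read thereafter, the per-step cost of cross-attention drops to a lookup after the gate step, which is where the reported 10%–50% speedup comes from in models whose attention layers dominate inference time.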