Faster Diffusion via Temporal Attention Decomposition

📅 2024-04-03
📈 Citations: 11
Influential: 2
📄 PDF
🤖 AI Summary
This work identifies a temporal division of labor between cross-attention and self-attention in text-conditioned diffusion models: cross-attention converges rapidly in early denoising steps, while self-attention dominates fine-grained detail modeling in later steps. Leveraging this observation, we propose Temporal Gated Attention (TGATE), a training-free, plug-and-play method that caches and reuses cross-attention outputs from critical early steps to eliminate redundant computation. TGATE requires no architectural modification or model retraining and is compatible with mainstream text-to-image diffusion models. Experiments demonstrate 10–50% inference speedup across multiple state-of-the-art models—including Stable Diffusion v1.5, SDXL, and PixArt-α—without compromising generation quality, as validated by both quantitative metrics (FID, CLIP-Score) and human evaluation. The implementation is publicly available.

📝 Abstract
We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter, whereas self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when widely applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%–50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
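The caching idea the abstract describes can be sketched in a few lines of Python. This is a minimal illustration under assumed names, not the authors' implementation (see the linked repository for that): `TGateCrossAttention`, `attention_fn`, and `gate_step` are all hypothetical.

```python
class TGateCrossAttention:
    """Sketch of TGATE-style temporal gating for cross-attention.

    Before `gate_step`, the wrapped cross-attention is computed and its
    output cached; from `gate_step` onward the cached (converged) output
    is reused, skipping the redundant recomputation. All names here are
    illustrative, not the authors' actual API.
    """

    def __init__(self, attention_fn, gate_step):
        self.attention_fn = attention_fn  # the real cross-attention call
        self.gate_step = gate_step        # denoising step at which reuse begins
        self.cache = None
        self.compute_calls = 0            # counts actual attention computations

    def __call__(self, hidden_states, text_embeds, step):
        if step < self.gate_step or self.cache is None:
            # Initial (semantics-planning) phase: compute and cache.
            self.cache = self.attention_fn(hidden_states, text_embeds)
            self.compute_calls += 1
        # Fidelity-improving phase: return the cached, converged output.
        return self.cache


# Toy denoising loop: 10 steps, gate at step 3, with a stand-in attention_fn.
gated = TGateCrossAttention(
    attention_fn=lambda h, t: [x + y for x, y in zip(h, t)],
    gate_step=3,
)
outputs = [gated([1.0, 2.0], [0.5, 0.5], step=s) for s in range(10)]
```

In the toy loop above, cross-attention is actually computed only three times across ten steps; the remaining seven steps reuse the cached output, which is the source of the reported 10–50% speedup when the skipped computation is a real attention layer.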
Problem

Research questions and friction points this paper is trying to address.

Optimizing attention in diffusion models
Accelerating image generation process
Enhancing efficiency without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal gating optimizes attention
Cross-attention initializes visual semantics
Self-attention enhances image fidelity
Wentian Zhang
AI Initiative, King Abdullah University of Science And Technology (KAUST)

Haozhe Liu
KAUST
Computer Vision · Reinforcement Learning · Multimodal · Image/Video Generation

Jinheng Xie
National University of Singapore
Deep Learning · Computer Vision · Generative AI

Francesco Faccio
Senior Research Scientist, Google DeepMind
Reinforcement Learning · Deep Learning · Neural Networks

Mike Zheng Shou
Show Lab, National University of Singapore (NUS)

Jürgen Schmidhuber
Swiss AI Lab, IDSIA, USI & SUPSI, Lugano, Switzerland