🤖 AI Summary
This work challenges the strong reliance of diffusion models on self-attention mechanisms and, for the first time, systematically reveals a pronounced locality pattern in the attention maps of pre-trained diffusion models. Motivated by this, we propose a novel "attention distillation to convolution" paradigm: we design a pyramid-style convolutional module (ΔConvBlock) that performs inter-layer attention-pattern analysis and structured distillation to losslessly transfer self-attention functionality to lightweight convolutional structures. Coupled with a frozen fine-tuning strategy, our approach preserves high-fidelity image generation, achieving visual quality comparable to Transformer-based baselines. Experiments demonstrate a 6,929× reduction in computational cost and a 5.42× speedup in inference latency over LinFusion. Our method establishes a scalable, low-overhead pathway for efficient diffusion modeling without sacrificing perceptual quality.
📝 Abstract
Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm commonly employs self-attention, with its quadratic computational complexity, to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6,929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
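The key observation underlying the abstract, that a localized attention pattern behaves like a convolutional inductive bias, can be illustrated with a toy sketch. This is not the paper's code; the function names and the 1-D setting are illustrative assumptions. It shows that when attention weights depend only on the relative offset within a small window (a banded, translation-invariant attention map), attending is exactly equivalent to applying a fixed convolution kernel:

```python
# Hedged toy sketch (not the ΔConvFusion implementation): a localized,
# translation-invariant attention map A[i, j] = profile[j - i + r] acts
# identically to a zero-padded 1-D convolution with that profile as kernel.

def local_attention(x, profile):
    """Attention whose weights depend only on the offset j - i (banded map)."""
    r = len(profile) // 2
    out = []
    for i in range(len(x)):
        s = 0.0
        for j in range(max(0, i - r), min(len(x), i + r + 1)):
            s += profile[j - i + r] * x[j]  # weight from the shared profile
        out.append(s)
    return out

def conv1d(x, kernel):
    """Zero-padded 1-D cross-correlation with a fixed kernel."""
    r = len(kernel) // 2
    xp = [0.0] * r + list(x) + [0.0] * r  # pad to mirror the banded boundaries
    return [sum(kernel[m] * xp[i + m] for m in range(len(kernel)))
            for i in range(len(x))]

x = [0.3, -1.2, 0.7, 2.0, -0.5, 0.0, 1.1, -0.9]
profile = [0.1, 0.2, 0.4, 0.2, 0.1]  # attention mass concentrated near offset 0
attn = local_attention(x, profile)
conv = conv1d(x, profile)
assert all(abs(a - b) < 1e-12 for a, b in zip(attn, conv))
```

In this idealized case the attention layer carries no global information at all, which is the degenerate limit of the locality the paper reports; distilling such patterns into convolutions then loses nothing by construction.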