🤖 AI Summary
DiT-based diffusion models suffer from the quadratic computational complexity of self-attention, which hinders high-resolution image generation and deployment on resource-constrained devices. To address this, the authors propose the Efficient Diffusion Transformer (EDiT) and its multimodal variant (MM-EDiT), featuring: (1) a novel linear compressed attention mechanism that spatially aggregates keys and values and modulates queries with local information via a multi-layer convolutional network, achieving O(N) complexity; (2) a hybrid attention scheme in which image-to-image interactions use linear attention for efficiency, while interactions involving the text prompt retain standard scaled dot-product attention to preserve cross-modal alignment; and (3) knowledge distillation for further compression. Integrated into PixArt-Sigma (a conventional DiT) and Stable Diffusion 3.5-Medium (an MM-DiT), these architectures achieve up to 2.2× inference speedup with negligible degradation in FID and CLIP-Score, improving both high-resolution generation efficiency and practical deployability.
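To make the linear-complexity idea concrete, the following is a minimal numpy sketch of attention with spatially aggregated keys and values: pooling K and V down to a fixed number of tokens M makes the score matrix N×M, so cost grows linearly in the number of query tokens N rather than quadratically. The average-pooling aggregation and the `pool` factor are illustrative assumptions; the paper's learned multi-layer convolutional query modulation is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_compressed_attention(Q, K, V, pool=4):
    """Attention whose cost is linear in the number of query tokens N.

    K and V (shape [N, d]) are spatially aggregated into M = N // pool
    tokens by average pooling (a stand-in for the paper's aggregation),
    so the N x M score matrix costs O(N * M) with M fixed, instead of
    the O(N^2) of full self-attention. With pool=1 this reduces exactly
    to standard scaled dot-product attention.
    """
    N, d = K.shape
    M = N // pool
    K_agg = K[: M * pool].reshape(M, pool, d).mean(axis=1)  # [M, d]
    V_agg = V[: M * pool].reshape(M, pool, d).mean(axis=1)  # [M, d]
    scores = softmax(Q @ K_agg.T / np.sqrt(d), axis=-1)     # [N, M]
    return scores @ V_agg                                   # [N, d]
```

Each query still attends globally, but only over the M pooled tokens; the paper compensates for the lost local detail by modulating queries with a convolutional network.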
📝 Abstract
Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling of attention in DiTs hinders image generation at higher resolutions or on devices with limited resources. This work introduces an Efficient Diffusion Transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multimodal inputs that combines linear attention for image-to-image interactions with standard scaled dot-product attention for interactions involving prompts. Merging these two approaches yields an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (a conventional DiT) and Stable Diffusion 3.5-Medium (an MM-DiT), achieving up to 2.2× speedup with comparable image quality after distillation.
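The hybrid scheme can be sketched as follows: the image-to-image branch uses kernelized linear attention (here with the common elu(x)+1 feature map, an assumption, since the abstract does not specify the paper's exact linear attention), while any branch involving the short text sequence uses standard softmax attention. The 50/50 fusion of the two image-side outputs is also a hypothetical simplification for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phi(x):
    # positive feature map elu(x) + 1, commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def hybrid_attention(Q_img, K_img, V_img, Q_txt, K_txt, V_txt):
    """Hybrid multimodal attention (sketch).

    image->image: kernelized linear attention, O(N) in image tokens.
    image->text and text->all: standard softmax attention, cheap
    because the text sequence length is small.
    """
    d = Q_img.shape[1]
    # image-image branch: (phi(Q) @ (phi(K)^T V)) / (phi(Q) @ sum phi(K))
    Qf, Kf = phi(Q_img), phi(K_img)
    kv = Kf.T @ V_img                        # [d, d], computed once
    z = Kf.sum(axis=0)                       # [d] normalizer terms
    img_img = (Qf @ kv) / (Qf @ z)[:, None]  # [N_img, d]
    # image queries attend to text keys with softmax attention
    img_txt = softmax(Q_img @ K_txt.T / np.sqrt(d), axis=-1) @ V_txt
    # text queries attend to text + image keys with softmax attention
    K_all = np.concatenate([K_txt, K_img], axis=0)
    V_all = np.concatenate([V_txt, V_img], axis=0)
    txt_out = softmax(Q_txt @ K_all.T / np.sqrt(d), axis=-1) @ V_all
    # fuse the two image-side branches (equal weighting is an assumption)
    img_out = 0.5 * (img_img + img_txt)
    return img_out, txt_out
```

Because `kv` and `z` are computed once and reused for every image query, the image-to-image cost is linear in the number of image tokens, which dominates the sequence at high resolutions.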