🤖 AI Summary
DiT-based diffusion models suffer from the quadratic computational complexity of self-attention, which hinders high-resolution image generation and deployment on resource-constrained devices. To address this, the authors propose the Efficient Diffusion Transformer (EDiT) and its multimodal variant (MM-EDiT), featuring: (1) a novel linear compressed attention mechanism that spatially aggregates keys and values and modulates queries with local information via a multi-layer convolutional network, achieving O(N) complexity; (2) a hybrid attention scheme in which image-to-image interactions use linear attention for efficiency, while interactions involving the text prompt retain standard scaled dot-product attention to preserve cross-modal alignment; and (3) knowledge distillation for further compression. Integrated into PixArt-Sigma (a conventional DiT) and Stable Diffusion 3.5-Medium (an MM-DiT), these architectures achieve up to 2.2× inference speedup with negligible degradation in FID and CLIP-Score, improving both high-resolution generation efficiency and practical deployability.
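To make the linear-complexity idea concrete, the following is a minimal numpy sketch of attention with spatially aggregated keys and values: pooling K and V down to a fixed number of tokens M makes the score matrix N×M, so cost grows linearly in the number of query tokens N rather than quadratically. The average-pooling aggregation and the `pool` factor are illustrative assumptions; the paper's learned multi-layer convolutional query modulation is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_compressed_attention(Q, K, V, pool=4):
    """Attention whose cost is linear in the number of query tokens N.

    K and V (shape [N, d]) are spatially aggregated into M = N // pool
    tokens by average pooling (a stand-in for the paper's aggregation),
    so the N x M score matrix costs O(N * M) with M fixed, instead of
    the O(N^2) of full self-attention. With pool=1 this reduces exactly
    to standard scaled dot-product attention.
    """
    N, d = K.shape
    M = N // pool
    K_agg = K[: M * pool].reshape(M, pool, d).mean(axis=1)  # [M, d]
    V_agg = V[: M * pool].reshape(M, pool, d).mean(axis=1)  # [M, d]
    scores = softmax(Q @ K_agg.T / np.sqrt(d), axis=-1)     # [N, M]
    return scores @ V_agg                                   # [N, d]
```

Each query still attends globally, but only over the M pooled tokens; the paper compensates for the lost local detail by modulating queries with a convolutional network.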
📝 Abstract
Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling of attention in DiTs hinders image generation at higher resolutions or on devices with limited resources. This work introduces an Efficient Diffusion Transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multimodal inputs that combines linear attention for image-to-image interactions with standard scaled dot-product attention for interactions involving prompts. Merging these two approaches yields an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (a conventional DiT) and Stable Diffusion 3.5-Medium (an MM-DiT), achieving up to 2.2× speedup with comparable image quality after distillation.
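The hybrid scheme can be sketched as follows: the image-to-image branch uses kernelized linear attention (here with the common elu(x)+1 feature map, an assumption, since the abstract does not specify the paper's exact linear attention), while any branch involving the short text sequence uses standard softmax attention. The 50/50 fusion of the two image-side outputs is also a hypothetical simplification for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phi(x):
    # positive feature map elu(x) + 1, commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def hybrid_attention(Q_img, K_img, V_img, Q_txt, K_txt, V_txt):
    """Hybrid multimodal attention (sketch).

    image->image: kernelized linear attention, O(N) in image tokens.
    image->text and text->all: standard softmax attention, cheap
    because the text sequence length is small.
    """
    d = Q_img.shape[1]
    # image-image branch: (phi(Q) @ (phi(K)^T V)) / (phi(Q) @ sum phi(K))
    Qf, Kf = phi(Q_img), phi(K_img)
    kv = Kf.T @ V_img                        # [d, d], computed once
    z = Kf.sum(axis=0)                       # [d] normalizer terms
    img_img = (Qf @ kv) / (Qf @ z)[:, None]  # [N_img, d]
    # image queries attend to text keys with softmax attention
    img_txt = softmax(Q_img @ K_txt.T / np.sqrt(d), axis=-1) @ V_txt
    # text queries attend to text + image keys with softmax attention
    K_all = np.concatenate([K_txt, K_img], axis=0)
    V_all = np.concatenate([V_txt, V_img], axis=0)
    txt_out = softmax(Q_txt @ K_all.T / np.sqrt(d), axis=-1) @ V_all
    # fuse the two image-side branches (equal weighting is an assumption)
    img_out = 0.5 * (img_img + img_txt)
    return img_out, txt_out
```

Because `kv` and `z` are computed once and reused for every image query, the image-to-image cost is linear in the number of image tokens, which dominates the sequence at high resolutions.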