EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scalability bottleneck of Diffusion Transformers (DiTs) in text-to-image generation, where conventional dense architectures and fixed-routing Mixture-of-Experts (MoE) models struggle to scale efficiently to billion-parameter regimes. The authors propose an adaptive expert-choice MoE architecture designed explicitly for the diffusion process, introducing a fine-grained, end-to-end differentiable dynamic routing mechanism into DiTs for the first time. The method allocates computational resources at inference time based on the complexity of each text–image pair, enabling principled scaling beyond prior limits. The resulting 97B-parameter model achieves a state-of-the-art GenEval score of 71.68% while accelerating training convergence and significantly improving generation quality, text–image alignment, and inference efficiency. Crucially, the routing mechanism is interpretable, providing transparent, instance-aware compute allocation.

📝 Abstract
Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.
Problem

Research questions and friction points this paper is trying to address.

Scaling diffusion transformers for text-to-image synthesis
Optimizing compute allocation with expert-choice routing
Achieving state-of-the-art text-to-image alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Expert-Choice Routing
Mixture-of-Experts Models
Heterogeneous Computation Optimization
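
The expert-choice routing highlighted above inverts conventional token-choice MoE: instead of each token picking its top experts, each expert selects the tokens it will process, so compute naturally concentrates on the patches the router scores as most important. A minimal NumPy sketch of this selection step, assuming a simple linear-plus-softmax router (all names, shapes, and the capacity scheme are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def expert_choice_route(x, w_gate, capacity):
    """Expert-choice routing sketch: each expert picks its top-`capacity` tokens.

    x:        (n_tokens, d_model) token representations
    w_gate:   (d_model, n_experts) learned router weights
    capacity: number of tokens each expert processes
    Returns per-expert token indices and their gating weights.
    """
    # Router affinities: each row is a token's distribution over experts.
    scores = softmax(x @ w_gate, axis=-1)            # (n_tokens, n_experts)
    # Each expert (column) selects the tokens with the highest affinity to it;
    # a token may be chosen by several experts or by none.
    top_idx = np.argsort(-scores, axis=0)[:capacity]  # (capacity, n_experts)
    gates = np.take_along_axis(scores, top_idx, axis=0)
    return top_idx, gates
```

Because selection is per-expert rather than per-token, tokens from complex text–image pairs can be picked by many experts (more compute) while simple ones are picked by few, which is the heterogeneous allocation the paper exploits; the gating weights keep the routing end-to-end differentiable.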
👥 Authors
Haotian Sun (Georgia Institute of Technology; Machine Learning, Foundation Models)
Tao Lei (Apple AI/ML)
Bowen Zhang (Apple AI/ML)
Yanghao Li (Apple; Computer Vision)
Haoshuo Huang (Apple AI/ML)
Ruoming Pang (Apple AI/ML; Deep Learning)
Bo Dai (Georgia Institute of Technology)
Nan Du (Apple AI/ML)