Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

📅 2025-03-20

📈 Citations: 0

✨ Influential: 0

career value

157K/year

🤖 AI Summary

To address poor scalability, uneven expert utilization, and shallow-layer training difficulties in diffusion Transformers with Mixture-of-Experts (MoE), this paper proposes Expert Race—a dynamic sparse routing mechanism. It introduces a joint, race-style matching between tokens and experts to enable precise expert assignment for critical tokens. Additionally, we design layer-adaptive regularization and a router similarity loss to mitigate mode collapse and accelerate convergence in shallow layers. Notably, Expert Race is the first MoE routing framework to incorporate competitive racing principles. Evaluated on ImageNet image generation, it significantly improves FID scores and training stability, while increasing expert utilization by 37%. These results demonstrate superior scalability and generalization capability compared to prior MoE approaches in diffusion-based generative modeling.

Technology Category

Application Category

📝 Abstract

Diffusion models have emerged as mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.

Problem

Research questions and friction points this paper is trying to address.

Enhancing scalability and performance of diffusion transformers

Dynamic expert assignment to critical tokens

Preventing mode collapse and improving expert utilization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible routing strategy for MoE diffusion transformers

Dynamic expert assignment via token-expert competition

Layer regularization and router loss prevent mode collapse

🔎 Similar Papers

EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing