Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

πŸ“… 2025-10-28
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing Mixture-of-Experts (MoE) architectures in diffusion Transformers (DiTs) yield limited performance gains, primarily due to spatial redundancy and functional heterogeneity among visual tokens, which impede expert specialization. To address this, we propose ProMoEβ€”a novel framework introducing the first explicit semantic-guided two-stage routing mechanism that jointly leverages conditional routing and learnable prototypical routing, augmented by a dedicated routing contrastive loss to enhance intra-expert consistency and inter-expert diversity. ProMoE operates in the latent space to achieve fine-grained, semantically aware token-to-expert assignment. Extensive experiments demonstrate that ProMoE significantly outperforms existing MoE-DiT methods on ImageNet, while remaining compatible with both DDPM and Rectified Flow objectives. These results empirically validate the critical role of explicit, semantics-driven routing in vision-oriented MoE models.

πŸ“ Abstract
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and to refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on the ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
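The abstract's two-step router can be sketched in a few lines. The paper's code is not yet released, so the following is a minimal numpy sketch under stated assumptions: the function name `route_tokens`, the use of a single shared expert (id 0) for unconditional tokens, and cosine similarity as the latent-space metric are illustrative choices, not the paper's exact design.

```python
import numpy as np

def route_tokens(tokens, cond_mask, prototypes):
    """Two-step routing sketch.

    Step 1 (conditional routing): tokens flagged as unconditional by a
    function-level gate (cond_mask == False) go to a reserved expert (id 0).
    Step 2 (prototypical routing): conditional tokens are assigned to the
    expert whose learnable prototype is most similar in latent space,
    here measured by cosine similarity.

    tokens:     (N, D) token latents
    cond_mask:  (N,) bool, True for conditional tokens
    prototypes: (E, D) learnable prototypes, one per expert 1..E
    """
    # Normalize rows so the dot product becomes cosine similarity.
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    sim = t @ p.T                                  # (N, E) token-prototype similarity
    expert_ids = np.zeros(len(tokens), dtype=int)  # default: unconditional expert 0
    # Offset by 1 so experts 1..E serve conditional tokens only.
    expert_ids[cond_mask] = 1 + np.argmax(sim[cond_mask], axis=1)
    return expert_ids
```

In a real MoE layer the hard `argmax` would typically be replaced by a differentiable top-k gate; the sketch only illustrates the semantic token-to-expert assignment.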
Problem

Research questions and friction points this paper is trying to address.

Addresses limited gains of MoE in Diffusion Transformers due to visual token characteristics
Develops explicit routing guidance to partition tokens by function and semantics
Enhances expert specialization through conditional routing and prototypical routing mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-step router with explicit routing guidance
Partition tokens via conditional and prototypical routing
Routing contrastive loss enhances intra-expert coherence and inter-expert diversity
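The routing contrastive loss named above can be pictured as an InfoNCE-style objective: each token is pulled toward its assigned expert's prototype and pushed away from the others. This is a hedged sketch, not the paper's exact formulation; the function name, the temperature `tau`, and the softmax form are assumptions.

```python
import numpy as np

def routing_contrastive_loss(tokens, prototypes, assignments, tau=0.1):
    """InfoNCE-style sketch of a routing contrastive loss.

    Maximizing the softmax probability of each token's own prototype pulls
    tokens of one expert together (intra-expert coherence) while pushing
    them away from the other prototypes (inter-expert diversity).

    tokens:      (N, D) token latents
    prototypes:  (E, D) learnable prototypes
    assignments: (N,) index of the expert each token was routed to
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = (t @ p.T) / tau                      # (N, E) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each token's assigned prototype.
    return -log_prob[np.arange(len(tokens)), assignments].mean()
```

A sanity check of the intent: tokens that sit exactly on their assigned prototypes should incur a lower loss than the same tokens with shuffled assignments.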