DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

📅 2025-03-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models suffer from computational redundancy and performance bottlenecks due to uniform token processing across varying noise levels and sample complexities. To address this, we propose a dynamic token selection mechanism featuring a novel batch-level global token pool and a noise-aware capacity predictor, enabling on-demand allocation of computational resources. Integrated within a diffusion Transformer architecture, our approach employs a Mixture-of-Experts (MoE) module with dynamic routing, endowing expert subnetworks with both global distribution modeling capability and noise-adaptive behavior. On ImageNet, our method achieves state-of-the-art performance among diffusion models, outperforming dense models that use 3x the activated parameters while activating only 1x, and surpassing existing MoE-based diffusion methods. Furthermore, it generalizes effectively to text-to-image generation, delivering significant improvements in inference efficiency and cross-task generalization.

๐Ÿ“ Abstract
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
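The abstract's noise-aware capacity predictor can be pictured as a small network that maps the diffusion timestep to a fraction of tokens worth processing. The toy NumPy sketch below is only an illustration of that idea, assuming a standard sinusoidal timestep embedding and randomly initialized weights (`W1`, `W2` are hypothetical stand-ins for weights the paper would learn end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

def timestep_embedding(t, dim=8):
    """Sinusoidal embedding of a diffusion timestep (DDPM-style)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Hypothetical predictor weights; in DiffMoE these would be learned.
W1 = rng.normal(size=(8, 16)) * 0.1
W2 = rng.normal(size=(16, 1)) * 0.1

def predict_capacity(t, min_ratio=0.1):
    """Map a noise level t to a token-processing ratio in (min_ratio, 1)."""
    h = np.tanh(timestep_embedding(t) @ W1)          # tiny MLP on the embedding
    ratio = 1.0 / (1.0 + np.exp(-(h @ W2)[0]))       # sigmoid -> (0, 1)
    # Floor the ratio so every noise level keeps some compute.
    return min_ratio + (1.0 - min_ratio) * ratio
```

During inference, such a predictor would be queried once per denoising step, so harder (noisier) steps can be granted more tokens than easier ones.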
Problem

Research questions and friction points this paper is trying to address.

Addresses uniform input processing in diffusion models
Introduces dynamic token selection for scalable diffusion transformers
Improves performance in image and text-to-image generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token selection via global token pool
Capacity predictor for adaptive resource allocation
Specialized expert behavior in diffusion models
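The innovations above combine into a routing scheme where experts select tokens from a batch-level global pool rather than a fixed per-sample quota. The following minimal NumPy sketch illustrates that selection pattern under stated assumptions: `router_w` is a hypothetical linear router, the experts are placeholder identity maps, and `capacity_ratio` stands in for the output of the learned capacity predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_pool_routing(tokens, router_w, capacity_ratio):
    """Route a batch-level global token pool to experts.

    tokens: (num_tokens, dim) -- all tokens in the batch flattened into one pool
    router_w: (dim, num_experts) -- hypothetical linear router weights
    capacity_ratio: fraction of the pool each expert may process (stand-in
        for the noise-aware capacity predictor's output)
    """
    num_tokens, _ = tokens.shape
    num_experts = router_w.shape[1]
    scores = softmax(tokens @ router_w, axis=-1)      # token-to-expert affinities
    capacity = max(1, int(capacity_ratio * num_tokens))
    outputs = np.zeros_like(tokens)
    for e in range(num_experts):
        # Each expert takes its top-`capacity` tokens from the *global* pool,
        # so allocation need not be uniform across samples in the batch.
        chosen = np.argsort(-scores[:, e])[:capacity]
        # Placeholder expert: identity scaled by the routing weight.
        outputs[chosen] += scores[chosen, e:e + 1] * tokens[chosen]
    return outputs, capacity
```

Because selection happens over the pooled batch, easy samples naturally contribute fewer tokens and hard samples more, which is the source of the compute savings the summary describes.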