Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image generation but suffer from high inference overhead due to their large parameter count. Existing pruning methods often incur severe performance degradation due to excessive capacity loss. To address this, we propose the first structured sparsification of dense DiTs into a Mixture-of-Experts (MoE) architecture. Our approach replaces feed-forward networks (FFNs) with MoE layers and introduces a novel "Mixture of Blocks" mechanism for dynamic, block-level expert activation. We further integrate Taylor criterion-based initialization, multi-stage knowledge distillation, and load-balancing constraints to preserve model capacity and training stability. Additionally, we design a grouped feature consistency loss to enhance generative fidelity. Evaluated on FLUX.1 [dev], our method achieves a 60% reduction in activated parameters while matching the original DiT's generation quality, substantially outperforming leading pruning baselines and establishing a new Pareto-optimal trade-off between inference efficiency and image quality.
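The Taylor criterion mentioned above usually refers to the first-order Taylor importance score |w · ∂L/∂w|, which estimates how much the loss would change if a parameter were zeroed; the most important channels of the dense FFN would then seed the experts. The per-output-channel aggregation below is an assumption, as the summary does not specify how Dense2MoE aggregates the scores.

```python
import torch

def taylor_importance(weight, grad):
    """First-order Taylor importance: |w * dL/dw| estimates the loss
    change from removing a parameter. Scores are summed over the input
    dimension to rank output channels (an assumed aggregation)."""
    return (weight * grad).abs().sum(dim=1)  # one score per output channel

# Toy example: compute gradients for a small weight matrix and rank channels.
w = torch.randn(4, 8, requires_grad=True)
loss = (w.sum(dim=1) ** 2).mean()
loss.backward()
scores = taylor_importance(w.detach(), w.grad)
ranking = scores.argsort(descending=True)  # most important channels first
```

Channels with the highest scores would be the natural candidates for initializing the experts that are activated most often.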

📝 Abstract
Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
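The FFN-to-MoE conversion described in the abstract can be sketched as follows: the dense FFN's hidden dimension is partitioned across experts, and a learned router activates only the top-k experts per token. The configuration below (8 experts, top-3 routing) is an assumption chosen so that 3/8 = 37.5% of the FFN parameters are active per token, consistent with the quoted 62.5% reduction; the paper's actual expert count and k may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Sketch of a dense FFN restructured into experts with top-k routing.

    Hypothetical setup: the d_ff hidden width is split evenly across
    num_experts, so activating top_k of them uses top_k/num_experts of
    the original FFN parameters per token."""

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=3):
        super().__init__()
        assert d_ff % num_experts == 0
        d_expert = d_ff // num_experts  # each expert owns a slice of the FFN
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        logits = self.router(x)                 # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # combine the k selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(10, 64)
y = MoEFFN()(x)
```

The Mixture of Blocks (MoB) idea extends the same routing decision from individual FFNs to whole DiT blocks, skipping entire blocks per token or timestep to push sparsity further.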
Problem

Research questions and friction points this paper is trying to address.

Reducing DiT inference overhead by restructuring to Mixture of Experts
Maintaining model capacity while activating fewer parameters
Overcoming performance degradation from aggressive pruning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms dense DiT into Mixture of Experts structure
Replaces FFNs with MoE layers reducing activated parameters
Uses multi-step distillation pipeline for effective conversion
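The load-balancing component of the distillation pipeline is commonly implemented as a Switch-Transformer-style auxiliary loss that discourages the router from collapsing onto a few experts. The sketch below assumes that formulation, since the summary does not specify the paper's exact constraint.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits, top1_idx, num_experts):
    """Auxiliary loss encouraging uniform expert usage (an assumed,
    Switch-Transformer-style formulation): the product of the fraction
    of tokens routed to each expert and the mean router probability it
    receives is minimized when both are uniform."""
    probs = F.softmax(router_logits, dim=-1)                    # (tokens, E)
    frac_tokens = F.one_hot(top1_idx, num_experts).float().mean(0)
    frac_probs = probs.mean(0)
    return num_experts * (frac_tokens * frac_probs).sum()

# Toy usage: random router logits for 32 tokens over 8 experts.
logits = torch.randn(32, 8)
loss = load_balance_loss(logits, logits.argmax(-1), 8)
```

In training, this term would be added with a small coefficient to the knowledge-distillation objective so that capacity stays evenly spread across experts.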
Youwei Zheng (Sun Yat-sen University)
Yuxi Ren (ByteDance Seed Vision)
Xin Xia (ByteDance Seed Vision)
Xuefeng Xiao (ByteDance Seed)
Xiaohua Xie (Sun Yat-sen University)

Tags: Computer Vision · Efficient AI