E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models exhibit strong image generation capabilities but suffer from excessive parameter counts, high training costs, and significant inference latency. To address these limitations—particularly for resource-constrained deployment—this paper introduces E-MMDiT, an efficient multimodal diffusion Transformer with only 304 million parameters. Key innovations include a high-compression visual tokenizer, multi-path token compression, Position Reinforcement for enhanced spatial awareness, Alternating Sub-region Attention (ASA) to reduce quadratic attention complexity, and a lightweight AdaLN-affine modulation module. These techniques collectively minimize computational overhead without sacrificing fidelity. E-MMDiT completes training on 25 million samples in just 1.5 days using a single node with 8× AMD MI300X accelerators. On the GenEval benchmark, it achieves 0.66 for 512px image generation, improving to 0.72 after GRPO-based optimization—demonstrating a favorable trade-off between efficiency and generation quality.
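Among the techniques above, Alternating Sub-region Attention (ASA) is the main lever on the quadratic attention cost: tokens attend only within fixed-size subregions, with the partition alternating between blocks so information still mixes across region borders. The summary does not give implementation details, so the following is a minimal sketch under those assumptions (the `offset` shift between blocks, in the spirit of shifted windows, is my illustration, not the paper's confirmed design):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def subregion_attention(x, region, offset=0):
    """Self-attention restricted to non-overlapping subregions of `region` tokens.

    `offset` shifts the partition boundaries; alternating the offset between
    transformer blocks (an assumption about ASA) lets tokens near a region
    border mix in the next block. Cost per block is O(n * region * d)
    instead of O(n^2 * d) for full attention.
    """
    n, d = x.shape
    x = np.roll(x, -offset, axis=0)
    out = np.empty_like(x)
    for start in range(0, n, region):
        r = x[start:start + region]            # (region, d) slice of tokens
        scores = r @ r.T / np.sqrt(d)          # attention logits within region
        out[start:start + region] = softmax(scores) @ r
    return np.roll(out, offset, axis=0)

# 64 tokens of dim 8: consecutive blocks use shifted 16-token subregions.
tokens = np.random.default_rng(0).standard_normal((64, 8))
y_even = subregion_attention(tokens, region=16, offset=0)
y_odd = subregion_attention(tokens, region=16, offset=8)
```

With 64 tokens and 16-token regions, each block computes four 16x16 attention maps rather than one 64x64 map, a 4x reduction in score computation that grows with sequence length.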

📝 Abstract
Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy architectures with high inference latency. To this end, we propose the Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis with low training resource requirements. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained on only 25M public samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches 0.72 with post-training techniques such as GRPO. Our design philosophy centers on token reduction, as computational cost scales significantly with token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module to compress tokens further. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E, and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
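The abstract describes AdaLN-affine only as a lightweight module for computing per-block modulation parameters. As context, DiT-style blocks normally spend a full MLP per block on these parameters; a natural lightweight alternative is a per-block elementwise affine map over a shared conditioning embedding. The sketch below illustrates that idea; the class name, parameter shapes, and the specific affine form are assumptions for illustration, not the paper's confirmed design:

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # Normalize over the feature dimension, no learned affine here.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNAffine:
    """Hypothetical lightweight AdaLN variant.

    Instead of a per-block MLP producing shift/scale from the conditioning
    vector, each block learns only elementwise weights and biases applied
    to a shared conditioning embedding (2*dim parameters per modulated
    quantity, versus O(dim^2) for an MLP). This is an assumption about
    what "AdaLN-affine" computes, based on the abstract.
    """
    def __init__(self, dim, rng):
        self.w_shift = rng.standard_normal(dim) * 0.02
        self.b_shift = np.zeros(dim)
        self.w_scale = rng.standard_normal(dim) * 0.02
        self.b_scale = np.zeros(dim)

    def __call__(self, x, cond):
        shift = self.w_shift * cond + self.b_shift  # elementwise affine
        scale = self.w_scale * cond + self.b_scale
        return (1.0 + scale) * layernorm(x) + shift

# Modulate 4 tokens of dim 8 with a shared conditioning vector.
rng = np.random.default_rng(0)
mod = AdaLNAffine(8, rng)
x = rng.standard_normal((4, 8))
cond = rng.standard_normal(8)
y = mod(x, cond)
```

The appeal of such an affine map is that its parameter and compute cost is linear in the hidden dimension, which matters when every transformer block carries its own modulation module.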
Problem

Research questions and friction points this paper is trying to address.

Developing efficient multimodal diffusion models for fast image synthesis
Reducing computational costs and training resource requirements
Maintaining spatial coherence while compressing visual tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient multimodal diffusion transformer with 304M parameters
Compressive tokenizer and multi-path compression module
Alternating subregion attention and AdaLN-affine modulation
Tong Shen
Process Engineer, Enex International Inc.
Jingai Yu
Advanced Micro Devices, Inc.
Dong Zhou
Advanced Micro Devices, Inc.
Dong Li
Advanced Micro Devices, Inc.
Emad Barsoum
AMD, Columbia University
Generative AI · Foundation Models · Agentic AI · Computer Vision · ML Frameworks