MMaDA: Multimodal Large Diffusion Language Models

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper introduces MMaDA, a multimodal foundation model built on a unified diffusion architecture and designed to jointly handle textual reasoning, multimodal understanding, and text-to-image generation. Methodologically: (1) it employs a modality-agnostic diffusion backbone with a shared probabilistic formulation, eliminating modality-specific components; (2) it applies mixed long chain-of-thought (CoT) fine-tuning with a unified CoT format across modalities, which also serves as a cold start for the subsequent reinforcement learning (RL) stage; and (3) it introduces UniGRPO, a unified policy-gradient RL algorithm tailored to diffusion foundation models, which uses diversified reward modeling to cover both reasoning and generation tasks. According to the reported experiments, MMaDA-8B surpasses LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and outperforms SDXL and Janus in text-to-image generation. The code and trained models are publicly released.

📝 Abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
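The abstract describes UniGRPO only at a high level: a policy-gradient method that draws on several reward signals. As a rough illustration of the general GRPO-style idea its name suggests, the sketch below normalizes rewards within a group of samples generated for the same prompt and combines several reward components into one scalar. This is not the authors' implementation. The function names, the reward component names (`correctness`, `format`), and the weights are all hypothetical, and the diffusion-specific parts of UniGRPO are not shown.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of several reward signals (names and weights are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their group's average get a positive learning signal.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical samples for one prompt, each scored on two components.
    weights = {"correctness": 1.0, "format": 0.2}
    scored = [
        {"correctness": 1.0, "format": 1.0},
        {"correctness": 0.0, "format": 1.0},
        {"correctness": 1.0, "format": 0.0},
        {"correctness": 0.0, "format": 0.0},
    ]
    rewards = [combine_rewards(s, weights) for s in scored]
    print(group_relative_advantages(rewards))
```

Normalizing within the group removes the need for a separate learned value baseline, which is the usual motivation for group-relative schemes. How MMaDA adapts this to diffusion sampling is described in the paper itself, not here.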

Problem

Research questions and friction points this paper is trying to address.

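Building a single diffusion model that covers textual reasoning, multimodal understanding, and text-to-image generation

Aligning reasoning across textual and visual domains through a mixed long chain-of-thought (CoT) strategy

Unifying RL post-training across reasoning and generation tasks for consistent performance improvement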

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified diffusion architecture for multimodal processing

Mixed long chain-of-thought fine-tuning strategy

UniGRPO policy-gradient RL algorithm for diffusion models
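To make the "modality-agnostic" idea in the first item concrete, here is a minimal sketch that applies one corruption process to a single sequence containing both text and image tokens. It assumes a masked (absorbing-state) discrete diffusion formulation, which is one common choice for diffusion language models. This summary does not state which formulation MMaDA uses, and the `<mask>` token, the token strings, and the function name are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token shared by all modalities


def mask_tokens(tokens, t, rng):
    """Forward corruption step: mask each token independently with probability t.

    The same rule applies to every position, whether it holds a text token
    or a discrete image token, so no modality-specific branch is needed.
    """
    return [MASK if rng.random() < t else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    # One sequence mixing text tokens and (hypothetical) discrete image codes.
    sequence = ["a", "red", "cube", "<img_17>", "<img_402>", "<img_9>"]
    for t in (0.0, 0.5, 1.0):
        print(t, mask_tokens(sequence, t, rng))
```

A model trained under this kind of scheme learns to predict the original tokens at the masked positions. Because text and image tokens share one vocabulary and one objective, the backbone itself needs no modality-specific components.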
