MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive, thinking-aware generation methods suffer from error propagation, which degrades semantic alignment between the generated reasoning text and images. To address this, we propose MMaDA-Parallel, the first parallel multimodal diffusion framework to enable continuous, bidirectional interaction between text and image throughout the entire denoising process. We further introduce ParaBench, the first benchmark specifically designed to evaluate reasoning-image alignment across both output modalities. Additionally, we develop Parallel Reinforcement Learning (ParaRL), a strategy that applies trajectory-level semantic rewards to jointly optimize both modalities. Compared to the state-of-the-art Bagel model, MMaDA-Parallel achieves a 6.9% improvement in Output Alignment on ParaBench, significantly enhancing consistency and faithfulness between chain-of-thought reasoning and the corresponding visual content.

📝 Abstract
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel.
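
To make the parallel-decoding idea concrete, below is a minimal sketch of joint text/image denoising in the style of a masked discrete diffusion model: both token streams start fully masked and are unmasked together, step by step, under a single joint transformer so each modality conditions on the other at every step. All names (JointDenoiser, parallel_denoise), sizes, and the confidence-based unmasking schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: names, sizes, and the unmasking schedule are
# assumptions, not MMaDA-Parallel's actual implementation.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy stand-in for a joint transformer over the concatenated
    text and image token sequences of a masked discrete diffusion model."""
    def __init__(self, vocab_size=1024, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, dim)  # +1 for [MASK]
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # Full self-attention over the joint sequence lets text tokens
        # condition on image tokens and vice versa at every step: the
        # bidirectional interaction the parallel framework emphasizes.
        return self.head(self.block(self.embed(tokens)))

@torch.no_grad()
def parallel_denoise(model, text_len=16, img_len=32, steps=8, vocab_size=1024):
    mask_id = vocab_size
    tokens = torch.full((1, text_len + img_len), mask_id)  # start fully masked
    for step in range(steps):
        conf, pred = model(tokens).softmax(-1).max(-1)
        still_masked = tokens == mask_id
        # Unmask the most confident fraction of BOTH modalities each step,
        # so text and image are refined jointly rather than sequentially.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens[:, :text_len], tokens[:, text_len:]

model = JointDenoiser()
text_tokens, image_tokens = parallel_denoise(model)
print(text_tokens.shape, image_tokens.shape)  # (1, 16) and (1, 32)
```

Because every step updates both modalities under full self-attention, an early textual error can still be revised once image evidence emerges, which is precisely the recovery a sequential autoregressive pipeline cannot perform.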
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation in thinking-aware generation caused by error propagation
Analyzes the poor alignment between generated reasoning and final image outputs
Improves cross-modal consistency in text-to-image synthesis through parallel diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel multimodal diffusion framework for bidirectional interaction
Supervised finetuning followed by Parallel Reinforcement Learning (ParaRL) optimization
Semantic rewards along the denoising trajectory enforce cross-modal consistency (see the sketch after this list)
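
A hedged sketch of the trajectory-level reward idea follows: rather than scoring only the final sample, an alignment score is computed on intermediate denoising states and aggregated over the whole trajectory. Here semantic_score is a placeholder for a pretrained text-image alignment scorer, and the geometric weighting is an illustrative choice, not the paper's exact ParaRL objective.

```python
# Trajectory-level semantic reward: aggregate alignment scores over the
# whole denoising trajectory instead of rewarding only the final output.
# semantic_score is a stand-in for a real alignment model (e.g., a
# CLIP-style scorer on decoded outputs); the weighting is illustrative.
import torch

def semantic_score(text_tokens, image_tokens):
    # Placeholder: a real implementation would decode both token
    # sequences and score them with a pretrained alignment model.
    return torch.rand(())

def trajectory_reward(trajectory, gamma=0.9):
    """Weighted sum of per-step alignment scores; later, less-noisy
    steps get more weight, so the reward emphasizes final semantics
    while still shaping intermediate states."""
    T = len(trajectory)
    total, weight_sum = torch.zeros(()), 0.0
    for t, (text_t, image_t) in enumerate(trajectory):
        w = gamma ** (T - 1 - t)  # weight grows toward the end
        total = total + w * semantic_score(text_t, image_t)
        weight_sum += w
    return total / weight_sum

# Usage with the intermediate states of any parallel denoiser:
trajectory = [(torch.randint(0, 1024, (1, 16)), torch.randint(0, 1024, (1, 32)))
              for _ in range(8)]
print(trajectory_reward(trajectory))
```

In an RL loop, this scalar would serve as the reward for a policy-gradient update of the denoiser, pushing the model toward trajectories whose reasoning and image stay semantically aligned throughout.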
👥 Authors
Ye Tian (Peking University)
Ling Yang (Postdoc @ Princeton University; PhD @ Peking University) · LLM · Diffusion Models · Reinforcement Learning · Complex Data Modeling
Jiongfan Yang (Peking University)
Anran Wang (ByteDance)
Yu Tian (ByteDance)
Jiani Zheng (ByteDance)
Haochen Wang (ByteDance, CASIA)
Zhiyang Teng (ByteDance SG) · Natural Language Processing
Zhuochen Wang (ByteDance)
Yinjie Wang (University of Chicago) · Statistics
Yunhai Tong (Peking University) · Data Mining
Mengdi Wang (Princeton University)
Xiangtai Li (Research Scientist, TikTok SG; MMLab@NTU) · Generative AI · Computer Vision