MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive, thinking-aware generation methods suffer from error propagation, which degrades semantic alignment between the generated reasoning text and images. To address this, we propose MMaDA-Parallel, the first parallel multimodal diffusion framework to enable continuous, bidirectional interaction between text and image throughout the entire denoising process. We further introduce ParaBench, the first benchmark specifically designed to evaluate reasoning-image alignment across both output modalities. Additionally, we develop Parallel Reinforcement Learning (ParaRL), a strategy that applies trajectory-level semantic rewards to jointly optimize both modalities. Compared to the state-of-the-art Bagel model, MMaDA-Parallel achieves a 6.9% improvement in Output Alignment on ParaBench, significantly enhancing consistency and faithfulness between chain-of-thought reasoning and the corresponding visual content.

📝 Abstract
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel.
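
To make the parallel-decoding idea concrete, below is a minimal sketch of joint text/image denoising in the style of a masked discrete diffusion model: both token streams start fully masked and are unmasked together, step by step, under a single joint transformer so each modality conditions on the other at every step. All names (JointDenoiser, parallel_denoise), sizes, and the confidence-based unmasking schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: names, sizes, and the unmasking schedule are
# assumptions, not MMaDA-Parallel's actual implementation.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy stand-in for a joint transformer over the concatenated
    text and image token sequences of a masked discrete diffusion model."""
    def __init__(self, vocab_size=1024, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, dim)  # +1 for [MASK]
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # Full self-attention over the joint sequence lets text tokens
        # condition on image tokens and vice versa at every step: the
        # bidirectional interaction the parallel framework emphasizes.
        return self.head(self.block(self.embed(tokens)))

@torch.no_grad()
def parallel_denoise(model, text_len=16, img_len=32, steps=8, vocab_size=1024):
    mask_id = vocab_size
    tokens = torch.full((1, text_len + img_len), mask_id)  # start fully masked
    for step in range(steps):
        conf, pred = model(tokens).softmax(-1).max(-1)
        still_masked = tokens == mask_id
        # Unmask the most confident fraction of BOTH modalities each step,
        # so text and image are refined jointly rather than sequentially.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens[:, :text_len], tokens[:, text_len:]

model = JointDenoiser()
text_tokens, image_tokens = parallel_denoise(model)
print(text_tokens.shape, image_tokens.shape)  # (1, 16) and (1, 32)
```

Because every step updates both modalities under full self-attention, an early textual error can still be revised once image evidence emerges, which is precisely the recovery a sequential autoregressive pipeline cannot perform.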
Problem

Research questions and friction points this paper is trying to address.

Addresses performance degradation in thinking-aware generation caused by error propagation
Analyzes the poor alignment between generated reasoning and final image outputs
Improves cross-modal consistency in text-to-image synthesis through parallel diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel multimodal diffusion framework for bidirectional interaction
Supervised finetuning followed by Parallel Reinforcement Learning (ParaRL) optimization
Semantic rewards along the denoising trajectory enforce cross-modal consistency (see the sketch after this list)
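
A hedged sketch of the trajectory-level reward idea follows: rather than scoring only the final sample, an alignment score is computed on intermediate denoising states and aggregated over the whole trajectory. Here semantic_score is a placeholder for a pretrained text-image alignment scorer, and the geometric weighting is an illustrative choice, not the paper's exact ParaRL objective.

```python
# Trajectory-level semantic reward: aggregate alignment scores over the
# whole denoising trajectory instead of rewarding only the final output.
# semantic_score is a stand-in for a real alignment model (e.g., a
# CLIP-style scorer on decoded outputs); the weighting is illustrative.
import torch

def semantic_score(text_tokens, image_tokens):
    # Placeholder: a real implementation would decode both token
    # sequences and score them with a pretrained alignment model.
    return torch.rand(())

def trajectory_reward(trajectory, gamma=0.9):
    """Weighted sum of per-step alignment scores; later, less-noisy
    steps get more weight, so the reward emphasizes final semantics
    while still shaping intermediate states."""
    T = len(trajectory)
    total, weight_sum = torch.zeros(()), 0.0
    for t, (text_t, image_t) in enumerate(trajectory):
        w = gamma ** (T - 1 - t)  # weight grows toward the end
        total = total + w * semantic_score(text_t, image_t)
        weight_sum += w
    return total / weight_sum

# Usage with the intermediate states of any parallel denoiser:
trajectory = [(torch.randint(0, 1024, (1, 16)), torch.randint(0, 1024, (1, 32)))
              for _ in range(8)]
print(trajectory_reward(trajectory))
```

In an RL loop, this scalar would serve as the reward for a policy-gradient update of the denoiser, pushing the model toward trajectories whose reasoning and image stay semantically aligned throughout.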
👥 Authors
Ye Tian (Peking University)
Ling Yang (Postdoc @ Princeton University; PhD @ Peking University) · LLM · Diffusion Models · Reinforcement Learning · Complex Data Modeling
Jiongfan Yang (Peking University)
Anran Wang (ByteDance)
Yu Tian (ByteDance)
Jiani Zheng (ByteDance)
Haochen Wang (ByteDance, CASIA)
Zhiyang Teng (ByteDance SG) · Natural Language Processing
Zhuochen Wang (ByteDance)
Yinjie Wang (University of Chicago) · Statistics
Yunhai Tong (Peking University) · Data Mining
Mengdi Wang (Princeton University)
Xiangtai Li (Research Scientist, TikTok SG; MMLab@NTU) · Generative AI · Computer Vision