Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

📅 2025-10-03
🤖 AI Summary
This work addresses the challenge of integrating discrete diffusion models (DDMs) into reinforcement learning (RL) under the non-autoregressive paradigm. We propose MaskGRPO, the first RL framework with a scalable theoretical foundation for DDMs. It introduces a modality-aware importance sampling mechanism, a vision-specific rollout strategy, and token-level volatility modeling to improve gradient estimation quality. Unlike existing RL methods incompatible with DDM architectures (e.g., GRPO), MaskGRPO enables joint policy optimization over both textual and visual sequences. Experiments demonstrate significant performance gains on mathematical reasoning, code generation, and multimodal visual synthesis tasks, alongside enhanced training stability and superior output quality.

📝 Abstract
Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which confounds reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to scalable multimodal reinforcement learning in discrete diffusion, with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures the token fluctuations valuable for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical one for discretized visual diffusion.
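The abstract names GRPO and importance sampling, but this page gives no formulas. As a rough illustration only, the sketch below assumes the standard GRPO recipe (rewards normalized within a group of sampled completions, plus a PPO-style clipped importance ratio) and further assumes that the ratio is evaluated only at masked token positions. The function names, the `masked` flag list, and the default `clip` value are hypothetical; this is not MaskGRPO's actual estimator.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, masked, clip=0.2):
    """Average clipped surrogate over masked positions only.

    logp_new / logp_old: per-token log-probabilities under the current
    and the rollout policy. masked: booleans marking which positions
    were masked (an assumption about how a DDM would be scored).
    """
    terms = []
    for ln, lo, is_masked in zip(logp_new, logp_old, masked):
        if not is_masked:
            continue  # unmasked tokens contribute nothing in this sketch
        ratio = math.exp(ln - lo)  # importance ratio for this token
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0


# Usage: four completions for one prompt, binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_token_objective(
    logp_new=[-0.5, -1.0], logp_old=[-0.5, -1.0],
    advantage=advantages[0], masked=[True, False],
)
```

Because each advantage is computed relative to the group mean, no separate value network is needed; the clipping keeps a single token with a large ratio from dominating the update.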
Problem

Research questions and friction points this paper is trying to address.

Optimizing discrete diffusion models with reinforcement learning rewards
Enabling scalable multimodal reinforcement learning for discrete diffusion
Improving reasoning performance and generation quality across benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

MaskGRPO enables scalable multimodal reinforcement learning
Importance estimator captures token fluctuation for gradients
Tailored rollout method yields diverse visual completions
Authors
Tianren Ma
University of Chinese Academy of Sciences
Mu Zhang
University of Chinese Academy of Sciences
Yibing Wang
University of Chinese Academy of Sciences
Qixiang Ye
University of Chinese Academy of Sciences, University of Maryland
Visual Object Detection, Image Processing