Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Masked Diffusion Models (MDMs) suffer from training-inference misalignment: training employs single-step BERT-style masked token prediction, whereas inference relies on multi-step, scheduler-driven iterative decoding—leaving the scheduling policy unoptimized during training. This work is the first to formulate MDM’s multi-step token decoding as a unified Markov Decision Process (MDP), enabling joint trajectory-level optimization of both model parameters and decoding scheduling policies. Our method leverages Group Relative Policy Optimization (GRPO) to achieve reward-driven, end-to-end joint training without backpropagating through the multi-step generation process. This paradigm significantly improves training-inference alignment and consistently outperforms existing MDM approaches across four major benchmarks—ImageReward, HPS, GenEval, and DPG-Bench—demonstrating substantial gains in generation quality.

📝 Abstract
Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks (ImageReward, HPS, GenEval, and DPG-Bench) demonstrate the effectiveness of our approach. For more details, please refer to our project page: https://co-grpo.github.io/.
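To make the GRPO ingredient concrete, here is a minimal sketch (not the authors' code) of the group-relative advantage that replaces a learned value baseline: rewards from a group of trajectories sampled for the same prompt are normalized against the group's own mean and standard deviation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G trajectories sampled
    for the same prompt: A_i = (r_i - mean(r)) / (std(r) + eps).

    Trajectories scoring above the group mean get positive advantage
    and are reinforced; below-average ones are suppressed.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: reward-model scores for 4 trajectories of one prompt.
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because the baseline is the group mean, the advantages of each group sum to (approximately) zero, so no separate critic network is needed.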
Problem

Research questions and friction points this paper is trying to address.

MDM training uses single-step masked prediction, while inference is multi-step and iterative, creating a training-inference discrepancy
Inference schedules that dictate the token-decoding trajectory are never optimized during training
Backpropagating through the multi-step generation process to close this gap is prohibitively costly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates MDM generation as a unified Markov Decision Process
Jointly optimizes model and schedule parameters via Group Relative Policy Optimization
Aligns training with inference without backpropagation through multi-step generation
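The MDP view of decoding can be illustrated with a toy step function (an illustrative sketch under simplifying assumptions, not the paper's implementation: the confidence-based top-k selection and random logits stand in for the real model and learned schedule). At each state, the MDM scores masked positions, and the schedule decides how many tokens `k` to reveal; the resulting transition is one step of the trajectory that GRPO rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1  # placeholder id for a not-yet-decoded token

def decode_step(tokens, token_logits, k):
    """One MDP transition: reveal the k most confident masked positions.

    token_logits: (seq_len, vocab) scores from the MDM (random here).
    k: step budget; in Co-GRPO this would come from the schedule policy.
    """
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(token_logits - token_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    conf = probs.max(-1)
    conf[tokens != MASK] = -np.inf          # only masked slots compete
    picks = np.argsort(conf)[-k:]           # top-k most confident slots
    new = tokens.copy()
    new[picks] = probs[picks].argmax(-1)    # commit the argmax token
    return new

tokens = np.full(8, MASK)
logits = rng.normal(size=(8, 16))
tokens = decode_step(tokens, logits, k=3)   # 3 tokens revealed this step
```

Running such steps until no masks remain yields one trajectory; sampling several per prompt and scoring each with a reward model supplies the groups that GRPO normalizes over.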
Renping Zhou
Leap Lab, Tsinghua University
Zanlin Ni
Tsinghua University
Tianyi Chen
Leap Lab, Tsinghua University
Zeyu Liu
Leap Lab, Tsinghua University
Yang Yue
Leap Lab, Tsinghua University
Yulin Wang
Shanghai Jiao Tong University
Yuxuan Wang
Leap Lab, Tsinghua University
Jingshu Liu
Leap Lab, Tsinghua University
Gao Huang
Leap Lab, Tsinghua University