MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models suffer from a structural inconsistency between training (random masking) and inference (iterative denoising), degrading performance. To address this, we formulate diffusion denoising as a sequential decision-making process and propose MDPO, a Markov Decision Process (MDP)-based framework that optimizes denoising trajectories via policy gradient methods, thereby aligning training and inference dynamics. Additionally, we introduce RCR, a training-free remasking strategy that enables dynamic correction during generation. Experiments show MDPO matches prior state-of-the-art performance with only 1/60 the gradient updates, and yields average improvements of 9.6% on MATH500 and 54.2% on Countdown when trained with the same update budget; RCR delivers consistent further gains. Our core contribution lies in unifying diffusion language modeling as a controllable sequential decision problem and bridging the training-inference gap through lightweight, principled mechanisms.

📝 Abstract
Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence, leaving fewer and fewer masked tokens at each step, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy can lead to suboptimal performance, it has been largely overlooked by previous works, leaving the gap between the two stages an open problem. To address this, we frame learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of the diffusion process and explicitly trains the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained with the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, which we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings highlight the potential of investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
Problem

Research questions and friction points this paper is trying to address.

Addresses training-inference mismatch in masked diffusion models
Improves denoising via reinforcement learning for better generation
Enhances token refinement flexibility with remasking strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for denoising trajectory optimization
Markov property exploitation in diffusion policy
Improved remasking strategy for flexible token refinement
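The remasking idea in the list above can be sketched as a confidence-based rule. The exact rule and the name `remask_step` are assumptions for illustration, not the paper's RCR algorithm: the point is that previously revealed tokens are not locked in, so a position whose confidence drops relative to others can return to the masked state and be refined in a later step.

```python
import numpy as np

MASK = -1  # illustrative sentinel for a masked position

def remask_step(tokens, confidence, k):
    """Keep only the k most confident positions revealed; re-mask the rest.
    Unlike a fixed unmasking schedule, already-revealed tokens remain
    eligible for re-masking, letting the model revise earlier choices."""
    tokens = tokens.copy()
    order = np.argsort(-confidence)        # highest confidence first
    keep = set(order[:k].tolist())
    for i in range(len(tokens)):
        if i not in keep:
            tokens[i] = MASK
    return tokens

tokens = np.array([5, 9, 2, 7])
conf = np.array([0.9, 0.3, 0.8, 0.4])
print(remask_step(tokens, conf, k=2))  # [ 5 -1  2 -1]: positions 1 and 3 re-masked
```

Because the rule only reorders which positions stay revealed, it needs no training and can be dropped into the existing inference loop as a plug-in replacement for the standard remasking step.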