Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion large language models (dLLMs) significantly underperform autoregressive models on reasoning tasks, and no principled reinforcement learning (RL) algorithms exist that align with their denoising generation mechanism. Method: We propose the first principled RL fine-tuning framework for dLLMs—Distribution Matching Policy Optimization (DMPO)—which minimizes the cross-entropy between the model’s output distribution and a reward-tilted optimal distribution. To mitigate high variance in mini-batch training, DMPO introduces a weight baseline subtraction technique; the method requires no supervised fine-tuning. Contribution/Results: Experiments demonstrate that DMPO substantially surpasses prior state-of-the-art methods across multiple reasoning benchmarks, achieving up to a 42.9% accuracy gain over previous SOTA and a 55.8% improvement over the base dLLM. This work provides the first empirical validation of the effectiveness and scalability of pure RL—without supervised initialization—for enhancing dLLM reasoning capabilities.

📝 Abstract
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve performance comparable to AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited to dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge that arises with small training batch sizes and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $42.9\%$ over previously SOTA baselines and $55.8\%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning capabilities of diffusion large language models
Developing reinforcement learning for diffusion models' unique characteristics
Matching policy distribution to optimal reward-tilted distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution Matching Policy Optimization for dLLMs
Cross-entropy optimization matches policy distributions
Weight baseline subtraction enables small batch training
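To make the ideas above concrete, here is a minimal, hypothetical sketch of a reward-tilted distribution-matching objective with a weight baseline subtraction. This is not the authors' implementation: the exponential tilting form, the temperature `beta`, and the mean-weight baseline are illustrative assumptions about how such a cross-entropy objective could be computed over a mini-batch of sampled completions.

```python
import numpy as np

def tilted_weights(rewards, beta=1.0):
    """Reward-tilted weights w_i proportional to exp(r_i / beta).

    Assumption: the optimal distribution is the reference policy tilted
    by exp(reward / beta); matching it via cross-entropy reduces to a
    weighted negative log-likelihood with these normalized weights.
    """
    rewards = np.asarray(rewards, dtype=float)
    z = np.exp((rewards - rewards.max()) / beta)  # subtract max for stability
    return z / z.sum()

def dmpo_style_loss(log_probs, rewards, beta=1.0):
    """Weighted NLL with a baseline subtracted from the weights.

    Subtracting a baseline (here, the mean weight -- an illustrative
    choice) reduces the variance of the mini-batch estimate without
    biasing which samples are up- or down-weighted relative to average.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    w = tilted_weights(rewards, beta)
    baseline = w.mean()  # weight baseline (assumption: simple batch mean)
    return -np.sum((w - baseline) * log_probs)
```

Samples whose weight exceeds the baseline have their log-likelihood pushed up, while below-average samples are pushed down, which is what lets the method train stably from small batches without a supervised warm start.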