🤖 AI Summary
Diffusion large language models (dLLMs) significantly underperform autoregressive models on reasoning tasks, and no principled reinforcement learning (RL) algorithms exist that align with their denoising generation mechanism. Method: We propose the first principled RL fine-tuning framework for dLLMs—Distribution-Matching Policy Optimization (DMPO)—which minimizes the cross-entropy between the model’s output distribution and a reward-weighted optimal distribution. To mitigate high variance in mini-batch training, DMPO introduces a weight-based baseline subtraction, eliminating reliance on supervised fine-tuning. Contribution/Results: Experiments demonstrate that DMPO substantially surpasses prior state-of-the-art methods across multiple reasoning benchmarks, achieving up to a 42.9% absolute accuracy gain over previous SOTA and a 55.8% improvement over the base dLLM. This work provides the first empirical validation of the effectiveness and scalability of pure RL—without supervised initialization—for enhancing dLLM reasoning capabilities.
📝 Abstract
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs' unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $42.9%$ over previously SOTA baselines and $55.8%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.