Relative Score Policy Optimization for Diffusion Language Models

📅 2026-05-11

🤖 AI Summary
Diffusion language models are hard to improve with reinforcement learning because they lack tractable sequence-level log-likelihood ratios, which forces existing methods to rely on high-variance ELBO-based approximations. This work proposes Relative Score Policy Optimization (RSPO), a reinforcement learning with verifiable rewards (RLVR) method that treats the reward advantage as a target for the relative log-ratio between the current and reference policies and uses verifiable rewards to calibrate the noisy likelihood estimates. The policy update is driven by the gap between the current estimate and that target rather than by the raw advantage alone. On mathematical reasoning and planning benchmarks, RSPO shows especially strong gains on planning tasks and competitive performance on mathematical reasoning.
📝 Abstract
Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.
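The "advantage as a target" idea can be illustrated with a small Python sketch. This is not the paper's implementation: the mean-centered group advantage, the scaling factor `beta`, the squared-gap loss, and all function names are assumptions chosen for illustration, and the log-ratio estimates are plain numbers standing in for ELBO-based estimates.

```python
from statistics import mean


def group_advantages(rewards):
    """Center verifier rewards within a group of sampled completions.

    Assumption: a simple mean baseline; the source does not specify the form.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def target_gaps(log_ratio_estimates, rewards, beta=1.0):
    """Gap between the reward-implied target log-ratio and the current estimate.

    Assumption: the target is advantage / beta, with beta a hypothetical
    scaling hyperparameter. A gap of zero means the estimate already matches
    the target, so that sample contributes no update.
    """
    advantages = group_advantages(rewards)
    return [adv / beta - est for est, adv in zip(log_ratio_estimates, advantages)]


def squared_gap_loss(log_ratio_estimates, rewards, beta=1.0):
    """Mean of 0.5 * gap^2 (an assumed regression-style objective)."""
    gaps = target_gaps(log_ratio_estimates, rewards, beta)
    return mean(0.5 * g * g for g in gaps)


# Four sampled completions with binary verifier rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
# Noisy estimates of log(pi_current / pi_reference) for each completion.
estimates = [0.5, -0.5, 0.0, 0.0]

gaps = target_gaps(estimates, rewards)
loss = squared_gap_loss(estimates, rewards)
```

In this toy group, the first two samples already sit at their targets and receive zero gap, even though the first has a high reward. An update weighted by the raw advantage would keep pushing on them, which is the amplification of inaccurate estimates that the abstract describes.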
Problem

Research questions and friction points this paper is trying to address.

diffusion language models
reinforcement learning
sequence-level log-ratios
policy optimization
reward calibration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Language Models
Reinforcement Learning with Verifiable Rewards
Relative Score Policy Optimization
Policy Optimization
Sequence-level Log-ratio