🤖 AI Summary
This work addresses the challenge of credit assignment in masked diffusion language models, which learn from only a final reward and therefore struggle to attribute credit to intermediate fill-in decisions. To this end, the authors propose DiSPO, a plug-and-play credit-assignment method that resamples fill-in content at intermediate masked states from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, directly optimizing intermediate decisions without requiring additional multi-step inference. Its policy-gradient objective, conditioned on fixed intermediate states, combines branch-completion scores with terminal feedback and shares rollouts for efficient gradient estimation. Evaluated on LLaDA-8B-Instruct, DiSPO significantly outperforms the diffu-GRPO baseline (which uses only terminal feedback) on mathematical and planning tasks, with comparable computational overhead.
📝 Abstract
Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo.
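The branch-and-score step described in the abstract can be sketched as follows. This is a minimal REINFORCE-style illustration, not the authors' implementation: the function name `branch_and_score`, the toy reward, and the mean-reward baseline are all assumptions for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def branch_and_score(cached_logits, mask, reward_fn, n_branches=4):
    """At a fixed intermediate masked state, resample fillings for the
    currently masked positions from rollout-cached logits, score each
    branched completion, and form a policy-gradient surrogate loss that
    credits only the newly filled tokens (hypothetical sketch)."""
    probs = softmax(cached_logits)           # (seq_len, vocab_size)
    masked_pos = np.flatnonzero(mask)        # positions still masked
    branches, logps = [], []
    for _ in range(n_branches):
        # resample fill-in tokens from the cached per-position distributions
        tokens = np.array(
            [rng.choice(probs.shape[1], p=probs[i]) for i in masked_pos]
        )
        # log-probability of only the newly filled tokens
        lp = np.log(probs[masked_pos, tokens]).sum()
        branches.append(tokens)
        logps.append(lp)
    rewards = np.array([reward_fn(b) for b in branches], dtype=float)
    adv = rewards - rewards.mean()           # shared-rollout baseline
    # REINFORCE surrogate: gradient flows only through new fillings
    loss = -(adv * np.array(logps)).mean()
    return loss, branches, adv
```

Because the branches reuse cached logits from the original rollout, no extra multi-step denoising passes are needed to score intermediate decisions, which is the efficiency property the paper claims relative to terminal-feedback-only training.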