🤖 AI Summary
To address two reliability bottlenecks in reinforcement learning (RL) for diffusion-based large language models (dLLMs), namely inaccurate advantage estimation and substantial prediction-probability bias, this paper proposes the first reliable RL framework based on tree-structured rollouts and bottom-up advantage computation. Methodologically, the authors design a tree-structured rollout mechanism with verifiable stepwise reward assignment, theoretically establish a negative correlation between prediction confidence and probability-estimation error, and introduce a temporal self-distillation loss to calibrate late-stage predictions. The contributions include (i) a formal bias-theoretic analysis with a general calibration paradigm and (ii) verifiability guarantees for both advantage and probability estimation. Experiments demonstrate significant improvements of +86.2, +51.6, +4.5, and +5.3 points on Sudoku, Countdown, GSM8K, and Math500, respectively, substantially outperforming baselines. Ablation and efficiency analyses confirm the method's practicality and computational effectiveness.
📝 Abstract
Reliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose *d*-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that *d*-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices.
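To make the bottom-up advantage computation concrete, here is a minimal sketch of one plausible reading: leaves of the rollout tree carry verifiable outcome rewards, each internal node's value is aggregated from its children, and the step-wise advantage of a parent-to-child transition is the child's value minus the parent's. The `Node` structure, the mean aggregation, and the value-difference advantage are illustrative assumptions, not the paper's exact definitions.

```python
# Hypothetical sketch of bottom-up advantage computation over a rollout tree.
# Aggregation by mean and value-difference advantages are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    reward: Optional[float] = None          # verifiable outcome reward (leaves only)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0

def compute_values(node: Node) -> float:
    """Bottom-up pass: a leaf's value is its outcome reward; an internal
    node's value is the mean of its children's values."""
    if not node.children:
        node.value = node.reward
        return node.value
    node.value = sum(compute_values(c) for c in node.children) / len(node.children)
    return node.value

def assign_advantages(node: Node) -> None:
    """Step-wise advantage of the transition parent -> child: how much the
    child's value exceeds the parent's expected value."""
    for child in node.children:
        child.advantage = child.value - node.value
        assign_advantages(child)

# Usage: two branches from the root, one verified correct (reward 1.0)
# and one incorrect (reward 0.0).
root = Node(children=[Node(reward=1.0), Node(reward=0.0)])
compute_values(root)       # root.value = 0.5
assign_advantages(root)    # child advantages: +0.5 and -0.5
```

This gives every intermediate step a fine-grained signal derived purely from verifiable leaf outcomes, which is the property the abstract attributes to the tree-structured rollout design.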
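The time-scheduled self-distillation idea can likewise be sketched in a few lines: a weight that ramps up over training so the distillation term matters mainly in later stages, and a loss that pulls the model's predictions toward a sharpened (more confident) copy of themselves. The linear ramp, temperature sharpening, and cross-entropy form below are assumptions for illustration, not the paper's actual loss.

```python
# Hypothetical sketch of a time-scheduled self-distillation loss.
# Linear schedule and temperature sharpening are illustrative assumptions.
import math

def distill_weight(step: int, total_steps: int, w_max: float = 1.0) -> float:
    """Time schedule: the self-distillation weight ramps up linearly so the
    term only dominates in later training stages."""
    return w_max * step / total_steps

def sharpen(probs, temperature: float = 0.5):
    """Sharpened targets: raise each probability to 1/temperature and
    renormalize, which increases prediction confidence."""
    powered = [p ** (1.0 / temperature) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]

def self_distill_loss(probs, temperature: float = 0.5) -> float:
    """Cross-entropy between the model's own sharpened predictions
    (treated as fixed targets) and its current predictions."""
    targets = sharpen(probs, temperature)
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if p > 0)
```

Under this sketch, already-confident distributions incur a smaller loss than near-uniform ones, so minimizing the term pushes late-stage predictions toward higher confidence, which, per the paper's analysis, is exactly the regime where single-forward-pass probability estimates are least biased.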