Reinforcement Learning from Denoising Feedback

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses policy loss estimation, a fundamental challenge in reinforcement learning for diffusion language models (dLLMs), by introducing RLDF, a training paradigm that uses feedback from the rollout and training processes. To balance computational efficiency against estimation quality, the method optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Evaluated on two representative dLLM architectures, LLaDA and Dream, RLDF yields consistent improvements in performance and generalization across multiple reasoning benchmarks. The authors also release Drift, a training framework for dLLMs.

📝 Abstract

Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (dLLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state $\hat{x}_0$ from intermediate noisy states $x_t$, combined with weighted timestep sampling over $t$. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative dLLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for dLLMs, available at https://github.com/ant-research/Drift.
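The two ingredients named in the abstract, weighted timestep sampling over $t$ and training from a noisy state $x_t$ toward a clean target $\hat{x}_0$, can be illustrated with a small sketch. This is not the authors' implementation or the Drift API. It assumes a masked-diffusion forward process in which each token is masked independently with probability $t$. The timestep grid, the weights, the `MASK` token, and all function names are hypothetical, and the way RLDF clips $\hat{x}_0$ is not described in this summary, so the target below is simply a given token list.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for the mask token


def sample_timestep(grid, weights, rng):
    """Draw one timestep from a discrete grid according to the given weights."""
    return rng.choices(grid, weights=weights, k=1)[0]


def noise_sequence(x0, t, rng):
    """Assumed forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in x0]


def masked_positions(xt):
    """Indices where the model would have to predict the clean token."""
    return [i for i, tok in enumerate(xt) if tok == MASK]


rng = random.Random(0)
grid = [0.25, 0.5, 0.75, 1.0]    # illustrative timestep grid
weights = [1.0, 2.0, 2.0, 1.0]   # illustrative weights, not the paper's scheme
x0_hat = ["2", "+", "2", "=", "4"]  # stand-in for the clean target

t = sample_timestep(grid, weights, rng)
xt = noise_sequence(x0_hat, t, rng)

# Training targets: the clean tokens at the positions that were masked in x_t.
targets = {i: x0_hat[i] for i in masked_positions(xt)}
```

In a real training loop, a per-token loss at the positions in `targets` would presumably be scaled by a reward or advantage signal from the rollout; that step is omitted here because the summary does not specify it.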

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Diffusion Language Models

Policy Loss Estimation

Denoising Feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning from Denoising Feedback

Diffusion Language Models

Policy Loss Estimation

Clipped Clean State