Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge that diffusion-based large language models (dLLMs) lack a token-level conditional probability decomposition, which renders them incompatible with standard token-level reinforcement learning (RL) methods such as GRPO, this paper proposes the first principled, end-to-end sequence-level RL framework. Methodologically, it builds a differentiable, stable, and scalable policy optimization objective by using the evidence lower bound (ELBO), derived via variational inference, as a sequence-level likelihood surrogate, and further incorporates per-token importance-weight normalization and robust KL-divergence estimation. Crucially, the framework supports large-scale training without requiring autoregressive assumptions. Empirically, it achieves significant improvements over baselines across mathematical reasoning, code generation, and planning tasks: +20–40 points on Countdown, and consistent gains on mainstream math and coding benchmarks.

📝 Abstract
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
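The abstract's core move is to treat the whole generated sequence as a single action and plug an ELBO estimate in where token-level RL would use a log-probability. As a rough illustration only (the function name, argument shapes, and the exact normalization scheme are assumptions, not the authors' released implementation at the repository above), a GRPO-style clipped surrogate with per-token normalization of the sequence-level importance ratio might look like:

```python
import math

def espo_surrogate_loss(elbo_new, elbo_old, advantages, num_tokens, clip_eps=0.2):
    """Sketch of a sequence-level clipped surrogate (GRPO/PPO style).

    elbo_new / elbo_old: Monte-Carlo ELBO estimates of the sequence
    log-likelihood under the current and behavior policies (one per sample).
    advantages: group-normalized rewards, one per sample.
    num_tokens: sequence lengths, used to normalize the importance
    ratio per token for stability (exact scheme assumed here).
    """
    losses = []
    for ln, lo, adv, n in zip(elbo_new, elbo_old, advantages, num_tokens):
        # Sequence-level log importance ratio, normalized per token.
        log_ratio = (ln - lo) / n
        ratio = math.exp(log_ratio)
        # Standard PPO-style clipping of the importance ratio.
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Note how the per-token division keeps `exp(log_ratio)` from exploding on long sequences, which is the stability concern the abstract's "per-token normalization of importance ratios" addresses.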
Problem

Research questions and friction points this paper is trying to address.

Adapts RL to diffusion LLMs lacking token-level probabilities
Proposes sequence-level RL using ELBO as likelihood proxy
Enhances performance on reasoning, coding, and planning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence-level policy optimization using ELBO proxy
Per-token normalization for stable importance ratios
Robust KL-divergence estimation in large-scale training
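The paper only says "robust KL-divergence estimation" without specifying the estimator here, so as a hedged sketch: a common low-variance, always-nonnegative choice in RLHF-style training is the so-called k3 Monte-Carlo estimator, which this snippet illustrates (whether ESPO uses exactly this form is an assumption):

```python
import math

def robust_kl_estimate(logp_current, logp_reference):
    """Monte-Carlo estimate of KL(current || reference) from samples
    drawn under the current policy, using the 'k3' estimator:
    (r - 1) - log r with r = p_ref / p_cur. It is unbiased,
    nonnegative for every sample, and lower-variance than -log r alone.
    """
    vals = []
    for lp, lq in zip(logp_current, logp_reference):
        log_r = lq - lp                               # log(p_ref / p_cur)
        vals.append(math.exp(log_r) - 1.0 - log_r)    # convex, >= 0
    return sum(vals) / len(vals)
```

Because each per-sample term is nonnegative, the KL penalty never pushes the gradient in the wrong direction on unlucky samples, which is what makes this family of estimators attractive for large-scale training.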
👥 Authors

Jingyang Ou
Gaoling School of Artificial Intelligence, Renmin University of China
Jiaqi Han
Stanford University
Minkai Xu
Stanford University
Generative AI
Shaoxuan Xu
Gaoling School of Artificial Intelligence, Renmin University of China
Jianwen Xie
Research Scientist
Generative Models, AI for Science, Computer Vision
Stefano Ermon
Stanford University
Artificial Intelligence, Machine Learning
Yi Wu
Tsinghua University
Chongxuan Li
Associate Professor, Renmin University of China
Machine Learning, Generative Models, Deep Learning