🤖 AI Summary
Diffusion language models (dLLMs) pose a challenge for reinforcement learning (RL) alignment because their likelihood is intractable, rendering standard policy gradient methods inapplicable. To address this, we propose a "sandwich" policy gradient framework that jointly constructs differentiable upper and lower surrogate bounds on the log-likelihood. Specifically, it combines the evidence lower bound (ELBO) with a one-step estimate to tightly approximate the true log-likelihood, thereby reducing the gradient bias inherent in one-sided approximations. Crucially, the method requires neither reparameterization nor an auxiliary likelihood-estimation model, preserving optimization fidelity. Empirical evaluation on GSM8K, MATH500, Countdown, and Sudoku demonstrates absolute accuracy improvements of 3.6%, 2.6%, 18.4%, and 27.0%, respectively, substantially outperforming existing RL alignment approaches for dLLMs.
📝 Abstract
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates such as the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on the ELBO or one-step estimation. Specifically, SPG improves accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
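The abstract describes the sandwiching idea only at a high level. As an illustration (not the paper's exact estimator), one way to combine the two bounds is to score positive-advantage samples with a lower bound on the log-likelihood (e.g. the ELBO) and negative-advantage samples with an upper bound, so the weighted surrogate never overestimates the true RL objective regardless of the advantage's sign. The function name and the numeric bound values below are hypothetical, for demonstration only.

```python
def sandwiched_surrogate(advantage, lower_bound, upper_bound):
    """Hypothetical sandwiched surrogate for advantage * log p(y|x).

    Uses the lower bound when the advantage is positive and the upper
    bound when it is negative, so the product is always a lower bound
    on the (intractable) true objective term.
    """
    return advantage * (lower_bound if advantage >= 0 else upper_bound)


# Illustrative values satisfying lower <= true log-likelihood <= upper.
lower, true_ll, upper = -12.0, -10.0, -9.0

# For either sign of the advantage, the surrogate under-estimates
# advantage * true_ll, so maximizing it cannot exploit bound error.
for adv in (+1.5, -1.5):
    assert sandwiched_surrogate(adv, lower, upper) <= adv * true_ll
```

This sign-dependent choice is one natural reading of "leveraging both an upper and a lower bound"; the paper's actual gradient estimator may differ in how the bounds are weighted and estimated.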