Aligning Diffusion Language Models via Unpaired Preference Optimization

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (dLLMs) face significant challenges in human preference alignment due to the intractability of sequence-level likelihood computation and the high cost of acquiring pairwise preference labels. To address this, we propose the first alignment framework for dLLMs that requires no pairwise annotations: it jointly leverages an ELBO-based approximation of the log-likelihood and a prospect-theory-inspired, unpaired KTO objective, augmented with a gradient variance reduction strategy to ensure training stability. Our method enables end-to-end training on LLaDA-8B-Instruct, achieving adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary. It also matches or exceeds baseline performance on downstream tasks including GSM8K and MMLU. The core contribution is the first unified modeling of ELBO and KTO for dLLMs, establishing a new paradigm that achieves high alignment quality with minimal labeling overhead.

📝 Abstract
Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman-Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
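The core idea of the abstract can be sketched in code: an unpaired, KTO-style per-example loss in which the intractable sequence log-likelihood is replaced by its ELBO estimate. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, reference-point handling, and default hyperparameters are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def elbo_kto_loss(elbo_policy, elbo_ref, z_ref, desirable,
                  beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style unpaired loss where the diffusion log-likelihood
    is approximated by its ELBO (illustrative sketch; names and
    default values are assumptions, not the paper's)."""
    # implied reward: ELBO gap between policy and reference model,
    # standing in for the log-likelihood ratio log(pi_theta / pi_ref)
    r = elbo_policy - elbo_ref
    if desirable:
        # desirable outputs: push the reward above the reference point z_ref
        return lambda_d * (1.0 - sigmoid(beta * (r - z_ref)))
    # undesirable outputs: push the reward below the reference point z_ref
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - r)))
```

Because each example carries only a binary desirable/undesirable label, no paired (chosen, rejected) completions for the same prompt are ever required, which is what removes the pairwise-annotation cost.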
Problem

Research questions and friction points this paper is trying to address.

Aligning diffusion language models with human preferences
Overcoming intractable sequence likelihoods in diffusion models
Reducing reliance on costly pairwise preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

ELBO surrogate for diffusion log-likelihoods
Unpaired preference objective with KTO
Variance-reduction practices for stable training
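The last bullet, variance reduction for the Monte Carlo ELBO estimate, can be illustrated with one common practice: stratified sampling of diffusion timesteps instead of i.i.d. draws. This is a stand-in sketch under that assumption; the paper's actual stabilization strategy may differ, and `per_t_loss` is a hypothetical callable.

```python
import random

def elbo_estimate(per_t_loss, num_samples=8, stratified=True):
    """Monte Carlo ELBO estimate averaged over sampled diffusion
    timesteps t in (0, 1]. Stratified sampling draws one t per
    equal-width stratum, covering the interval evenly and lowering
    estimator (and hence gradient) variance versus i.i.d. draws.
    per_t_loss: callable mapping a timestep t to its ELBO term."""
    if stratified:
        # one uniform draw inside each of the num_samples strata
        ts = [(i + random.random()) / num_samples for i in range(num_samples)]
    else:
        ts = [random.random() for _ in range(num_samples)]
    return sum(per_t_loss(t) for t in ts) / num_samples
```

For a linear test integrand such as `per_t_loss = lambda t: t` (true mean 0.5), the stratified estimate is guaranteed to fall in a narrow band around 0.5 regardless of the random draws, whereas the i.i.d. estimate can land anywhere in (0, 1).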
Vaibhav Jindal
LinkedIn Corporation, CA, USA
Hejian Sang
LinkedIn
Chun-Mao Lai
University of California San Diego, CA, USA
Yanning Chen
LinkedIn Corporation, CA, USA
Zhipeng Wang
LinkedIn Corporation, CA, USA