🤖 AI Summary
Diffusion language models (dLLMs) face significant challenges in human preference alignment: sequence-level likelihoods are intractable to compute, and pairwise preference labels are costly to acquire. To address this, we propose the first alignment framework for dLLMs that requires no pairwise annotations: it combines an ELBO-based approximation of the log-likelihood with a prospect-theory-inspired, unpaired KTO objective, augmented with a gradient variance reduction strategy to ensure training stability. Our method enables end-to-end training on LLaDA-8B-Instruct, achieving adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary, and it matches or exceeds baseline performance on downstream tasks including GSM8K and MMLU. The core contribution is the first unified treatment of the ELBO surrogate and the KTO objective for dLLMs, establishing a paradigm that achieves high alignment quality with minimal labeling overhead.
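To make the ELBO-based likelihood approximation concrete, here is a minimal Monte Carlo sketch of a masked-diffusion ELBO estimator. It is illustrative only: `masked_logprob_fn`, `num_samples`, and the uniform masking-ratio schedule are assumptions standing in for a dLLM forward pass (e.g. LLaDA) and the paper's actual estimator.

```python
import numpy as np

def elbo_estimate(masked_logprob_fn, tokens, num_samples=8, rng=None):
    """Monte Carlo sketch of a masked-diffusion ELBO (a lower bound on log p(x)).

    masked_logprob_fn(tokens, mask) is assumed to return the model's summed
    log-probability of the masked positions given the visible ones; it stands
    in here for an actual dLLM forward pass.
    """
    rng = rng or np.random.default_rng(0)
    n = len(tokens)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(1e-3, 1.0)        # sampled masking ratio
        mask = rng.random(n) < t          # Bernoulli(t) token mask
        # The 1/t weight corrects for the expected fraction of masked tokens;
        # its high variance at small t is what motivates variance reduction.
        total += masked_logprob_fn(tokens, mask) / t
    return total / num_samples
```

Because the `1/t` importance weight blows up for small masking ratios, raw estimates of this form have high gradient variance, which is why a variance-reduction strategy is needed for stable training.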
📝 Abstract
Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman-Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields **65.9%** and **62.3%** adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
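The unpaired objective described above can be sketched as a standard KTO loss with ELBO estimates substituted for exact log-likelihoods. This is a simplified illustration, not the paper's implementation: `kto_loss` and its arguments are hypothetical names, and the KL reference point `z_ref` (a batch-level statistic in full KTO) is reduced to a fixed constant.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(elbo_policy, elbo_ref, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0, z_ref=0.0):
    """Unpaired KTO-style loss using ELBO surrogates for log-likelihoods.

    elbo_policy, elbo_ref: per-example ELBO estimates (nats) under the policy
    and a frozen reference model. is_desirable: boolean labels marking each
    unpaired completion as liked or disliked (no pairwise comparisons needed).
    """
    reward = beta * (elbo_policy - elbo_ref)   # implicit per-example reward
    margin = reward - z_ref
    loss_desirable = lambda_d * (1.0 - sigmoid(margin))      # pull up liked outputs
    loss_undesirable = lambda_u * (1.0 - sigmoid(-margin))   # push down disliked ones
    return np.where(is_desirable, loss_desirable, loss_undesirable).mean()
```

A desirable completion whose policy ELBO exceeds the reference ELBO incurs a loss below `0.5 * lambda_d`, so gradient descent raises the policy's likelihood of liked outputs and lowers it for disliked ones, using only per-example labels.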