Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing alignment methods for diffusion models rely on binary pairwise preferences, which discard much of the information available when a prompt has multiple candidate images with continuous reward scores. This work proposes Diffusion LAIR, a listwise, reward-aware preference optimization framework for diffusion models. For each prompt, the method converts the continuous rewards of a group of candidate images into centered advantage weights and optimizes an advantage-weighted regression objective on an implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty on the implicit reward's magnitude. The objective uses all candidates jointly and admits a bounded closed-form optimum in implicit-reward space, so the regularization strength directly controls the size of the preference update. Diffusion LAIR outperforms strong preference optimization baselines on both Stable Diffusion 1.5 and SDXL across three tasks: text-to-image generation, compositional generation, and image editing.

📝 Abstract

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
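The abstract describes the objective only in words, so the following is a minimal sketch of one form it could take, not the paper's exact loss. It assumes the implicit reward for candidate i is s_i = loss_ref_i - loss_model_i, the advantage is the reward minus the per-prompt mean, and the objective is the group mean of -A_i * s_i + (lam / 2) * s_i^2. Under that assumed form the optimum is s_i = A_i / lam, which is bounded whenever the rewards are bounded. All function names, the symbol lam, and the absence of any timestep weighting or scaling are illustrative assumptions.

```python
import numpy as np


def centered_advantages(rewards):
    """Subtract the per-prompt mean so the advantages sum to zero in the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()


def implicit_rewards(loss_ref, loss_model):
    """Assumed implicit reward: denoising-loss improvement over the frozen reference."""
    ref = np.asarray(loss_ref, dtype=np.float64)
    cur = np.asarray(loss_model, dtype=np.float64)
    return ref - cur


def listwise_loss(rewards, loss_ref, loss_model, lam=1.0):
    """Assumed objective: advantage-weighted regression plus a quadratic penalty.

    Every candidate in the group contributes; no winner/loser pair is selected.
    """
    a = centered_advantages(rewards)
    s = implicit_rewards(loss_ref, loss_model)
    return float(np.mean(-a * s + 0.5 * lam * s ** 2))


def closed_form_optimum(rewards, lam=1.0):
    """Minimizer of the assumed objective in implicit-reward space: s_i = A_i / lam."""
    return centered_advantages(rewards) / lam


if __name__ == "__main__":
    # Hypothetical reward scores for four candidate images of one prompt.
    rewards = [0.2, 0.5, 0.9, 0.4]
    print(centered_advantages(rewards))
    print(closed_form_optimum(rewards, lam=2.0))
```

In this sketch a larger lam shrinks the target implicit rewards toward zero, which keeps the model closer to the reference; a smaller lam permits a larger preference update.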

Problem

Research questions and friction points this paper is trying to address.

preference optimization

diffusion models

listwise preference

reward-aware alignment

human feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

listwise preference optimization

reward-aware alignment

diffusion models