Multi-dimensional Preference Alignment by Conditioning Reward Itself

📅 2025-12-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Standard DPO forces multi-dimensional human preferences—e.g., aesthetics and semantic alignment—into a single scalar reward via the Bradley-Terry model. This induces inter-dimensional optimization conflicts: the model is pushed to unlearn desirable features of individual dimensions when they appear in globally non-preferred samples. This work proposes a conditional multi-reward DPO framework that enables dimension-wise independent alignment within a single diffusion model. Key contributions include: (1) decoupling the Bradley-Terry objective via preference-vector conditioning; (2) introducing dimension-wise reward dropout to ensure balanced multi-axis optimization; and (3) supporting dynamic, fine-tuning-free multi-axis Classifier-Free Guidance at inference time. Evaluated on Stable Diffusion 1.5 and SDXL, the method significantly improves multi-dimensional alignment fidelity. Crucially, it enables real-time amplification of any preference dimension during inference—without additional training or external reward models.
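The summary above describes a per-axis Bradley-Terry objective conditioned on a preference outcome vector, with dimension-wise reward dropout. The paper's exact loss is not reproduced here; the following is a minimal numpy sketch of the general idea, assuming per-dimension log policy/reference ratios are available and that the preference vector takes values in {+1, -1} per axis (all function and argument names are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional_multi_reward_dpo_loss(logratio_a, logratio_b, pref, drop_mask, beta=0.1):
    """Sketch of a dimension-wise conditional DPO loss (illustrative only).

    logratio_a, logratio_b: per-axis log(policy/reference) ratios for the
        two samples in a pair, shape (K,) -- one entry per reward axis.
    pref: preference outcome vector in {+1, -1}^K; +1 means sample A is
        preferred on that axis, -1 means sample B is.
    drop_mask: {0, 1}^K mask from dimension-wise reward dropout, so that
        no single axis dominates the update.
    beta: DPO temperature.
    """
    # Signed margin per axis: flipping the pair ordering where B wins lets
    # every axis contribute its own Bradley-Terry term independently.
    margin = pref * (logratio_a - logratio_b)
    per_axis = -np.log(sigmoid(beta * margin))
    # Average only over the axes kept by dropout.
    kept = drop_mask.sum()
    return float((per_axis * drop_mask).sum() / max(kept, 1))
```

With this shape, a sample that loses globally but wins on one axis still receives a positive learning signal on that axis, which is the conflict-resolution behavior the summary describes.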

📝 Abstract
Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation: it relies on the Bradley-Terry model to aggregate diverse evaluation axes, such as aesthetic quality and semantic alignment, into a single scalar reward. This aggregation creates a reward conflict in which the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi-Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior benchmark performance. Notably, our conditional framework enables dynamic, multi-axis control at inference time, using Classifier-Free Guidance to amplify specific reward dimensions without additional training or external reward models.
Problem

Research questions and friction points this paper is trying to address.

Addresses reward conflict in DPO formulation
Resolves multi-dimensional preference aggregation issue
Enables independent optimization across reward axes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces disentangled Bradley-Terry objective for reward conflicts
Uses preference outcome vector as condition for independent optimization
Applies dimensional reward dropout for balanced multi-axis control
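The inference-time control mentioned above combines per-axis conditional predictions with Classifier-Free Guidance weights. The paper's exact guidance rule is not given in this summary; a minimal sketch of a standard multi-condition CFG combination, assuming one conditional noise prediction per preference axis (names are hypothetical):

```python
import numpy as np

def multi_axis_cfg(eps_uncond, eps_cond, weights):
    """Illustrative multi-axis classifier-free guidance combination.

    eps_uncond: unconditional noise prediction (any array shape).
    eps_cond: dict mapping axis name -> noise prediction obtained with
        that axis's preference condition set to "preferred".
    weights: dict mapping axis name -> guidance scale (0 disables an axis).
    """
    out = eps_uncond.copy()
    for axis, eps_k in eps_cond.items():
        w = weights.get(axis, 0.0)
        # Each axis contributes its own guidance direction, scaled
        # independently -- this is what allows per-dimension control
        # at inference time without retraining.
        out = out + w * (eps_k - eps_uncond)
    return out
```

Because each axis's weight can be changed per sampling step, a user can, for example, boost the aesthetics axis while leaving semantic alignment at its default, with no fine-tuning or external reward model.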