🤖 AI Summary
To address the high bias and limited robustness of single-pair contrastive learning in multi-response preference alignment, this paper proposes Simultaneous Weighted Preference Optimization (SWEPO). SWEPO extends Direct Preference Optimization (DPO) by dynamically constructing contrastive groups of multiple positive and negative responses per query and weighting each response by its deviation from the mean reward score. The resulting weighted group contrastive loss prioritizes responses that are substantially better or worse than average, and the accompanying theoretical analysis shows that optimizing over multiple preferences simultaneously reduces alignment bias and improves robustness. The paper further analyzes the training dynamics of the proposed loss and of the related InfoNCA objective. Trained on the UltraFeedback dataset, the resulting model achieves state-of-the-art performance on AlpacaEval's automated benchmark, outperforming baselines including DPO.
📝 Abstract
We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO) designed to accommodate multiple dynamically chosen positive and negative responses for each query. SWEPO employs a weighted group contrastive loss, assigning each response a weight based on its deviation from the mean reward score. This weighting prioritizes responses that are significantly better or worse than average, sharpening the optimization signal. Our theoretical analysis demonstrates that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment. Additionally, we provide insights into the training dynamics of our loss function and of the related InfoNCA objective. Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations on the AlpacaEval benchmark.
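The weighted group contrastive loss described above can be illustrated with a small sketch. This is an illustrative reconstruction, not the paper's exact objective: it assumes responses above the mean reward are treated as positives, that weights come from a softmax over the absolute deviation from the mean reward (the function name `swepo_loss` and the temperature `tau` are hypothetical), and that the loss takes an InfoNCE-style form contrasting the weighted positives against the whole group.

```python
import torch


def swepo_loss(logps: torch.Tensor, rewards: torch.Tensor,
               beta: float = 0.1, tau: float = 1.0) -> torch.Tensor:
    """Sketch of a weighted group contrastive loss (illustrative, not the paper's exact form).

    logps:   (n,) policy log-probabilities (e.g., minus reference log-probs)
             for n responses to a single query.
    rewards: (n,) scalar reward scores for the same responses.
    """
    # Deviation of each response from the group's mean reward.
    dev = rewards - rewards.mean()
    # Responses farther from the mean (in either direction) receive larger weights.
    w = torch.softmax(dev.abs() / tau, dim=0)
    # Scaled scores, as in DPO-style objectives.
    s = beta * logps
    # Positives: responses at or above the mean reward (an assumption here).
    pos = dev >= 0
    # InfoNCE-style group contrast: weighted positives vs. the full weighted group.
    num = torch.logsumexp(s[pos] + torch.log(w[pos]), dim=0)
    den = torch.logsumexp(s + torch.log(w), dim=0)
    return den - num  # equivalently -(num - den)
```

Under this construction, raising the log-probabilities of above-average responses (or lowering those of below-average ones) decreases the loss, and responses far from the mean reward dominate the gradient, which mirrors the prioritization the abstract describes.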