🤖 AI Summary
To address the high bias and limited robustness of single-pair contrastive learning in multi-response preference alignment, this paper proposes Simultaneous Weighted Preference Optimization (SWEPO). SWEPO extends Direct Preference Optimization (DPO) by dynamically constructing contrastive groups of multiple positive and negative responses per query and weighting each response by its deviation from the mean reward score. The resulting weighted group contrastive loss prioritizes responses that are substantially better or worse than average, and the accompanying theoretical analysis shows that optimizing over multiple preferences simultaneously reduces alignment bias and improves robustness. The paper further analyzes the training dynamics of the proposed loss and of the related InfoNCA objective. Trained on the UltraFeedback dataset, the resulting model achieves state-of-the-art performance on AlpacaEval's automated benchmark, outperforming baselines including DPO.
📝 Abstract
We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO) designed to accommodate multiple dynamically chosen positive and negative responses for each query. SWEPO employs a weighted group contrastive loss, assigning each response a weight based on its deviation from the mean reward score. This weighting prioritizes responses that are significantly better or worse than average, sharpening the optimization signal. Our theoretical analysis demonstrates that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment. Additionally, we provide insights into the training dynamics of our loss function and of the related InfoNCA objective. Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations on the AlpacaEval benchmark.
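The weighted group contrastive loss described above can be illustrated with a small sketch. This is an illustrative reconstruction, not the paper's exact objective: it assumes responses above the mean reward are treated as positives, that weights come from a softmax over the absolute deviation from the mean reward (the function name `swepo_loss` and the temperature `tau` are hypothetical), and that the loss takes an InfoNCE-style form contrasting the weighted positives against the whole group.

```python
import torch


def swepo_loss(logps: torch.Tensor, rewards: torch.Tensor,
               beta: float = 0.1, tau: float = 1.0) -> torch.Tensor:
    """Sketch of a weighted group contrastive loss (illustrative, not the paper's exact form).

    logps:   (n,) policy log-probabilities (e.g., minus reference log-probs)
             for n responses to a single query.
    rewards: (n,) scalar reward scores for the same responses.
    """
    # Deviation of each response from the group's mean reward.
    dev = rewards - rewards.mean()
    # Responses farther from the mean (in either direction) receive larger weights.
    w = torch.softmax(dev.abs() / tau, dim=0)
    # Scaled scores, as in DPO-style objectives.
    s = beta * logps
    # Positives: responses at or above the mean reward (an assumption here).
    pos = dev >= 0
    # InfoNCE-style group contrast: weighted positives vs. the full weighted group.
    num = torch.logsumexp(s[pos] + torch.log(w[pos]), dim=0)
    den = torch.logsumexp(s + torch.log(w), dim=0)
    return den - num  # equivalently -(num - den)
```

Under this construction, raising the log-probabilities of above-average responses (or lowering those of below-average ones) decreases the loss, and responses far from the mean reward dominate the gradient, which mirrors the prioritization the abstract describes.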