Preference Optimization with Multi-Sample Comparisons

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 8
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing post-training methods for generative models, such as RLHF and DPO, rely on pairwise preference comparisons over single samples, which limits their ability to model population-level properties like diversity and bias. This work proposes a preference optimization framework based on *multi-sample* comparisons, introducing two algorithms, mDPO and mIPO, that extend DPO and IPO from individual outputs to sets of generated outputs and directly optimize their collective characteristics. Experiments demonstrate that the framework outperforms single-sample baselines in enhancing output diversity, mitigating bias, and maintaining robustness under label noise, supporting both the effectiveness and the necessity of multi-sample comparison for optimizing population-level behavior in generative models.

📝 Abstract
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference (DAP) methods primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
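The abstract describes mDPO and mIPO only at a high level. As a rough illustration, here is a minimal sketch of what group-wise variants of the DPO and IPO losses could look like, assuming a group of m samples is scored by the sum of its per-sample policy/reference log-ratios; the function names, tensor shapes, and hyperparameter values below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multi-sample preference losses (not the paper's code).
# Assumption: a group of m generations is scored by summing the per-sample
# log-ratios log pi(y|x) - log pi_ref(y|x), and the usual DPO/IPO losses are
# then applied to the margin between the preferred and dispreferred groups.
import torch
import torch.nn.functional as F


def group_log_ratio(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """Group-level log-ratio for a set of m samples.

    policy_logps, ref_logps: (batch, m) sequence log-probabilities.
    Returns: (batch,) sum of per-sample log-ratios over each group.
    """
    return (policy_logps - ref_logps).sum(dim=-1)


def mdpo_loss(pol_w, ref_w, pol_l, ref_l, beta: float = 0.1) -> torch.Tensor:
    """mDPO-style loss: DPO's logistic loss on group-level margins.

    *_w are log-probs for the preferred group, *_l for the dispreferred
    group, each of shape (batch, m).
    """
    margin = group_log_ratio(pol_w, ref_w) - group_log_ratio(pol_l, ref_l)
    return -F.logsigmoid(beta * margin).mean()


def mipo_loss(pol_w, ref_w, pol_l, ref_l, tau: float = 0.1) -> torch.Tensor:
    """mIPO-style loss: IPO's squared regression target on the same margin,
    which the abstract suggests behaves more robustly under label noise."""
    margin = group_log_ratio(pol_w, ref_w) - group_log_ratio(pol_l, ref_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()


# Toy usage: batch of 4 prompts, groups of m = 8 samples each.
pol_w, ref_w = torch.randn(4, 8), torch.randn(4, 8)
pol_l, ref_l = torch.randn(4, 8), torch.randn(4, 8)
print(mdpo_loss(pol_w, ref_w, pol_l, ref_l), mipo_loss(pol_w, ref_w, pol_l, ref_l))
```

The relevant design point is that the preference margin is computed between two groups rather than two individual responses, so preference labels collected over sets (e.g., "this group of outputs is more diverse" or "less biased") can be optimized directly.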
Problem

Research questions and friction points this paper is trying to address.

Single-sample preference comparisons in RLHF and DAP post-training fail to capture population-level characteristics of generative models
Properties such as output diversity and bias are only observable across multiple samples, so pairwise single-sample objectives cannot optimize them directly
Label noise in preference datasets degrades the robustness of existing alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends post-training from single-sample to multi-sample preference comparisons
Introduces Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO) for group-wise optimization
Improves optimization of collective characteristics (diversity, bias) and robustness to label noise over single-sample baselines