🤖 AI Summary
Existing post-training methods for generative models, such as RLHF and DPO, rely on pairwise preference comparisons over single samples, limiting their ability to model population-level properties like diversity and bias. This work proposes the first preference optimization framework based on *multi-sample* comparisons, introducing two algorithms: mDPO and mIPO. These methods extend DPO and IPO to compare groups of samples rather than individual ones, directly optimizing collective characteristics of generated outputs at the set level. Experiments show that the proposed framework significantly outperforms single-sample baselines at enhancing output diversity, mitigating bias, and remaining robust under label noise. The results support multi-sample comparison as both effective and necessary for modeling and optimizing population-level behavioral traits in generative models.
📝 Abstract
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective than single-sample comparison for optimizing collective characteristics (e.g., diversity and bias) of generative models. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
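To make the group-wise idea concrete, here is a minimal sketch of how a DPO-style loss could be lifted from single samples to groups: aggregate the policy-vs-reference log-ratios over each group, then apply the usual logistic loss to the group-level margin. The function name `mdpo_loss`, the choice of averaging within a group, and all numeric values are illustrative assumptions, not the paper's exact formulation.

```python
import math


def mdpo_loss(policy_logps_w, ref_logps_w, policy_logps_l, ref_logps_l, beta=0.1):
    """Sketch of a multi-sample DPO-style loss for one pair of *groups*.

    Each argument is a list of per-sample sequence log-probabilities under
    the policy or the frozen reference model, for the preferred (w) and
    dispreferred (l) groups. Averaging the log-ratios within a group is an
    assumption; a sum or another group statistic would also be plausible.
    """
    def group_ratio(policy_logps, ref_logps):
        # Mean policy-vs-reference log-ratio over the group.
        return sum(p - r for p, r in zip(policy_logps, ref_logps)) / len(policy_logps)

    margin = group_ratio(policy_logps_w, ref_logps_w) - group_ratio(policy_logps_l, ref_logps_l)
    # Standard DPO logistic loss, applied to the group-level margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With identical groups the margin is zero and the loss is log 2, exactly as in single-sample DPO; as the preferred group's aggregate log-ratio grows relative to the dispreferred group's, the loss decreases. An mIPO variant would instead replace the logistic loss with a squared regression target on the same group-level margin.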