🤖 AI Summary
This work addresses the over-optimization problem in preference learning that arises when the data-generating distribution is unknown, by proposing the PEPO algorithm. PEPO trains multiple preference-optimization policies on disjoint data subsets and aggregates them via a worst-case (pessimistic) combination, enabling single-step DPO-style training without an explicit reward model or prior knowledge of the data distribution. Notably, PEPO provides the first sample complexity guarantee under the mild assumption of a bounded single-policy concentrability coefficient, effectively mitigating the risk of over-optimization. Both theoretical analysis and empirical evaluation demonstrate that PEPO generalizes significantly better than standard DPO while retaining algorithmic simplicity.
📝 Abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step, Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets, which it aggregates through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. These theoretical findings are corroborated by convincing practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
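The ensemble-plus-pessimism mechanism described in the abstract can be sketched in a few lines. This is a toy illustration under loud assumptions: frequency counts stand in for actual DPO training, the function names (`train_policy`, `pepo_select`) and the `floor` parameter are invented for the sketch, and none of it is the paper's implementation.

```python
import math

# Hypothetical sketch of PEPO-style pessimistic aggregation. Each "policy"
# is a toy map from responses to log-probabilities, standing in for a
# DPO-trained model; all names here are illustrative assumptions.

def train_policy(shard):
    """Toy stand-in for DPO-style training on one data shard: score each
    response by how often it was preferred, with add-one smoothing."""
    counts = {}
    for preferred, rejected in shard:
        counts[preferred] = counts.get(preferred, 0) + 1
        counts.setdefault(rejected, 0)
    total = sum(counts.values())
    return {r: math.log((c + 1) / (total + len(counts)))
            for r, c in counts.items()}

def pepo_select(policies, candidates, floor=-10.0):
    """Pessimistic aggregation: score each candidate by the WORST
    log-probability any ensemble member assigns it, then pick the best.
    Responses the ensemble disagrees on are dragged down by the min."""
    return max(candidates,
               key=lambda r: min(p.get(r, floor) for p in policies))

# Disjoint shards of (preferred, rejected) preference pairs.
data = [("A", "B"), ("A", "C"), ("A", "B"),
        ("B", "C"), ("A", "C"), ("A", "B")]
k = 3
shards = [data[i::k] for i in range(k)]        # disjoint subsets
policies = [train_policy(s) for s in shards]   # ensemble of k policies
best = pepo_select(policies, ["A", "B", "C"])  # worst-case aggregation
```

Taking the minimum over ensemble members is what implements pessimism here: a response only scores well if every independently trained policy rates it well, so spuriously over-optimized responses supported by a single shard are suppressed.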