🤖 AI Summary
This work addresses the over-optimization problem in preference learning that arises when the data-generating distribution is unknown, by proposing the PEPO algorithm. PEPO trains multiple preference-optimization policies on disjoint data subsets and aggregates them via a worst-case (pessimistic) combination, enabling single-step DPO-style training without an explicit reward model or prior knowledge of the data distribution. Notably, PEPO provides the first sample complexity guarantee under the mild assumption of a bounded single-policy concentrability coefficient, effectively mitigating the risk of over-optimization. Both theoretical analysis and empirical evaluation demonstrate that PEPO generalizes significantly better than standard DPO while retaining algorithmic simplicity.
📝 Abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step, Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets, which it aggregates through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. These theoretical findings are corroborated by convincing practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
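The ensemble-plus-pessimism mechanism described in the abstract can be sketched in a few lines. This is a toy illustration under loud assumptions: frequency counts stand in for actual DPO training, the function names (`train_policy`, `pepo_select`) and the `floor` parameter are invented for the sketch, and none of it is the paper's implementation.

```python
import math

# Hypothetical sketch of PEPO-style pessimistic aggregation. Each "policy"
# is a toy map from responses to log-probabilities, standing in for a
# DPO-trained model; all names here are illustrative assumptions.

def train_policy(shard):
    """Toy stand-in for DPO-style training on one data shard: score each
    response by how often it was preferred, with add-one smoothing."""
    counts = {}
    for preferred, rejected in shard:
        counts[preferred] = counts.get(preferred, 0) + 1
        counts.setdefault(rejected, 0)
    total = sum(counts.values())
    return {r: math.log((c + 1) / (total + len(counts)))
            for r, c in counts.items()}

def pepo_select(policies, candidates, floor=-10.0):
    """Pessimistic aggregation: score each candidate by the WORST
    log-probability any ensemble member assigns it, then pick the best.
    Responses the ensemble disagrees on are dragged down by the min."""
    return max(candidates,
               key=lambda r: min(p.get(r, floor) for p in policies))

# Disjoint shards of (preferred, rejected) preference pairs.
data = [("A", "B"), ("A", "C"), ("A", "B"),
        ("B", "C"), ("A", "C"), ("A", "B")]
k = 3
shards = [data[i::k] for i in range(k)]        # disjoint subsets
policies = [train_policy(s) for s in shards]   # ensemble of k policies
best = pepo_select(policies, ["A", "B", "C"])  # worst-case aggregation
```

Taking the minimum over ensemble members is what implements pessimism here: a response only scores well if every independently trained policy rates it well, so spuriously over-optimized responses supported by a single shard are suppressed.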