Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the over-optimization problem in preference learning, which arises when the data-generating distribution is unknown, by proposing the PEPO algorithm. PEPO trains multiple preference-optimized policies on disjoint data subsets and aggregates them via a worst-case construction, enabling single-step DPO-style training without an explicit reward model or prior knowledge of the data distribution. Notably, PEPO provides sample complexity guarantees under the mild assumption of a bounded single-policy concentrability coefficient, thereby mitigating over-optimization risks. Both theoretical analysis and empirical evaluation show that PEPO generalizes significantly better than standard DPO while retaining DPO's algorithmic simplicity.

📝 Abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets, aggregated through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by strong practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
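The worst-case aggregation the abstract describes can be sketched as follows. This is an illustrative toy, not the paper's actual construction: the function name, the use of per-response scalar scores (e.g. implicit DPO rewards), and the toy numbers are all assumptions for exposition.

```python
import numpy as np

def pessimistic_aggregate(member_scores: np.ndarray) -> int:
    """Illustrative worst-case ensemble aggregation (hypothetical sketch).

    member_scores: shape (K, A) -- a score (e.g. an implicit reward)
    that each of K ensemble policies assigns to each of A candidate
    responses. A response's pessimistic score is its minimum over
    members, so the selected response is one the whole ensemble agrees
    is good -- the mechanism that curbs over-optimization.
    """
    worst_case = member_scores.min(axis=0)  # pessimistic score per response
    return int(worst_case.argmax())

# Toy example: 3 members trained on disjoint data subsets, 2 candidates.
# Response 0 has the higher average score, but one member strongly
# distrusts it; response 1 has consistent, moderate support.
scores = np.array([
    [3.0, 1.0],
    [-1.0, 0.9],
    [2.0, 1.1],
])
print(pessimistic_aggregate(scores))  # -> 1 (consensus beats average)
```

Averaging the members would pick response 0 here; the min-based aggregation instead penalizes responses that any member distrusts, which is the intuition behind favoring agreement across models.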
Problem

Research questions and friction points this paper addresses.

over-optimization
Direct Preference Optimization
preference learning
data distribution
pessimism
Innovation

Methods, ideas, or system contributions that make the work stand out.

PEPO
Direct Preference Optimization
over-optimization
pessimistic ensemble
concentrability coefficient