🤖 AI Summary
Preference learning lacks a mature theoretical foundation for evaluation.
Method: This paper proposes a unified theoretical framework centered on win rate—the probability that the model's response is preferred in pairwise comparisons drawn from the data distribution—and proves it is the only evaluation that respects both the preferences and the prevalences in that distribution. On this basis, mainstream methods are systematically categorized into win-rate optimization (WRO) and non-WRO classes.
Contribution/Results: The authors establish that WRO methods enjoy dual guarantees of statistical consistency and optimization tractability, whereas canonical non-WRO approaches—including DPO and SFT on preferred samples—lack these properties. Further analysis reveals that practical performance is constrained more by optimization difficulty than by objective design: optimization success is a stronger predictor of empirical performance than choices that affect the objective's solution. These results provide a principled foundation for evaluation, diagnosis, and algorithm design in preference learning.
📝 Abstract
Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, and related tasks. To close this gap, this work presents a framework for understanding preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point for understanding preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
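As a concrete illustration of the win-rate metric the abstract centers on, the sketch below estimates it by Monte Carlo: draw a prompt from the data distribution, sample one response from the model and one from a reference, and record which one a preference judge picks. The function names (`estimate_win_rate`, `judge`) and the specific sampling setup are illustrative assumptions, not the paper's notation.

```python
import random


def estimate_win_rate(model_sample, ref_sample, judge, prompts, n=1000, seed=0):
    """Monte Carlo estimate of win rate: the probability that the model's
    response is preferred over a reference response, averaged over prompts.

    model_sample, ref_sample: callables mapping a prompt to a response.
    judge: callable (prompt, model_resp, ref_resp) -> 1 if the model's
           response is preferred, else 0 (a stand-in for preference data).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        x = rng.choice(prompts)           # draw a prompt from the data distribution
        y_model = model_sample(x)         # model's response
        y_ref = ref_sample(x)             # reference response
        wins += judge(x, y_model, y_ref)  # 1 if the model's response wins
    return wins / n
```

A WRO method, in this framing, directly optimizes the model so that this quantity increases; non-WRO methods like DPO optimize surrogate objectives whose maximizers need not maximize it.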