What Does Preference Learning Recover from Pairwise Comparison Data?

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
When real-world preference data violate the assumptions of the Bradley–Terry (BT) model, it remains unclear what the BT learning procedure actually recovers. This work addresses that gap: it formalizes the preference information in triplet comparison data through the conditional preference response distribution (CPRD) and establishes, for the first time, a data-centric theoretical framework characterizing the target recovered by the BT model under non-ideal conditions. By integrating graph connectivity, statistical learning theory, and BT model analysis, we precisely delineate the conditions under which the BT model is valid and reveal the critical roles of the marginal distribution and preference graph connectivity in determining sample efficiency. Our results provide a rigorous theoretical foundation for preference learning and alignment tasks under general preference data settings.

๐Ÿ“ Abstract
Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley–Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference response distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency -- namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.
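The BT learning procedure the abstract refers to can be illustrated concretely: given triplets $(x, y^+, y^-)$, one fits latent scores $s(x, y)$ by maximizing the likelihood $P(y^+ \succ y^- \mid x) = \sigma(s(x, y^+) - s(x, y^-))$. The sketch below (plain Python, tabular scores, batch gradient descent; all names are illustrative, not from the paper) fits this model to toy triplet data. Whether the recovered scores mean anything when the data do not actually follow the BT model is precisely the question the paper studies.

```python
# Minimal sketch of Bradley-Terry preference learning on triplet data.
# Scores are a lookup table over (context, response) pairs; a hypothetical
# toy setup, not the paper's method or experiments.
import math

def fit_bt(triplets, steps=2000, lr=0.1):
    """Fit latent scores s[(x, y)] by gradient descent on the BT
    negative log-likelihood, -sum log sigmoid(s[x,y+] - s[x,y-])."""
    keys = {k for x, yp, ym in triplets for k in ((x, yp), (x, ym))}
    scores = {k: 0.0 for k in keys}
    for _ in range(steps):
        grads = {k: 0.0 for k in keys}
        for x, yp, ym in triplets:
            diff = scores[(x, yp)] - scores[(x, ym)]
            p_lose = 1.0 / (1.0 + math.exp(diff))  # 1 - sigmoid(diff)
            # gradient of -log sigmoid(diff) w.r.t. the two scores
            grads[(x, yp)] -= p_lose
            grads[(x, ym)] += p_lose
        for k in keys:
            scores[k] -= lr * grads[k]
    return scores

# Toy data: for context "q", response "a" beats "b" 3 times out of 4,
# so the fitted score gap should satisfy sigmoid(gap) ~ 3/4.
data = [("q", "a", "b")] * 3 + [("q", "b", "a")]
s = fit_bt(data)
gap = s[("q", "a")] - s[("q", "b")]
print(round(1.0 / (1.0 + math.exp(-gap)), 2))  # -> 0.75
```

Note that only score *differences* within a context are identified: adding a constant to every response's score for a given context leaves the likelihood unchanged, which is one reason the preference graph's connectivity (emphasized in the abstract) matters for recovery.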
Problem

Research questions and friction points this paper is trying to address.

preference learning
pairwise comparison
Bradley-Terry model
conditional preference distribution
model misspecification
Innovation

Methods, ideas, or system contributions that make the work stand out.

preference learning
Bradley-Terry model
conditional preference distribution
sample efficiency
pairwise comparisons