Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

📅 2025-12-26
🤖 AI Summary
This paper addresses reward bias and policy misalignment in large language models (LLMs) that arise when the link function between preferences and rewards is misspecified under unknown preference noise. The authors propose a semiparametric single-index framework that drops the restrictive assumption of a fixed link function. Methodologically, preference alignment is formulated as a single-index model with an unrestricted link function, and the paper introduces policy learners based on profiling the link function, orthogonalizing it, and using a link-agnostic bipartite ranking loss. Combining *f*-divergence-constrained reward maximization with first-order neural network optimization, the approach optimizes the policy directly rather than fitting a reward model. Theoretically, the paper establishes finite-sample policy error bounds that depend on the complexity of the index function class, giving robustness to the unknown noise distribution and scale. Empirical evaluation on real-world preference data demonstrates efficient and robust policy alignment.

📝 Abstract
Aligning large language models to preference data is commonly implemented by assuming a known link function between the distribution of observed preferences and the unobserved rewards (e.g., a logistic link as in Bradley-Terry). If the link is wrong, however, inferred rewards can be biased and policies can be misaligned. We study policy alignment to preferences under an unknown and unrestricted link. We consider an $f$-divergence-constrained reward maximization problem and show that realizability of the solution in a policy class implies a semiparametric single-index binary choice model, where a scalar-valued index determined by a policy captures the dependence on demonstrations and the rest of the preference distribution is an unrestricted function thereof. Rather than estimating identifiable finite-dimensional structural parameters in the index, as in econometrics, we focus on policy learning: we target the error relative to the optimal policy and allow unidentifiable and nonparametric indices. We develop a variety of policy learners based on profiling the link function, orthogonalizing the link function, and using link-agnostic bipartite ranking objectives. We analyze these and provide finite-sample policy error bounds that depend on generic functional complexity measures of the index class. We further consider practical implementations using first-order optimization suited to neural networks and batched data. The resulting methods are robust to unknown preference noise distribution and scale, while preserving the direct optimization of policies without explicitly fitting rewards.
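To make the single-index structure concrete, the following is a hedged sketch for the familiar KL-constrained special case only. The symbols $\pi_{\mathrm{ref}}$, $\beta$, $t_\pi$, $\Psi$, and $\sigma$ are illustrative notation assumed here, not taken from the abstract, and the index for a general $f$-divergence may take a different form.

```latex
% Assumed notation: x is a prompt, (y_1, y_0) a pair of responses,
% Y = 1 if y_1 is preferred, \pi_{\mathrm{ref}} a reference policy.
% Index induced by a policy \pi (KL special case, an assumption):
t_\pi(x, y_1, y_0)
  = \log\frac{\pi(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
  - \log\frac{\pi(y_0 \mid x)}{\pi_{\mathrm{ref}}(y_0 \mid x)}

% Single-index binary choice model with unknown link \Psi:
\Pr(Y = 1 \mid x, y_1, y_0) = \Psi\bigl(t_\pi(x, y_1, y_0)\bigr)

% A known logistic (Bradley-Terry) link is the special case
\Psi(t) = \sigma(\beta t), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}
```

In this sketch the scale $\beta$ and the shape of the noise distribution are both absorbed into $\Psi$, which is one way to read the abstract's claim of robustness to unknown noise distribution and scale.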
Problem

Research questions and friction points this paper is trying to address.

Aligns language models to preferences with unknown link functions
Develops robust policy learners for unknown preference noise distributions
Focuses on policy learning without explicit reward fitting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semiparametric single-index model for preference alignment
Profiling and orthogonalizing unknown link functions
Link-agnostic bipartite ranking objectives for policy learning
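
The idea behind a link-agnostic ranking objective can be illustrated with a minimal sketch. It is not the paper's estimator: the function names, the toy data, and the assumption that the unknown link is nondecreasing are all hypothetical choices made for illustration. The sketch compares an empirical pairwise (bipartite) ranking loss, which depends only on the ordering of index values, with a logistic loss, which depends on an assumed link and scale.

```python
import math


def pairwise_ranking_loss(index, label):
    """Fraction of (label=1, label=0) example pairs that the index orders
    wrongly, with ties counted as half an error (one minus empirical AUC)."""
    pos = [t for t, y in zip(index, label) if y == 1]
    neg = [t for t, y in zip(index, label) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each label")
    wrong = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wrong += 1.0
            elif p == n:
                wrong += 0.5
    return wrong / (len(pos) * len(neg))


def logistic_loss(index, label, scale=1.0):
    """Mean negative log-likelihood under an assumed logistic link
    sigma(scale * index); the scale is a hypothetical parameter."""
    total = 0.0
    for t, y in zip(index, label):
        margin = (2 * y - 1) * scale * t
        # Numerically stable log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(index)


# Toy data (hypothetical): index values and binary preference labels.
index = [-2.0, -0.5, 0.3, 1.0, 2.5]
label = [0, 1, 0, 1, 1]

# A strictly increasing warp of the index, standing in for an unknown link.
warped = [math.tanh(3.0 * t) for t in index]

rank_base = pairwise_ranking_loss(index, label)
rank_warped = pairwise_ranking_loss(warped, label)
logit_scale_1 = logistic_loss(index, label, scale=1.0)
logit_scale_5 = logistic_loss(index, label, scale=5.0)

print(rank_base, rank_warped)        # identical: only the ordering matters
print(logit_scale_1, logit_scale_5)  # differ: the assumed scale matters
```

Because the ranking loss here is a 0-1 count, it is not differentiable; training a neural policy with first-order methods would need a smooth surrogate, which this sketch does not attempt.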