Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

📅 2026-05-10
🤖 AI Summary
This work addresses a limitation of existing preference fine-tuning methods: they optimize only average win rates and so can miss systematic failures in the tail or in critical subgroups (safety-critical scenarios, specific prompts, or annotator-defined categories), which undermines robustness. The authors propose risk-sensitive preference games, in which players optimize convex risk measures of their preference loss. Although risk sensitivity generally breaks the zero-sum structure, translation invariance of many risk measures preserves monotonicity, so sample-efficient self-play methods still converge quickly. To handle statistical bias in risk estimation, they introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium. The theoretical analysis also covers algorithmic stability and offline sample complexity bounds that scale with risk. Empirically, the risk-adjusted policies are robust across data strata, stable across risk choices, and match or exceed risk-neutral performance, with the bias-corrected algorithm especially effective in low-sample regimes.
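To make the gap between average and tail performance concrete, here is a minimal sketch using CVaR, one standard convex risk measure. The loss values, the function names `cvar` and `mean`, and the convention that level `alpha` averages the worst `(1 - alpha)` fraction of samples are all illustrative assumptions, not taken from the paper.

```python
import math


def mean(xs):
    """Plain average, i.e. the risk-neutral objective."""
    return sum(xs) / len(xs)


def cvar(losses, alpha):
    """Empirical CVaR: mean of the worst ceil((1 - alpha) * n) losses.

    Assumed convention: alpha in [0, 1); alpha = 0 recovers the mean.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    n = len(losses)
    # round() guards against float noise such as (1 - 0.8) * 10 = 1.9999999999999996
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


# Hypothetical per-prompt preference losses for two policies.
policy_a = [0.4] * 10              # uniformly mediocre
policy_b = [0.3] * 8 + [0.8] * 2   # better on most prompts, fails on a small stratum

for name, losses in [("A", policy_a), ("B", policy_b)]:
    print(f"policy {name}: mean={mean(losses):.2f}  CVaR_0.8={cvar(losses, 0.8):.2f}")

# Translation invariance: adding a constant c to every loss shifts CVaR by exactly c.
shifted = [x + 0.1 for x in policy_b]
print(abs(cvar(shifted, 0.8) - (cvar(policy_b, 0.8) + 0.1)) < 1e-9)
```

Both toy policies have the same mean loss, so a risk-neutral objective cannot distinguish them, while CVaR at level 0.8 flags the one that fails on a small stratum.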
📝 Abstract
A growing line of work reframes preference-based fine-tuning of large language models game-theoretically: Nash Learning from Human Feedback (NLHF) recasts the problem as a zero-sum game over policies. However, optimization is over expected pairwise payoffs, thereby conflating policies with similar win rates but different tail behavior. As such, these methods are agnostic to where in the data distribution they succeed or fail: strong average performance can mask systematic failure across prompts, annotators, or safety-critical strata. We introduce risk-sensitive preference games, in which players optimize convex risk measures of their preference loss, exploiting structure in preference uncertainty. While risk-sensitivity generally breaks the zero-sum structure, we show that translation invariance of many risk metrics ensures that we retain monotonicity, yielding fast convergence of sample-efficient self-play methods. Furthermore, we establish algorithmic stability and offline sample complexity bounds that scale with risk, requiring simultaneous control of structural bias from nonlinear risk transformations, statistical bias in risk estimation, and concentration tailored to the risk-sensitive setting. To address statistical bias, we introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium and is especially effective in low-sample regimes. Empirically, risk-adjusted policies are robust across data strata, stable across risk choices, and match or exceed risk-neutral performance, thereby achieving robustness without a performance tax.
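As a rough illustration of why monotonicity and extragradient updates matter, the sketch below compares plain simultaneous gradient descent-ascent with the classical extragradient step on the toy bilinear game min_x max_y x·y, whose gradient operator is monotone. This is a single-timescale, risk-neutral textbook example under assumed step size and iteration count; it is not the paper's two-timescale algorithm with bias correction, and the function names are hypothetical.

```python
import math


def gda_step(x, y, eta):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    return x - eta * y, y + eta * x


def extragradient_step(x, y, eta):
    """Extragradient: take a look-ahead step, then update using the look-ahead gradient."""
    x_half, y_half = x - eta * y, y + eta * x      # extrapolation point
    return x - eta * y_half, y + eta * x_half      # update from the original point


def run(step, x0=1.0, y0=1.0, eta=0.1, iters=2000):
    """Iterate `step` and return the final distance to the equilibrium (0, 0)."""
    x, y = x0, y0
    for _ in range(iters):
        x, y = step(x, y, eta)
    return math.hypot(x, y)


print("GDA distance to equilibrium:          ", run(gda_step))
print("Extragradient distance to equilibrium:", run(extragradient_step))
```

On this game each descent-ascent step multiplies the squared distance to the equilibrium by 1 + eta², so the iterates spiral outward, while each extragradient step multiplies it by 1 - eta² + eta⁴, which is below 1 for any step size in (0, 1).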
Problem

Research questions and friction points this paper is trying to address.

preference learning
risk sensitivity
strategic interaction
uncertainty
robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

risk-sensitive learning
preference games
convex risk measures
Stackelberg equilibrium
sample complexity