How Neural Reward Models Learn Features for Policy Optimization: A Single-Index Analysis

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how the exponential reward weighting used in KL-regularized policy optimization interacts with the way a neural reward model learns features. Because the deployed policy exponentiates the learned reward, its value is sensitive to reward errors in reward-tilted regions. Under a Gaussian single-index model, the authors analyze a two-stage neural reward model: the first layer recovers the latent direction from reward-weighted samples, and the readout layer is then fit by weighted ridge regression. The analysis brings the coupling between reward modeling and policy optimization into single-index theory. Using Hermite expansions, they show that for any feature-learning temperature β₁ above a dimension-free O(1) threshold, a constant fraction of neurons recover the latent direction, with weak-recovery complexity governed by the generative exponent. They then bound the value gap of the tilted policy for both label-weighted and surrogate-weighted fits, making explicit how the deployment temperature β₂ trades the gain from a lower temperature against the learning cost amplified by exponential weighting.
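
For orientation, the "tilted policy" referred to here is presumably the standard closed-form solution of a KL-regularized objective. The notation below is an assumption made for illustration, with $\pi_0$ a reference distribution and $\beta$ a temperature; the paper's own definitions may differ in detail.

$$
\pi_\beta \;=\; \arg\max_{\pi}\; \mathbb{E}_{x\sim\pi}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big),
\qquad
\pi_\beta(x) \;=\; \frac{\pi_0(x)\, e^{r(x)/\beta}}{\mathbb{E}_{x'\sim\pi_0}\big[e^{r(x')/\beta}\big]}.
$$

Replacing $r$ with a learned reward $\hat r$ changes the exponent, so errors of $\hat r$ matter most where $e^{\hat r(x)/\beta}$ is large, and lowering $\beta$ concentrates the policy further on those regions.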

📝 Abstract

Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = σ^*(\langle θ^*, x\rangle)$ and $x \sim N(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $θ^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $β_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/β_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/β_2}$. Keeping the $β_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $β_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
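
The two-stage pipeline described in the abstract can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the first stage is replaced by a simple reward-weighted mean estimator rather than gradient-based training of neurons, the link function is taken to be `tanh`, and all function names and hyperparameter values (`run_sketch`, `width`, `lam`, the noise level) are hypothetical choices made for the example. Only the label-weighted variant with weights $e^{y/β_2}$ is shown.

```python
import numpy as np


def sample_single_index(n, theta_star, link, noise_std, rng):
    """Draw x ~ N(0, I_d) and noisy rewards y = link(<theta*, x>) + noise."""
    d = theta_star.shape[0]
    x = rng.standard_normal((n, d))
    y = link(x @ theta_star) + noise_std * rng.standard_normal(n)
    return x, y


def recover_direction(x, y, beta1):
    """Stage 1 stand-in (assumption): normalized reward-weighted mean of x."""
    # Subtracting y.max() only rescales the weights and avoids overflow.
    w = np.exp((y - y.max()) / beta1)
    m = (w[:, None] * x).mean(axis=0)
    return m / np.linalg.norm(m)


def relu_features(z, signs, biases):
    """ReLU features relu(s_j * z + b_j) of the 1-D projection, plus a constant."""
    h = np.maximum(signs * z[:, None] + biases, 0.0)
    return np.hstack([h, np.ones((z.shape[0], 1))])


def weighted_ridge(phi, y, w, lam):
    """Minimize sum_i w_i (y_i - phi_i a)^2 / sum_i w_i + lam * ||a||^2."""
    w = w / w.sum()
    gram = phi.T @ (w[:, None] * phi)
    rhs = phi.T @ (w * y)
    return np.linalg.solve(gram + lam * np.eye(phi.shape[1]), rhs)


def run_sketch(seed=0, d=20, n=20000, width=50, beta1=1.0, beta2=1.0, lam=1e-3):
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)
    link = np.tanh  # illustrative link function, not taken from the paper
    x, y = sample_single_index(n, theta_star, link, 0.1, rng)

    # Stage 1: estimate the hidden direction from reward-weighted samples.
    theta_hat = recover_direction(x, y, beta1)
    alignment = float(abs(theta_hat @ theta_star))

    # Stage 2: fit the readout with label weights exp(y / beta2).
    signs = rng.choice([-1.0, 1.0], size=width)
    biases = rng.standard_normal(width)
    phi = relu_features(x @ theta_hat, signs, biases)
    readout = weighted_ridge(phi, y, np.exp((y - y.max()) / beta2), lam)

    # Evaluate against the noiseless reward on fresh Gaussian inputs.
    x_test, _ = sample_single_index(5000, theta_star, link, 0.0, rng)
    pred = relu_features(x_test @ theta_hat, signs, biases) @ readout
    mse = float(np.mean((pred - link(x_test @ theta_star)) ** 2))
    return alignment, mse
```

Calling `run_sketch()` returns the alignment $|\langle \hat θ, θ^* \rangle|$ and the test mean-squared error of the fitted reward. Lowering `beta2` makes the weights more uneven, which is one way to see informally how exponential weighting can raise the effective learning cost.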

Problem

Research questions and friction points this paper is trying to address.

reward modeling

policy optimization

feature learning

exponential weighting

value gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

reward modeling

single-index model

feature learning