🤖 AI Summary
This work studies how exponential reward weighting in KL-regularized policy optimization affects neural reward models: because the learned reward is exponentiated to define the deployed policy, downstream value is sensitive to errors in reward-tilted regions, and this sensitivity feeds back into feature learning. Under a Gaussian single-index model, the authors analyze a two-stage neural reward model that first recovers the latent direction from reward-weighted samples and then fits the readout layer by weighted ridge regression. The analysis brings the coupling between reward modeling and policy optimization into single-index theory. Combining Hermite expansions with neural feature-learning theory, they show that when the feature-learning temperature β₁ exceeds a dimension-free O(1) threshold, a constant fraction of neurons recovers the latent direction. They further establish upper bounds on the value gap of the tilted policy, characterizing the trade-off, governed by the deployment temperature β₂, between the gain from lowering the temperature and the learning cost amplified by exponential weighting.
📝 Abstract
Reward modeling is not only a prediction problem: in KL-regularized policy optimization, the learned reward is exponentiated to define the deployed policy, so downstream value depends on errors in reward-tilted regions. We study this feedback in a Gaussian single-index model with $r^*(x) = \sigma^*(\langle \theta^*, x\rangle)$ and $x \sim \mathcal{N}(0, I_d)$. We analyze a two-stage neural reward model that first learns the hidden direction $\theta^*$ from reward-weighted samples and then fits the readout layer by weighted ridge regression. Exponential reward weighting changes the Hermite signal available to the first layer; for any feature-learning temperature $\beta_1$ above a dimension-free $O(1)$ threshold, a constant fraction of neurons recover the hidden direction, with weak-recovery complexity governed by the generative exponent. After feature recovery, we derive tilted-policy value-gap bounds for an idealized label-weighted fit with weights $e^{y/\beta_2}$ and a more practical surrogate-weighted fit with weights $e^{r_{a_0}(x)/\beta_2}$. Keeping the $\beta_2$-dependence explicit yields an admissible set of deployment temperatures, balancing the gain from lowering $\beta_2$ against the learning cost amplified by exponential weighting; in the surrogate-weighted case, proxy-dependent factors shrink this admissible set.
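To make the second stage concrete, the following is a minimal sketch of the idealized label-weighted readout fit, in which each sample is weighted by $e^{y/\beta_2}$ in a ridge regression. It is an illustration under stated assumptions, not the authors' implementation: the function names `relu_features` and `weighted_ridge_readout`, the tanh link, the ReLU neurons with random signs and biases, the ridge parameter `lam`, and the mean-one normalization of the weights are all hypothetical choices. The true direction is also substituted for a learned one, so the first stage is skipped entirely.

```python
import numpy as np

def relu_features(x, theta_hat, signs, biases):
    """Hypothetical second-stage features: ReLU neurons sharing the direction theta_hat."""
    z = x @ theta_hat                                    # projection onto the direction, shape (n,)
    return np.maximum(signs * z[:, None] + biases, 0.0)  # feature matrix, shape (n, m)

def weighted_ridge_readout(phi, y, beta2, lam=1e-2):
    """Minimize sum_i w_i (phi_i @ a - y_i)^2 + lam * ||a||^2, with w_i proportional to exp(y_i / beta2)."""
    w = np.exp((y - y.max()) / beta2)  # shift by max(y) to avoid overflow; this only rescales the weights
    w = w / w.mean()                   # assumption: normalize to mean 1 so lam is comparable across beta2
    gram = phi.T @ (w[:, None] * phi) + lam * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ (w * y))

rng = np.random.default_rng(0)
d, n, m = 20, 2000, 8
theta_star = np.zeros(d)
theta_star[0] = 1.0
x = rng.standard_normal((n, d))                             # x ~ N(0, I_d)
y = np.tanh(x @ theta_star) + 0.1 * rng.standard_normal(n)  # tanh link and noise level are assumptions

signs = rng.choice([-1.0, 1.0], size=m)
biases = rng.standard_normal(m)
phi = relu_features(x, theta_star, signs, biases)  # theta_star stands in for a learned direction

a_tilted = weighted_ridge_readout(phi, y, beta2=0.5)      # emphasizes high-reward samples
a_uniform = weighted_ridge_readout(phi, y, beta2=np.inf)  # infinite temperature: all weights equal 1
```

Lowering `beta2` concentrates the weights on samples with large labels, so fewer samples effectively contribute to the fit. This illustrates the learning cost that the abstract weighs against the gain from a lower deployment temperature. Setting `beta2=np.inf` recovers plain ridge regression.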