🤖 AI Summary
This work addresses a limitation of single-dimension reward-model debiasing methods: they can shift optimization pressure onto correlated proxies rather than remove it, a failure termed “bias substitution.” We formalize three mitigation outcomes (successful mitigation, bias substitution, and overcorrection) and identify a measurement-optimization gap between the audit distribution on which a mitigation is scored and the policy-induced distribution on which it is optimized. Combining preference learning, GRPO reinforcement learning, length penalties, best-of-N sampling, and multidimensional bias tracking within an RLHF setting, we show empirically that a length penalty shortens responses but pushes the policy toward overconfidence while factual accuracy falls, and that a published length-debiasing operator reintroduces bias under best-of-N selection on three of four reward models. These findings motivate evaluating mitigations on policy-induced distributions while tracking multiple biases at once.
📝 Abstract
Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap: mitigations are evaluated on audit distributions, whereas policy training optimizes against policy-induced distributions. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show that a published length-debiasing operator zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four state-of-the-art reward models, and we identify a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.
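
The gap between audit-distribution scoring and best-of-N selection can be illustrated with a small synthetic sketch. This is a toy construction, not the paper's experiment or the published operator: the reward generator `sample_response`, its coefficients, the linear residualization in `debiased`, and the choice of N = 16 are all assumptions made for illustration. The toy reward has noise that grows with length, so removing the linear length trend zeroes the audit correlation, yet the highest-scoring candidate in each group still tends to be a long one.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample_response(rng):
    """Hypothetical response: (length, raw reward).

    Assumption: reward rises linearly with length and its noise scale
    is proportional to length (heteroscedastic).
    """
    length = rng.uniform(50, 500)
    raw = 0.004 * length + rng.gauss(0.0, 0.002 * length)
    return length, raw


def fit_slope(lengths, rewards):
    """Ordinary least-squares slope of reward on length."""
    ml, mr = statistics.fmean(lengths), statistics.fmean(rewards)
    cov = sum((l - ml) * (r - mr) for l, r in zip(lengths, rewards))
    var = sum((l - ml) ** 2 for l in lengths)
    return cov / var


rng = random.Random(0)

# Audit distribution: fit the linear length trend and subtract it.
audit = [sample_response(rng) for _ in range(5000)]
lengths, rewards = zip(*audit)
slope = fit_slope(lengths, rewards)


def debiased(length, raw):
    """Toy debiasing operator: remove the fitted linear length term."""
    return raw - slope * length


# On the audit set the debiased reward is uncorrelated with length
# by construction (OLS residuals are orthogonal to the regressor).
audit_corr = pearson(lengths, [debiased(l, r) for l, r in audit])

# Selection distribution: best-of-N under the debiased reward.
N = 16
selected_lengths = []
for _ in range(2000):
    candidates = [sample_response(rng) for _ in range(N)]
    best = max(candidates, key=lambda c: debiased(*c))
    selected_lengths.append(best[0])

audit_mean_length = statistics.fmean(lengths)
bon_mean_length = statistics.fmean(selected_lengths)

print(f"audit correlation (debiased reward vs. length): {audit_corr:.2e}")
print(f"mean length, audit samples:    {audit_mean_length:.1f}")
print(f"mean length, best-of-{N} picks: {bon_mean_length:.1f}")
```

In this sketch the audit metric reports no length dependence, while the selected responses are longer on average than the candidates they were drawn from. Which mechanism produces the reintroduced bias in real reward models is not specified in the abstract; the heteroscedastic noise here is only one way such a gap can arise.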