Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In RLHF, reward models (RMs) commonly exhibit length bias: they systematically assign higher scores to longer responses irrespective of quality, which policies exploit through reward hacking rather than genuine alignment with human preferences. Existing debiasing approaches either employ coarse-grained modeling or impose overly restrictive linear assumptions, and so fail to capture the inherently nonlinear length-reward dependency. To address this, we propose FiMi-RM, the first framework that explicitly models the complex nonlinear relationship between response length and RM scores. FiMi-RM operates in three stages: (1) standard RM training, (2) lightweight MLP-based fitting of the length-reward bias, and (3) injection-style reward recalibration using the fitted bias. The resulting debiased RM is both interpretable and adaptive. Experiments demonstrate that FiMi-RM substantially reduces the length-reward distribution imbalance, improves length-controlled win rates, and curbs redundant outputs, all while preserving the RM's overall preference-modeling capability.
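To make the pipeline concrete, here is a minimal sketch of stages 2 and 3, assuming the length bias is modeled by a small MLP over response length and then subtracted from the raw RM score. The paper's actual fitting model and injection scheme may differ; the names LengthBiasModel and debiased_reward are hypothetical.

```python
import torch
import torch.nn as nn

class LengthBiasModel(nn.Module):
    """Lightweight MLP mapping response length to an estimated reward bias (stage 2)."""
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, length: torch.Tensor) -> torch.Tensor:
        # Scale raw token counts so the MLP input stays well-conditioned (assumed choice).
        return self.net(length.float().unsqueeze(-1) / 1024.0).squeeze(-1)

def debiased_reward(raw_reward: torch.Tensor,
                    response_length: torch.Tensor,
                    bias_model: LengthBiasModel) -> torch.Tensor:
    """Stage 3 sketch: remove the fitted length component from the raw RM score."""
    with torch.no_grad():
        bias = bias_model(response_length)
    return raw_reward - bias
```

In an RLHF loop, the policy would then be optimized against this recalibrated score rather than the raw RM output.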

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations: they either mitigate bias without characterizing its form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects the underlying bias pattern. Our approach consists of three stages: first, we train a standard reward model, which inherently contains length bias; next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward; finally, we incorporate this learned relation into the reward model to debias it. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves the length-controlled win rate and reduces verbosity without compromising its performance.
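As a rough illustration of the second stage described in the abstract, the hypothetical loop below regresses raw reward-model scores on response length so that the lightweight model captures the non-linear length-reward trend. The objective (plain MSE regression) and the name fit_length_bias are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

def fit_length_bias(bias_model: nn.Module,
                    lengths: torch.Tensor,      # response lengths from preference data
                    raw_rewards: torch.Tensor,  # scores assigned by the biased reward model
                    epochs: int = 200,
                    lr: float = 1e-3) -> nn.Module:
    """Fit a lightweight model to the length-reward trend (stage 2 sketch)."""
    optimizer = torch.optim.Adam(bias_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted = bias_model(lengths)         # reward explained by length alone
        loss = loss_fn(predicted, raw_rewards)  # simple regression objective (assumption)
        loss.backward()
        optimizer.step()
    return bias_model
```

The gap between raw_rewards and this fitted trend is, under this sketch, the quality signal a debiased reward model should retain.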
Problem

Research questions and friction points this paper is trying to address.

Mitigates length bias in RLHF reward models
Addresses non-linear length-reward relationship
Improves alignment without compromising performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomously learns and corrects bias patterns
Deploys lightweight fitting model for non-linear relation
Incorporates learned relation to debias reward model
Authors

Kangwen Zhao, University of Science and Technology of China
Jianfeng Cai, University of Science and Technology of China
Jinhua Zhu, University of Science and Technology of China (Machine Learning)
Ruopei Sun, University of Science and Technology of China
Dongyun Xue, University of Science and Technology of China
Wengang Zhou, Professor, EEIS Department, University of Science and Technology of China (Multimedia Retrieval, Computer Vision, Computer Game)
Li Li, University of Science and Technology of China
Houqiang Li, Professor, Department of Electronic Engineering and Information Science, University of Science and Technology of China (Multimedia Search, Image/Video Analysis, Image/Video Coding)