Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In RLHF, reward models (RMs) commonly exhibit length bias: they systematically assign higher scores to longer responses irrespective of quality, which policies exploit through reward hacking rather than genuine alignment with human preferences. Existing debiasing approaches either employ coarse-grained modeling or impose overly restrictive linear assumptions, and so fail to capture the inherently nonlinear length-reward dependency. To address this, we propose FiMi-RM, the first framework that explicitly models the complex nonlinear relationship between response length and RM scores. FiMi-RM operates in three stages: (1) standard RM training, (2) lightweight MLP-based fitting of the length-reward bias, and (3) injection-style reward recalibration using the fitted bias. The resulting debiased RM is both interpretable and adaptive. Experiments demonstrate that FiMi-RM substantially reduces the length-reward distribution imbalance, improves length-controlled win rates, and curbs redundant outputs, all while preserving the RM's overall preference-modeling capability.
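To make the pipeline concrete, here is a minimal sketch of stages 2 and 3, assuming the length bias is modeled by a small MLP over response length and then subtracted from the raw RM score. The paper's actual fitting model and injection scheme may differ; the names LengthBiasModel and debiased_reward are hypothetical.

```python
import torch
import torch.nn as nn

class LengthBiasModel(nn.Module):
    """Lightweight MLP mapping response length to an estimated reward bias (stage 2)."""
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, length: torch.Tensor) -> torch.Tensor:
        # Scale raw token counts so the MLP input stays well-conditioned (assumed choice).
        return self.net(length.float().unsqueeze(-1) / 1024.0).squeeze(-1)

def debiased_reward(raw_reward: torch.Tensor,
                    response_length: torch.Tensor,
                    bias_model: LengthBiasModel) -> torch.Tensor:
    """Stage 3 sketch: remove the fitted length component from the raw RM score."""
    with torch.no_grad():
        bias = bias_model(response_length)
    return raw_reward - bias
```

In an RLHF loop, the policy would then be optimized against this recalibrated score rather than the raw RM output.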

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations: they either mitigate bias without characterizing its form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects the underlying bias pattern. Our approach consists of three stages: first, we train a standard reward model, which inherently contains length bias; next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward; finally, we incorporate this learned relation into the reward model to debias it. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves the length-controlled win rate and reduces verbosity without compromising its performance.
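As a rough illustration of the second stage described in the abstract, the hypothetical loop below regresses raw reward-model scores on response length so that the lightweight model captures the non-linear length-reward trend. The objective (plain MSE regression) and the name fit_length_bias are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

def fit_length_bias(bias_model: nn.Module,
                    lengths: torch.Tensor,      # response lengths from preference data
                    raw_rewards: torch.Tensor,  # scores assigned by the biased reward model
                    epochs: int = 200,
                    lr: float = 1e-3) -> nn.Module:
    """Fit a lightweight model to the length-reward trend (stage 2 sketch)."""
    optimizer = torch.optim.Adam(bias_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        predicted = bias_model(lengths)         # reward explained by length alone
        loss = loss_fn(predicted, raw_rewards)  # simple regression objective (assumption)
        loss.backward()
        optimizer.step()
    return bias_model
```

The gap between raw_rewards and this fitted trend is, under this sketch, the quality signal a debiased reward model should retain.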
Problem

Research questions and friction points this paper is trying to address.

Mitigates length bias in RLHF reward models
Addresses non-linear length-reward relationship
Improves alignment without compromising performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomously learns and corrects bias patterns
Deploys lightweight fitting model for non-linear relation
Incorporates learned relation to debias reward model
Authors

Kangwen Zhao, University of Science and Technology of China
Jianfeng Cai, University of Science and Technology of China
Jinhua Zhu, University of Science and Technology of China (Machine Learning)
Ruopei Sun, University of Science and Technology of China
Dongyun Xue, University of Science and Technology of China
Wengang Zhou, Professor, EEIS Department, University of Science and Technology of China (Multimedia Retrieval, Computer Vision, Computer Game)
Li Li, University of Science and Technology of China
Houqiang Li, Professor, Department of Electronic Engineering and Information Science, University of Science and Technology of China (Multimedia Search, Image/Video Analysis, Image/Video Coding)