Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reward modeling (RM) suffers from inductive biases—such as response length, sycophancy, and formatting preferences—due to low-quality preference data, leading to reward hacking and overfitting. Existing debiasing methods are often limited to single biases or linear assumptions, failing to address complex, coupled nonlinear biases. This paper introduces the information bottleneck principle to RM training for the first time, proposing an information-theoretic debiasing framework that jointly maximizes mutual information between RM outputs and human preferences while minimizing mutual information with multiple bias attributes. The method is end-to-end differentiable, architecture-agnostic (compatible with LLaMA, Qwen, etc.), and enables simultaneous suppression of nonlinear, multi-faceted biases. Experiments demonstrate significant reduction in bias leakage across three canonical bias types. After RLHF fine-tuning, the debiased RM yields a 3.2% average win-rate improvement on AlpacaEval and ArenaHard, with enhanced generalization.

📝 Abstract
Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually preferred by humans but also contain more words, making response length an almost inevitable inductive bias. The few prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of the preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR on three types of inductive biases: response length, sycophancy, and format. We find that DIR not only effectively mitigates the targeted inductive biases but also improves RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.
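The IB-inspired objective described in the abstract can be sketched as follows (the notation here is illustrative, not necessarily the paper's own symbols):

```latex
\min_{\theta}\; -\,I\big(r_\theta(x);\, y\big)
\;+\; \beta \sum_{k=1}^{K} I\big(r_\theta(x);\, b_k\big)
```

where \(r_\theta(x)\) is the RM score, \(y\) the human preference label over a response pair, \(b_1,\dots,b_K\) the bias attributes (e.g., response length, sycophancy, format), and \(\beta\) a coefficient trading off preference fitting against bias leakage.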
Problem

Research questions and friction points this paper is trying to address.

Addresses inductive bias in reward models for RLHF
Mitigates complex biases beyond simple linear correlations
Enhances generalization and RLHF performance across benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic debiasing via mutual information optimization
Minimizes mutual information between reward outputs and biased attributes
Handles complex non-linear biases like length, sycophancy, and format
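As a rough illustration of the idea behind these bullets, the sketch below combines a standard Bradley-Terry preference loss with a crude histogram-based MI penalty between reward scores and bias attributes. All function names are hypothetical, and the plug-in MI estimator stands in for whatever variational MI bound the paper actually uses; this is a toy sketch of the objective shape, not the authors' implementation.

```python
import numpy as np

def bt_preference_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected).
    # Maximizing preference likelihood is a proxy for maximizing I(scores; preferences).
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

def binned_mutual_info(x, y, bins=8):
    # Crude plug-in MI estimate I(X;Y) in nats via a 2-D histogram.
    # Captures non-linear dependence, unlike a Pearson coefficient.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def dir_style_objective(r_chosen, r_rejected, bias_attrs, beta=0.1):
    # Fit preferences while penalizing MI between scores and each bias attribute
    # (e.g., response length, a sycophancy score, a format indicator).
    loss = bt_preference_loss(r_chosen, r_rejected)
    scores = np.concatenate([r_chosen, r_rejected])
    for attr in bias_attrs:  # one penalty per bias attribute, debiased jointly
        loss += beta * binned_mutual_info(scores, attr)
    return loss
```

In a real training loop the MI penalty would be a differentiable bound estimated on each minibatch; the histogram estimator here only serves to make the objective's two opposing terms concrete.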
Zhuo Li
Qwen Large Model Application Team, Alibaba
Pengyu Cheng
Alibaba Group
Zhechao Yu
Qwen Large Model Application Team, Alibaba
Feifei Tong
Qwen Large Model Application Team, Alibaba
Anningzhe Gao
Shenzhen Research Institute of Big Data
Tsung-Hui Chang
The Chinese University of Hong Kong, Shenzhen Research Institute of Big Data
Xiang Wan
Shenzhen Research Institute of Big Data
Erchao Zhao
Qwen Large Model Application Team, Alibaba
Xiaoxi Jiang
Qwen Large Model Application Team, Alibaba
Guanjun Jiang
Qwen Large Model Application Team, Alibaba