Reward Models Inherit Value Biases from Pretraining

📅 2026-01-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Reward models (RMs) play a pivotal role in aligning large language models with human values, yet the extent to which they inherit value-laden biases from their pretrained base models remains unclear. This work presents the first systematic investigation demonstrating that, even when trained on identical preference data under identical fine-tuning protocols, RMs exhibit value preferences shaped by their underlying base models. Leveraging validated psycholinguistic corpora and the “Agency–Communion” (Big Two) framework of social values, combined with logits-difference modeling and ablation studies, the authors find that Llama-based RMs consistently favor agency-oriented values, whereas Gemma-based RMs show a marked inclination toward communion-oriented values. Critically, this divergence persists robustly across varied training configurations, indicating that such value orientations are already embedded during pretraining.
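As a rough illustration of the probing setup the summary describes, the sketch below scores a matched agency/communion response pair with a single open-weight RM loaded as a sequence classifier. The model name, prompt, and responses are hypothetical placeholders, not the paper's actual stimuli or protocol; a real replication would average over many validated word pairs from the psycholinguistic corpora.

```python
# Minimal sketch: probing an RM's agency/communion preference.
# RM_NAME and the text pair are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "some-org/llama-based-reward-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME).eval()

prompt = "Describe what matters most in a good colleague."
responses = {
    "agency": "Someone ambitious and decisive who drives results independently.",
    "communion": "Someone warm and supportive who puts the team first.",
}

scores = {}
with torch.no_grad():
    for label, response in responses.items():
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        # Many open-weight RMs expose the scalar reward as a single logit.
        scores[label] = rm(**inputs).logits[0, 0].item()

print(scores)  # A consistent gap across many such pairs would suggest a value bias.
```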

📝 Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
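The implicit RM the abstract derives from log-probability differences resembles the DPO-style implicit reward, log pi_tuned(y|x) - log pi_base(y|x), summed over the response tokens. The sketch below computes such a score under that assumption; the checkpoint names are hypothetical, and the paper's exact formulation (e.g., any length normalization) may differ.

```python
# Sketch: implicit reward from log-probability differences between an
# instruction-tuned model and its pretrained base. Checkpoint names are
# placeholders; this is an assumed DPO-style formulation, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "some-org/base-model"    # hypothetical pretrained checkpoint
TUNED_NAME = "some-org/tuned-model"  # hypothetical instruction-tuned checkpoint

tok = AutoTokenizer.from_pretrained(TUNED_NAME)
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED_NAME).eval()

def response_logprob(model, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `prompt`.

    Assumes the tokenization of `prompt` is a prefix of that of
    `prompt + response`, which holds for typical BPE tokenizers here.
    """
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i of the shifted log-probs scores input token i + 1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1 :].sum().item()

def implicit_reward(prompt: str, response: str) -> float:
    # log pi_tuned(y|x) - log pi_base(y|x), the DPO-style implicit reward.
    return response_logprob(tuned, prompt, response) - response_logprob(
        base, prompt, response
    )

# Comparing matched agency/communion responses with this score mirrors the
# paper's finding that the log-probability gap itself acts as a reward model.
print(implicit_reward("A good colleague is", " decisive and ambitious."))
print(implicit_reward("A good colleague is", " warm and supportive."))
```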
Problem

Research questions and friction points this paper is trying to address.

Reward Models
Value Biases
Pretraining
Human Alignment
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward models
value bias
pretraining
implicit reward
alignment