🤖 AI Summary
To address bias in reward modeling that arises from highly heterogeneous human preferences during large language model (LLM) training, this paper proposes a personalized reward modeling framework. It represents the preferences of an individual or group as a linear combination of shared, disentangled general reward features, enabling zero-shot adaptation to new users in feature space. The method has two core components: disentangled learning of reward features and a linear personalization mechanism; it is trained end-to-end on offline preference data and can be evaluated jointly with LLMs. Experiments show that the framework significantly outperforms both non-adaptive and in-context personalized baselines in high-disagreement settings, while in low-disagreement settings it matches their performance with a simpler architecture and more stable training, balancing generality and personalization.
📝 Abstract
Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, such as the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model as well as adaptive counterparts, including models that perform in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
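The personalisation mechanism described above can be sketched in a toy form: a user's reward is a linear combination r_u(x) = w_u · φ(x) of shared reward features φ, and only the per-user weights w_u are fit from that user's pairwise comparisons. Everything below is an illustrative assumption, not the paper's implementation: the feature dimension, the random stand-in for the learned feature extractor, the simulated user, and the Bradley-Terry fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # number of shared reward features (assumed)

# Stand-in for the learned feature extractor phi: in the paper this
# would be a trained model; here, fixed random vectors per response.
candidates = ["short", "detailed", "formal", "casual"]
phi = {c: rng.normal(size=D) for c in candidates}

# Hypothetical user with latent preference weights, used only to
# simulate the handful of comparisons this user provides.
true_w = rng.normal(size=D)
pairs = []
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3), (1, 2)]:
    a, b = candidates[i], candidates[j]
    pairs.append((a, b) if true_w @ phi[a] >= true_w @ phi[b] else (b, a))

def fit_user_weights(pairs, phi, dim, lr=0.1, steps=5000):
    """Gradient ascent on the Bradley-Terry log-likelihood.

    Only the per-user weights are fit; the shared features phi stay
    frozen, so adapting to a new user touches no model parameters.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for winner, loser in pairs:
            d = phi[winner] - phi[loser]
            p = 1.0 / (1.0 + np.exp(-w @ d))  # P(winner preferred)
            grad += (1.0 - p) * d             # d/dw log sigma(w . d)
        w += lr * grad / len(pairs)
    return w

w = fit_user_weights(pairs, phi, D)
ranked = sorted(candidates, key=lambda c: w @ phi[c], reverse=True)
```

Because φ is frozen and only a low-dimensional weight vector is estimated, a new user can be accommodated from a few comparisons without retraining the shared features, which is the sense in which adaptation here is fast even for users absent from the training data.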