🤖 AI Summary
To address bias in reward modeling that arises from highly heterogeneous human preferences during large language model (LLM) training, this paper proposes a personalized reward modeling framework. It represents the preferences of an individual or group as a linear combination of shared, disentangled general reward features, enabling zero-shot adaptation to new users in feature space. The method has two core components: disentangled learning of reward features and a linear personalization mechanism; it is trained end-to-end on offline preference data and can be evaluated jointly with LLMs. Experiments show that the framework significantly outperforms both non-adaptive and in-context personalized baselines in high-disagreement settings, while in low-disagreement settings it matches their performance with a simpler architecture and more stable training, balancing generality and personalization.
📝 Abstract
Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, such as the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model as well as adaptive counterparts, including models that perform in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
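The personalisation mechanism described above can be sketched in a toy form: a user's reward is a linear combination r_u(x) = w_u · φ(x) of shared reward features φ, and only the per-user weights w_u are fit from that user's pairwise comparisons. Everything below is an illustrative assumption, not the paper's implementation: the feature dimension, the random stand-in for the learned feature extractor, the simulated user, and the Bradley-Terry fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # number of shared reward features (assumed)

# Stand-in for the learned feature extractor phi: in the paper this
# would be a trained model; here, fixed random vectors per response.
candidates = ["short", "detailed", "formal", "casual"]
phi = {c: rng.normal(size=D) for c in candidates}

# Hypothetical user with latent preference weights, used only to
# simulate the handful of comparisons this user provides.
true_w = rng.normal(size=D)
pairs = []
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3), (1, 2)]:
    a, b = candidates[i], candidates[j]
    pairs.append((a, b) if true_w @ phi[a] >= true_w @ phi[b] else (b, a))

def fit_user_weights(pairs, phi, dim, lr=0.1, steps=5000):
    """Gradient ascent on the Bradley-Terry log-likelihood.

    Only the per-user weights are fit; the shared features phi stay
    frozen, so adapting to a new user touches no model parameters.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for winner, loser in pairs:
            d = phi[winner] - phi[loser]
            p = 1.0 / (1.0 + np.exp(-w @ d))  # P(winner preferred)
            grad += (1.0 - p) * d             # d/dw log sigma(w . d)
        w += lr * grad / len(pairs)
    return w

w = fit_user_weights(pairs, phi, D)
ranked = sorted(candidates, key=lambda c: w @ phi[c], reverse=True)
```

Because φ is frozen and only a low-dimensional weight vector is estimated, a new user can be accommodated from a few comparisons without retraining the shared features, which is the sense in which adaptation here is fast even for users absent from the training data.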