In-Context Reward Adaptation for Robust Preference Modeling

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Static reward models struggle to generalize to unseen human preference domains and are not robust to heterogeneous preferences. This work proposes a Transformer-based in-context reward adaptation framework that infers the reward structure of an unseen domain from a small set of preference demonstrations, enabling cross-domain adaptation without retraining. The authors show that a standard Transformer exhibits an asymptotic bias relative to the ground-truth reward, and that adding human response time as an auxiliary input signal lets the model adapt to previously unseen domains. The findings indicate that the method improves the generalization and robustness of reward models under preference distribution shift.

📝 Abstract

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task by characterizing its asymptotic bias relative to the ground truth, and that incorporating human response time as an auxiliary input signal enables the model to adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.
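To make the input format concrete, here is a minimal sketch of how a few preference demonstrations, each annotated with a response time, might be packed into a context for a sequence model, followed by one unlabeled query pair. The abstract does not specify the paper's tokenization, so everything here is an assumption for illustration: the function name `build_context`, the per-token layout, the log transform on response time, and the `is_query` flag are all hypothetical.

```python
import numpy as np

def build_context(demos, query_a, query_b):
    """Pack preference demonstrations and one query into a token matrix.

    Each demo is (feat_a, feat_b, choice, response_time), where choice is 1 if
    option a was preferred and 0 otherwise. Hypothetical per-token layout:
    [feat_a, feat_b, choice, log(1 + response_time), is_query].
    """
    rows = []
    for feat_a, feat_b, choice, rt in demos:
        extras = [float(choice), np.log1p(rt), 0.0]
        rows.append(np.concatenate([feat_a, feat_b, extras]))
    # The query's choice and response time are unknown, so both slots are
    # zeroed and the is_query flag is set to 1.
    rows.append(np.concatenate([query_a, query_b, [0.0, 0.0, 1.0]]))
    return np.stack(rows)

# Synthetic data only: 5 demonstrations over 4-dimensional option features.
rng = np.random.default_rng(0)
d = 4
demos = [
    (rng.normal(size=d), rng.normal(size=d),
     int(rng.integers(0, 2)), float(rng.uniform(0.3, 3.0)))
    for _ in range(5)
]
context = build_context(demos, rng.normal(size=d), rng.normal(size=d))
print(context.shape)  # (6, 11): 5 demos + 1 query, 2*4 features + 3 extras
```

A model consuming this matrix would read the demonstration rows as context and predict the preference for the final row. Dropping the response-time column gives the choice-only input that the abstract describes as insufficient for a standard transformer.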

Problem

Research questions and friction points this paper is trying to address.

reward modeling

preference generalization

distribution shift

human feedback

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Learning

Reward Modeling

Preference Adaptation