🤖 AI Summary
Static reward models struggle to generalize to unseen human preference domains and are not robust to heterogeneous preferences. This work proposes a Transformer-based in-context reward adaptation framework that infers the reward structure of an unseen domain from only a few preference demonstrations, enabling cross-domain adaptation without retraining. The authors characterize an asymptotic bias, relative to the ground truth, that makes a standard Transformer insufficient for this task, and show that adding human response time as an auxiliary input signal enables adaptation to previously unseen domains. The reported findings indicate that the method improves the generalization and robustness of reward models under preference distribution shift.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human preference distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We show that a standard transformer architecture is insufficient for this task by characterizing its asymptotic bias relative to the ground truth, and that incorporating human response time as an auxiliary input signal enables the model to adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing it to represent heterogeneous rewards and handle preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.
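As a rough illustration of what "a small set of preference demonstrations" with response time as an auxiliary input could look like as model input, the sketch below packs demonstrations and one query into a token array. Everything here is an assumption made for illustration and is not taken from the paper: the function name `pack_context`, the fixed-length feature vectors for each response, the binary label convention, and the zero-masking of the query's label and response-time slots.

```python
import numpy as np

def pack_context(demos, query, use_response_time=True):
    """Stack preference demonstrations and one query into a 2-D token array.

    Hypothetical layout (an assumption, not the paper's format):
      demos: list of (feat_a, feat_b, label, response_time), where feat_a and
             feat_b are 1-D feature vectors, label is 1 if response a was
             preferred and 0 otherwise, and response_time is in seconds.
      query: (feat_a, feat_b) pair whose preference is to be predicted.
    Returns an array of shape (len(demos) + 1, 2 * d + n_extras).
    """
    n_extras = 2 if use_response_time else 1
    rows = []
    for feat_a, feat_b, label, rt in demos:
        # Append the observed label, and optionally the response time.
        extras = [float(label), float(rt)] if use_response_time else [float(label)]
        rows.append(np.concatenate([feat_a, feat_b, extras]))
    q_a, q_b = query
    # The query's label and response time are unknown, so those slots are
    # zeroed here; a real model would more likely carry an explicit mask flag.
    rows.append(np.concatenate([q_a, q_b, np.zeros(n_extras)]))
    return np.stack(rows)

demos = [
    (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), 1, 0.8),
    (np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0]), 0, 2.5),
]
query = (np.ones(3), np.zeros(3))
tokens = pack_context(demos, query)
```

A sequence model would then consume `tokens` and predict the query's preference from the last row. Setting `use_response_time=False` drops the extra column, which corresponds to the choice-only input that the abstract describes as insufficient for a standard transformer.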