🤖 AI Summary
Existing personalized reward models typically treat user context as a static or implicit signal, which makes it hard to capture the dynamic, multidimensional nature of human preferences. To address this limitation, this work proposes the P-Check framework, which, for the first time, introduces a learnable dynamic checklist mechanism. Through a plug-and-play checklist generator, P-Check explicitly models the multidimensional criteria underlying personalized judgments, and it integrates Preference-Contrastive Criterion Weighting, a contrastive training strategy, to make the resulting rewards more discriminative. Experimental results show that the method significantly outperforms baseline approaches in both reward prediction accuracy and downstream personalized generation, while also remaining more robust in out-of-distribution (OOD) scenarios.
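The summary stays at the architectural level, but the described flow admits a compact reading: a generator synthesizes user- and prompt-conditioned criteria, and the reward aggregates per-criterion judgments. The Python sketch below illustrates only that reading; `Criterion`, `generate_checklist`, and `score_criterion` are hypothetical names, not the paper's actual API.

```python
# Hypothetical sketch of checklist-guided personalized reward scoring.
# All names here are illustrative; the paper's implementation may differ.
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str      # natural-language criterion, e.g. "prefers concise answers"
    weight: float  # learned saliency score for this criterion

def p_check_reward(user_context: str, prompt: str, response: str,
                   generate_checklist, score_criterion) -> float:
    """Aggregate per-criterion judgments into one personalized reward."""
    # 1. Synthesize a dynamic checklist conditioned on user and prompt.
    checklist: list[Criterion] = generate_checklist(user_context, prompt)
    # 2. Score the response against each criterion; weighted average.
    total_weight = sum(c.weight for c in checklist) or 1.0
    return sum(c.weight * score_criterion(c.text, prompt, response)
               for c in checklist) / total_weight
```

Passing the generator and scorer in as callables keeps the sketch agnostic to whether they are prompted LLM calls or trained modules, which is all the "plug-and-play" description pins down.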
📝 Abstract
Recent approaches to personalized reward modeling have primarily focused on leveraging user interaction history to align model judgments with individual preferences. However, these approaches largely treat user context as a static or implicit conditioning signal, failing to capture the dynamic and multi-faceted nature of human judgment. In this paper, we propose P-Check, a novel personalized reward modeling framework that trains a plug-and-play checklist generator to synthesize dynamic evaluation criteria for guiding reward prediction. To better align these checklists with personalized nuances, we introduce Preference-Contrastive Criterion Weighting, a training strategy that assigns saliency scores to criteria based on their discriminative power for personalized judgment. Extensive experiments demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation, and it remains robust in OOD scenarios.
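The abstract names Preference-Contrastive Criterion Weighting without specifying its form. One plausible instantiation, sketched below purely as an assumption, weights each criterion by how strongly its score separates a preferred from a dispreferred response in a pairwise objective; the softmax-over-gaps weighting and all function names are illustrative, not taken from the paper.

```python
import torch

def contrastive_criterion_weights(scores_chosen: torch.Tensor,
                                  scores_rejected: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """Saliency weights over criteria, from per-criterion score tensors of
    shape [num_criteria] for a chosen/rejected response pair."""
    # A criterion is discriminative when it scores the preferred response
    # well above the dispreferred one; softmax normalizes gaps into weights.
    gaps = (scores_chosen - scores_rejected).detach()  # weights not trained through here
    return torch.softmax(gaps / temperature, dim=-1)

def weighted_pairwise_loss(scores_chosen: torch.Tensor,
                           scores_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss on criterion-weighted aggregate rewards."""
    weights = contrastive_criterion_weights(scores_chosen, scores_rejected)
    r_chosen = (weights * scores_chosen).sum(-1)
    r_rejected = (weights * scores_rejected).sum(-1)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
```

Detaching the gaps before the softmax is one design choice consistent with the stated goal: the weights emphasize already-discriminative criteria without letting the loss be minimized by inflating the weighting signal itself.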