P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of obtaining accurate, user-specific reward signals for personalizing large language models in open-world scenarios, where diverse preference expressions and poor generalization to new users hinder performance. To tackle this, we propose P-GenRM, a personalized generative reward model that uses structured evaluation chains to derive adaptive user profiles and scoring criteria. By integrating user prototype clustering with a dual-granularity (individual/prototype) preference scaling mechanism, P-GenRM dynamically fuses a user's own preferences with those of similar users at test time. Our approach introduces, for the first time, test-time user-based scaling and prototype-based transfer strategies, mitigating noise in inferred preferences and improving generalization to unseen users. Experiments demonstrate that P-GenRM achieves an average improvement of 2.31% on mainstream benchmarks and performs strongly on out-of-distribution data, with test-time scaling contributing an additional 3% performance gain.
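
The prototype-based transfer idea can be pictured as a small clustering step over user preference representations. The following is a minimal sketch, not the paper's implementation: the embedding dimensionality, the cluster count, the use of k-means, and the function names are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(user_embeddings: np.ndarray, n_prototypes: int = 4) -> KMeans:
    """Cluster users (one preference-embedding row per user) into prototypes.

    Illustrative stand-in for P-GenRM's User Prototypes; the paper does not
    specify k-means.
    """
    return KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(user_embeddings)

def nearest_prototype(km: KMeans, new_user_embedding: np.ndarray) -> int:
    """Assign an unseen user to the closest prototype, enabling transfer of
    that prototype's preference signals to a cold-start user."""
    return int(km.predict(new_user_embedding.reshape(1, -1))[0])

# Toy example: 100 users with 16-dim preference embeddings (synthetic).
rng = np.random.default_rng(0)
users = rng.normal(size=(100, 16))
km = build_prototypes(users)
print("prototype for a new user:", nearest_prototype(km, rng.normal(size=16)))
```

A new user with little feedback inherits the scoring tendencies of their nearest prototype, which is the generalization mechanism the summary describes.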

📝 Abstract
Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling to generalize to new users with limited feedback. To address these limitations, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across diverse scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of 2.31%, and generalizes strongly to an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, showing that personalized alignment strengthens with test-time compute.
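
To make the dual-granularity scaling concrete, a test-time fusion step might look like the sketch below. This is a hedged illustration under assumptions: sampling several evaluation chains per user, a similarity-weighted average over prototype members, and a blending weight `alpha` are plausible readings of the abstract, not the paper's exact formulation.

```python
from statistics import mean

def fuse_scores(individual_chain_scores: list[float],
                prototype_user_scores: list[tuple[float, float]],
                alpha: float = 0.7) -> float:
    """Illustrative dual-granularity test-time score fusion.

    individual_chain_scores: scores from several evaluation chains sampled
        for the target user (individual-level scaling aggregates them).
    prototype_user_scores: (similarity, score) pairs from users in the same
        prototype (prototype-level scaling incorporates similar users).
    alpha: trust placed in the individual signal; a hypothetical knob one
        would lower for new users so prototype transfer dominates.
    """
    # Individual level: aggregate the user's own sampled scoring schemes.
    s_ind = mean(individual_chain_scores)
    # Prototype level: similarity-weighted average over similar users.
    total_sim = sum(sim for sim, _ in prototype_user_scores)
    s_proto = sum(sim * s for sim, s in prototype_user_scores) / total_sim
    return alpha * s_ind + (1 - alpha) * s_proto

# A well-modeled user (high alpha) vs. a cold-start user (low alpha).
print(fuse_scores([0.8, 0.7, 0.9], [(0.9, 0.6), (0.5, 0.4)], alpha=0.8))
print(fuse_scores([0.8, 0.7, 0.9], [(0.9, 0.6), (0.5, 0.4)], alpha=0.2))
```

Sampling more evaluation chains and blending in prototype neighbors is what gives the method its test-time scalability: more compute yields a lower-variance individual estimate, while the prototype term covers users with sparse feedback.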
Problem

Research questions and friction points this paper is trying to address.

personalized alignment
reward modeling
user preferences
generalization
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Reward Model
Generative Reward Modeling
User Prototypes
Test-time Scaling
Preference Generalization