🤖 AI Summary
This work addresses a limitation of current LLM personalization methods: they generate in a single pass at inference time, so additional inference-time computation cannot be turned into better personalized outputs. To overcome this, we propose a Test-Time Personalization (TTP) framework that samples multiple candidates from a personalized policy model and selects the best one with a personalized reward model, enabling scalable optimization at inference time. We establish a unified scaling law for the Best-of-N performance of personalized reward models. The law reveals two failure modes, user-level collapse and query-level reward hacking, and we introduce a probabilistic reward model with learnable variance to mitigate them. Experiments show that TTP yields consistent scaling gains across multiple policy models and personalized generation tasks, and that the proposed scaling law closely matches the empirical performance curves.
📝 Abstract
Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model (Best-of-N selection). We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes: user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
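The Best-of-N selection step described above can be illustrated with a small simulation. The sketch below is not the paper's implementation: `best_of_n` and `simulate_best_of_n` are hypothetical names, the true utilities are assumed to be standard Gaussian, and the reward model is approximated as the true utility plus Gaussian noise of scale `noise_std` (with `noise_std=0.0` standing in for oracle selection).

```python
import random
import statistics
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Return the candidate that the reward function scores highest."""
    return max(candidates, key=reward_fn)


def simulate_best_of_n(n: int, noise_std: float,
                       trials: int = 2000, seed: int = 0) -> float:
    """Mean true utility of the selected candidate over many toy queries.

    Assumption (illustrative only): true utilities are N(0, 1) and the
    reward model sees the true utility plus N(0, noise_std^2) noise.
    """
    rng = random.Random(seed)
    picked = []
    for _ in range(trials):
        # Sample n candidates; each has a hidden true utility.
        true_utils = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # The toy reward model only observes a noisy version of that utility.
        scores = [u + noise_std * rng.gauss(0.0, 1.0) for u in true_utils]
        # Best-of-N: keep the candidate with the highest reward score.
        best_idx = best_of_n(range(n), lambda i: scores[i])
        picked.append(true_utils[best_idx])
    return statistics.fmean(picked)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        oracle = simulate_best_of_n(n, noise_std=0.0)
        noisy = simulate_best_of_n(n, noise_std=2.0)
        print(f"N={n:2d}  oracle={oracle:.3f}  noisy_rm={noisy:.3f}")
```

Comparing the two columns for increasing `N` gives a feel for how a noisier reward model captures less of the oracle gain. This toy setup does not model user-level collapse or query-level reward hacking, and its Gaussian assumption is a simplification rather than the paper's setting.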