🤖 AI Summary
This paper addresses reward hacking in large language model (LLM) personalization, where scalar reward models drive policies toward verbose, superficially personalized responses. To mitigate this, we propose the Critique-Post-Edit (CPE) reinforcement learning framework. Methodologically, CPE features: (1) a personalized generative reward model (GRM) that produces multi-dimensional scalar scores together with natural-language critiques, making user preferences explicit and the reward harder to game; (2) a post-edit mechanism in which the policy model revises its own outputs according to these critiques, yielding more targeted and efficient learning; and (3) training with proximal policy optimization (PPO) under a length-controlled evaluation protocol. Under strict token-length constraints, a personalized Qwen2.5-7B achieves an average 11% win-rate improvement over standard PPO, and a personalized Qwen2.5-14B surpasses GPT-4.1, demonstrating faithful, efficient, and controllable personalized generation.
📄 Abstract
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization: scalar reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks: a personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and a personalized Qwen2.5-14B surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
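To make the training loop concrete, the following is a minimal sketch of a single Critique-Post-Edit step: the GRM scores a draft on multiple dimensions and emits a textual critique, the policy revises the draft conditioned on that critique, and the revised output's score serves as the PPO reward. All function names, score dimensions, and the toy string-matching logic here are hypothetical stand-ins, not the paper's implementation.

```python
def generative_reward_model(profile, response):
    """Hypothetical GRM: multi-dimensional scalar scores plus a critique.

    A real GRM would be an LLM judging personalization quality; here we
    use trivial string heuristics purely to illustrate the data flow.
    """
    scores = {
        "personalization": 0.9 if profile["tone"] in response else 0.4,
        "conciseness": 1.0 if len(response.split()) <= 20 else 0.3,
    }
    critique_parts = []
    if scores["personalization"] < 0.5:
        critique_parts.append(f"Adopt the user's preferred tone: {profile['tone']}.")
    if scores["conciseness"] < 0.5:
        critique_parts.append("Shorten the response; avoid verbose padding.")
    return scores, " ".join(critique_parts)


def post_edit(response, critique, profile):
    """Hypothetical policy self-revision conditioned on the critique.

    In the framework, the policy model itself rewrites its draft; this
    stub just applies the critique's two possible suggestions literally.
    """
    revised = response
    if "tone" in critique:
        revised = f"({profile['tone']}) " + revised
    if "Shorten" in critique:
        revised = " ".join(revised.split()[:20])
    return revised


def cpe_step(profile, draft):
    """One critique-then-edit cycle; returns the revised output and reward."""
    _, critique = generative_reward_model(profile, draft)
    revised = post_edit(draft, critique, profile)
    new_scores, _ = generative_reward_model(profile, revised)
    # PPO reward: mean over score dimensions, computed on the revised output.
    reward = sum(new_scores.values()) / len(new_scores)
    return revised, reward


# Example: a verbose, non-personalized draft gets revised and re-scored.
profile = {"tone": "casual"}
draft = " ".join(["filler"] * 25)  # 25 words, no preferred tone
revised, reward = cpe_step(profile, draft)
```

The key design point the sketch illustrates is that the reward is attached to the critique-revised output rather than the raw draft, so the learning signal reflects what the critique actually asked for instead of an easily gamed scalar.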