Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the problem of reward hacking in large language model (LLM) personalization, which leads to redundant and superficial responses. To mitigate this, the authors propose the Critique-Post-Edit (CPE) reinforcement learning framework. Methodologically, CPE features: (1) a generative reward model (GRM) that jointly leverages multi-dimensional scalar scores and natural-language critiques to explicitly encode user preferences; (2) a policy-model self-correction mechanism that revises outputs based on these critiques to suppress reward hacking; and (3) length-controlled evaluation combined with proximal policy optimization (PPO). Under strict token-length constraints, CPE achieves an 11% average win-rate improvement for Qwen2.5-7B over baseline PPO, and personalized Qwen2.5-14B surpasses GPT-4.1, demonstrating superior fidelity, precision, and controllability in personalized generation.

๐Ÿ“ Abstract
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
Problem

Research questions and friction points this paper is trying to address.

Faithfully aligning LLMs with individual user preferences
Overcoming reward hacking in scalar-based personalization methods
Enabling controllable personalization through critique-based reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Personalized Generative Reward Model for multi-dimensional scoring
Integrates Critique-Post-Edit mechanism for output self-revision
Combines textual critiques with reinforcement learning framework
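The innovations above can be sketched as a single rollout of the critique-post-edit loop: draft, critique with multi-dimensional scores, revise, then re-score to obtain a scalar reward for PPO. The sketch below is a toy illustration only; in the paper the GRM and the policy are LLMs, and names such as `grm_review` and `post_edit` are hypothetical stand-ins, not the paper's API.

```python
# Toy sketch of one Critique-Post-Edit rollout (illustrative, not the paper's code).

def grm_review(response: str, preference: str):
    """Stand-in generative reward model: multi-dimensional scores plus a
    natural-language critique, mirroring the paper's GRM design."""
    scores = {
        "fidelity": 1.0 if preference in response else 0.0,   # honors the user preference?
        "brevity": 1.0 if len(response.split()) <= 8 else 0.0,  # within a length budget?
    }
    critique = []
    if scores["fidelity"] == 0.0:
        critique.append(f"Response ignores the preference '{preference}'.")
    if scores["brevity"] == 0.0:
        critique.append("Response exceeds the length budget; trim it.")
    return scores, " ".join(critique) or "Looks good."

def post_edit(response: str, critique: str, preference: str) -> str:
    """Stand-in policy self-revision: apply the critique to the draft."""
    revised = response
    if "ignores the preference" in critique:
        revised = f"{preference}: " + revised
    if "length budget" in critique:
        revised = " ".join(revised.split()[:8])
    return revised

def critique_post_edit_step(draft: str, preference: str):
    """Draft -> GRM critique -> revised output; re-score to get a scalar
    reward that a PPO update would consume."""
    _, critique = grm_review(draft, preference)
    revised = post_edit(draft, critique, preference)
    new_scores, _ = grm_review(revised, preference)
    reward = sum(new_scores.values()) / len(new_scores)
    return revised, reward

revised, reward = critique_post_edit_step(
    "Here is a very long and generic answer that ignores the user entirely",
    "concise bullet answers",
)
```

The key design point the sketch preserves is that the reward is computed on the *revised* output, so the policy is credited for acting on the critique rather than for verbose surface-level personalization.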