Language Model Personalization via Reward Factorization

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RLHF methods rely on a monolithic preference model and fail to accommodate individual user differences. This work proposes a personalized alignment framework based on reward factorization: each user's preferences are modeled as a low-dimensional linear combination of shared base reward functions, so user-specific rewards can be inferred from as few as ~10 feedback instances without training a separate model per user. The method combines a low-dimensional preference-space assumption, linear reward composition, and an extended RLHF pipeline with policy fine-tuning. In human evaluations it achieves a 67% win rate over default GPT-4o responses, and both synthetic and real-user experiments confirm substantial personalization gains. To the authors' knowledge, this is the first approach to enable efficient, scalable LLM personalization with minimal user interaction, combining high sample efficiency, computational tractability, and empirical effectiveness in aligning LLMs to heterogeneous user preferences.
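The core modeling idea above, a user-specific reward expressed as a weighted sum of shared base reward functions, can be sketched as follows. The base reward features and their names here are illustrative stand-ins, not the paper's actual reward models:

```python
import numpy as np

def base_rewards(response: str) -> np.ndarray:
    """Toy base reward features (e.g. brevity, verbosity, concreteness).
    Real base rewards would be learned models, not handcrafted rules."""
    return np.array([
        -len(response) / 100.0,        # rewards shorter responses
        response.count(".") / 5.0,     # rewards more sentences
        float("example" in response),  # rewards concrete examples
    ])

def user_reward(response: str, w: np.ndarray) -> float:
    """Personalized reward r_u(y) = w . [r_1(y), ..., r_K(y)]:
    a linear combination of the shared base rewards."""
    return float(w @ base_rewards(response))
```

Personalization then reduces to estimating the low-dimensional weight vector `w` per user, rather than training a new reward model from scratch.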

📝 Abstract
Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of base reward functions. Using only ~10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly. We validate our approach through experiments with both synthetic and real users, demonstrating significant personalization achieved by our method. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
Problem

Research questions and friction points this paper is trying to address.

Existing RLHF assumes a single universal preference model, ignoring individual user differences.
Training a separate reward model per user is impractical at scale.
Only a handful of feedback instances (~10) are typically available per user.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends RLHF to support per-user personalization
Represents user-specific rewards as linear combinations of shared base reward functions
Infers user preferences from ~10 feedback responses
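Once a user's weights are inferred, they can steer outputs. As a minimal stand-in for the paper's policy fine-tuning step, the sketch below scores candidate responses with the personalized reward and returns the best one (best-of-n reranking); the feature extractor and weights are hypothetical:

```python
import numpy as np

def pick_response(candidates, w, features):
    """Return the candidate whose feature vector scores highest under w."""
    scores = [float(w @ features(c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical features: [length penalty, contains a worked example]
feat = lambda s: np.array([-len(s) / 100.0, float("for example" in s)])

w = np.array([0.2, 1.0])   # this user strongly values concrete examples
cands = [
    "A terse answer.",
    "A longer answer with, for example, a concrete case.",
]
best = pick_response(cands, w, feat)
```

Reranking avoids any weight updates to the LLM; the paper's pipeline goes further and fine-tunes the policy against the inferred reward.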