One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Personalized alignment of large language models faces significant challenges due to sparse user feedback and poor generalization to unseen users. To address these issues, this work proposes a Meta Reward Modeling (MRM) framework that formulates personalized alignment as a meta-learning problem. MRM represents each user's reward model as a weighted combination of basis reward functions and integrates a Robust Personalization Objective (RPO), enabling rapid preference adaptation from limited feedback. The approach is unified within the Model-Agnostic Meta-Learning (MAML) framework, jointly optimizing for personalization and robustness. Experiments show that MRM consistently outperforms existing methods across multiple benchmark datasets, achieving superior few-shot personalized alignment and enhanced robustness for hard-to-learn users.
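The summary above describes each user's reward model as a weighted combination of basis reward functions. A minimal sketch of that representation (the basis functions, feature vector, and weights here are illustrative placeholders, not the paper's learned components):

```python
def user_reward(basis_rewards, user_weights):
    """Build a user-specific reward as a weighted sum of basis reward functions.

    basis_rewards: list of callables, each scoring a response feature vector.
    user_weights:  per-user mixing weights, one per basis function -- these are
                   the small set of parameters adapted from a user's feedback.
    """
    def reward(features):
        return sum(w * r(features) for w, r in zip(user_weights, basis_rewards))
    return reward

# Hypothetical basis functions, e.g. rewarding helpfulness and penalizing length.
basis = [lambda f: f[0], lambda f: -f[1]]
r_user = user_reward(basis, [0.7, 0.3])
```

Under this factorization, adapting to a new user reduces to updating a small weight vector rather than a full reward model, which is what makes few-shot personalization tractable.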

📝 Abstract
Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta-optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
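The MAML-style training described in the abstract can be sketched as a first-order meta-update over per-user tasks, with a softmax reweighting standing in for the RPO emphasis on hard-to-learn users. The learning rates, the exp(loss/τ) weighting, and the toy scalar loss below are assumptions for illustration, not the paper's exact objective:

```python
import math

def inner_adapt(w0, grad_fn, support, lr=0.1, steps=1):
    """A few gradient steps on one user's support (few-shot) feedback."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w, support)
    return w

def meta_step(w0, users, grad_fn, loss_fn, meta_lr=0.01, tau=1.0):
    """First-order MAML-style outer update with an RPO-style reweighting:
    users with higher post-adaptation query loss get larger influence."""
    losses, grads = [], []
    for support, query in users:
        w_u = inner_adapt(w0, grad_fn, support)   # per-user fast adaptation
        losses.append(loss_fn(w_u, query))
        grads.append(grad_fn(w_u, query))         # first-order approximation
    weights = [math.exp(l / tau) for l in losses]
    total = sum(weights)
    alpha = [a / total for a in weights]          # softmax over user losses
    meta_grad = sum(a * g for a, g in zip(alpha, grads))
    return w0 - meta_lr * meta_grad               # updated initialization

# Toy scalar example: each "user" prefers a different target weight value.
loss_fn = lambda w, target: (w - target) ** 2
grad_fn = lambda w, target: 2.0 * (w - target)
users = [(1.0, 1.0), (-3.0, -3.0)]  # (support target, query target) per user
w_new = meta_step(0.0, users, grad_fn, loss_fn)
```

In this toy run the second user is harder to fit from the shared initialization, so the softmax reweighting pulls the meta-update toward that user; the paper's RPO may use a different functional form to achieve the same emphasis.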
Problem

Research questions and friction points this paper is trying to address.

personalized alignment
reward modeling
few-shot adaptation
user preference
meta-learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta Reward Modeling
Personalized LLM Alignment
Model-Agnostic Meta-Learning
Robust Personalization Objective
Few-shot Adaptation