🤖 AI Summary
To personalize large language models (LLMs) when real-user preference data is scarce, this paper proposes Few-Shot Preference Optimization (FSPO), a meta-learning framework inspired by the in-context learning capabilities of LLMs. FSPO recasts reward modeling as a few-shot adaptation task, so that a personalized reward function can be constructed from a handful of a user's labeled preferences. Because real preference data is hard to collect at scale, the authors also generate over 1M synthetic personalized preferences with publicly available LLMs, and they find that this data must be both highly diverse and self-consistent to transfer to real users. Evaluated on up to 1,500 synthetic users across three domains (movie reviews, pedagogical adaptation, and general question answering), FSPO achieves an 87% average AlpacaEval win rate; with real human users on open-ended question answering, it attains a 72% win rate, substantially outperforming existing baselines. The core contributions are (1) a meta-learning formulation of preference optimization tailored to low-resource personalization, and (2) a set of design choices for generating high-fidelity synthetic preference data with LLMs.
📝 Abstract
Effective personalization of LLMs is critical for a broad range of user-facing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% AlpacaEval win rate on average in generating responses that are personalized to synthetic users and a 72% win rate with real human users in open-ended question answering.
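The few-shot framing described above can be illustrated with a small sketch. It is not the authors' implementation: the prompt layout, the `Preference` type, and all function names are hypothetical, and the Bradley-Terry preference loss is an assumed, commonly used objective that the abstract does not specify. The sketch shows only how a user's labeled preferences might be serialized into a conditioning context, and how a pairwise loss could be computed from two reward scores produced under that context.

```python
import math
from dataclasses import dataclass


@dataclass
class Preference:
    """One labeled preference from a single user (hypothetical structure)."""
    prompt: str
    chosen: str
    rejected: str


def build_few_shot_context(shots):
    """Serialize a user's few labeled preferences into a text prefix.

    The layout is an illustrative assumption, not the paper's format.
    """
    blocks = []
    for i, p in enumerate(shots, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Question: {p.prompt}\n"
            f"Preferred: {p.chosen}\n"
            f"Rejected: {p.rejected}"
        )
    return "\n\n".join(blocks)


def build_reward_input(shots, prompt, response):
    """Prepend the user's few-shot context to a new (prompt, response) pair."""
    context = build_few_shot_context(shots)
    return f"{context}\n\nQuestion: {prompt}\nResponse: {response}"


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the observed preference (assumed objective)."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

In a meta-learning setup of this kind, each training task would correspond to one user: a few of that user's preferences form the context, and the loss is evaluated on held-out preferences from the same user, with both reward scores computed by a model that reads the output of `build_reward_input`.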