🤖 AI Summary
Existing personalized alignment research is hindered by coarse-grained preference modeling and the absence of fine-grained, individual-level preference datasets. To address this, we introduce WikiPersona—the first fine-grained, real-world personalized alignment benchmark built upon extensively documented public figures. It requires models not only to generate responses aligned with a given persona’s preferences but also to produce verifiable background and preference descriptions, enabling interpretable alignment. Innovatively treating personas as prototypical individuals, we propose Preference Prefix Injection: a fine-tuning strategy that conditions generation on inferred individual preferences as prefix tokens. This yields a 12.3% improvement in alignment accuracy under preference conflict scenarios and significantly enhances cross-persona generalization and fairness. Comprehensive experiments—including few-shot prompting, supervised fine-tuning, and multi-dimensional evaluation—demonstrate superior verifiability, robustness, and personalized consistency over baselines.
📝 Abstract
Preference alignment has become a standard pipeline for finetuning models to follow *generic* human preferences. The majority of work seeks to optimize models to produce responses that would be preferable *on average*, simplifying the diverse and often *contradicting* space of human preferences. While research has increasingly focused on personalized alignment, i.e., adapting models to individual user preferences, there is a lack of personalized preference datasets that focus on nuanced individual-level preferences. To address this, we introduce WikiPersona: the first fine-grained personalization benchmark built on well-documented, famous individuals. Our dataset challenges models to align with these personas through an interpretable process: generating verifiable textual descriptions of a persona's background and preferences in addition to alignment. We systematically evaluate different personalization approaches and find that while few-shot prompting with preferences and fine-tuning fail to simultaneously ensure effectiveness and efficiency, using *inferred personal preferences* as prefixes enables effective personalization, especially on topics where preferences clash, while leading to more equitable generalization across unseen personas.
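The core idea of conditioning on inferred preferences as a prefix can be illustrated with a minimal sketch. The template wording, function name, and preference text below are illustrative assumptions, not the paper's exact prompt format:

```python
def build_prefixed_prompt(inferred_preferences: str, question: str) -> str:
    """Prepend an inferred preference description as prefix text, so that
    generation is conditioned on the persona's individual preferences.

    NOTE: the bracketed template below is a hypothetical format for
    illustration; the actual prefix construction may differ.
    """
    prefix = f"[Persona preferences] {inferred_preferences}\n"
    return prefix + f"[Question] {question}\n[Answer]"


# Example: the same question yields different conditioning contexts
# depending on which persona's inferred preferences are injected.
prompt = build_prefixed_prompt(
    "Prefers concise, evidence-based answers; dislikes speculation.",
    "Should governments subsidize space exploration?",
)
```

During fine-tuning, such a prefix would be prepended to each training example, so the model learns to treat the preference description as the controlling context for its response.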