🤖 AI Summary
Large language models (LLMs) face a cold-start challenge in personalization: existing approaches rely heavily on extensive historical user data and costly fine-tuning, hindering real-time adaptation to individual preferences.
Method: We propose T-POP, a test-time personalization paradigm that integrates dueling bandits with test-time alignment. Without updating model parameters, T-POP learns a lightweight reward function online via pairwise preference feedback and dynamically steers decoding through reward-guided sampling.
Contribution/Results: T-POP eliminates gradient-based optimization, drastically reducing computational overhead and inference latency. Experiments demonstrate that it surpasses state-of-the-art baselines with only a few interactions, and its personalization performance improves continuously with additional feedback. The method achieves high data efficiency, strong practicality, and scalability, enabling effective, low-cost, real-time LLM personalization for new users.
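To make the "learn a reward function online from pairwise feedback" idea concrete, here is a minimal, hypothetical sketch of the dueling-bandit loop: a linear reward model is updated with Bradley-Terry (logistic) gradient steps from simulated pairwise preferences, with no LLM gradients involved. All names (`duel`, `pick_pair`, the feature dimension, the learning rate) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: each candidate response is summarized by a fixed feature
# vector (in practice this could be an embedding of the decoded text).
DIM = 8
true_pref = rng.normal(size=DIM)  # hidden user preference (simulation only)
weights = np.zeros(DIM)           # lightweight online reward model, starts uninformed

def reward(features, w):
    """Linear reward used to score candidates; no model parameters change."""
    return features @ w

def duel(a, b):
    """Simulated user feedback: Bradley-Terry preference under the true reward."""
    p = 1.0 / (1.0 + np.exp(-(a - b) @ true_pref))
    return rng.random() < p  # True -> user prefers candidate a

def update(w, a, b, a_wins, lr=0.5):
    """Logistic (Bradley-Terry) gradient step on one pairwise outcome."""
    diff = a - b
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    y = 1.0 if a_wins else 0.0
    return w + lr * (y - p) * diff

def pick_pair(cands, w):
    """Toy query strategy: duel the current best against a random explorer."""
    best = int(np.argmax(reward(cands, w)))
    other = int(rng.integers(len(cands)))
    while other == best:
        other = int(rng.integers(len(cands)))
    return best, other

# Online loop: each round the frozen model proposes candidates, the user
# answers one pairwise query, and only the reward weights are updated.
for _ in range(200):
    cands = rng.normal(size=(16, DIM))  # stand-in for sampled continuations
    i, j = pick_pair(cands, weights)
    weights = update(weights, cands[i], cands[j], duel(cands[i], cands[j]))

# The learned reward should now correlate with the hidden preference.
cos = weights @ true_pref / (np.linalg.norm(weights) * np.linalg.norm(true_pref))
print(f"alignment with true preference: {cos:.2f}")
```

The key property this illustrates is the one claimed above: personalization improves with each interaction, and the only state that changes is a small weight vector, not the LLM.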
📝 Abstract
Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.
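The other half of the method, steering a frozen LLM's decoding with the learned reward, can be sketched as reward-guided re-ranking: sample several candidate continuations, then draw from a softmax over LM log-probability plus a scaled reward. This is a generic illustration of reward-guided sampling, not the paper's exact decoding rule; the candidates, log-probabilities, reward function, and the trade-off weight `beta` are all made-up stand-ins.

```python
import math
import random

random.seed(0)

def reward_guided_sample(candidates, logprobs, reward_fn, beta=2.0):
    """Sample one candidate from softmax(logprob + beta * reward).

    beta trades off fluency (the frozen LM's log-prob) against
    personalization (the online-learned reward).
    """
    scores = [lp + beta * reward_fn(c) for c, lp in zip(candidates, logprobs)]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    r = random.random() * sum(weights)
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]

# Toy usage: a stand-in reward that prefers short, informal replies.
cands = [
    "Certainly! Here is a detailed explanation...",
    "Sure, quick version:",
    "Hi!",
]
lps = [-1.0, -1.2, -2.5]                   # stand-in LM log-probs
informal_reward = lambda s: -0.1 * len(s)  # stand-in learned reward

picks = [reward_guided_sample(cands, lps, informal_reward) for _ in range(100)]
print(max(set(picks), key=picks.count))
```

Because the reward only re-weights candidates the frozen LM already proposed, decoding stays fluent while the output distribution shifts toward the user's preferences, which is the mechanism behind the low-latency, gradient-free personalization described above.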