🤖 AI Summary
Large language models (LLMs) face a cold-start challenge in personalization: existing approaches rely heavily on extensive historical user data and costly fine-tuning, hindering real-time adaptation to individual preferences.
Method: We propose T-POP, a test-time personalization paradigm that integrates dueling bandits with test-time alignment. Without updating model parameters, T-POP learns a lightweight reward function online via pairwise preference feedback and dynamically steers decoding through reward-guided sampling.
Contribution/Results: T-POP eliminates gradient-based optimization, drastically reducing computational overhead and inference latency. Experiments demonstrate that it surpasses state-of-the-art baselines with only a few interactions, and its personalization performance improves continuously with additional feedback. The method achieves high data efficiency, strong practicality, and scalability, enabling effective, low-cost, real-time LLM personalization for new users.
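To make the "learn a reward function online from pairwise feedback" idea concrete, here is a minimal, hypothetical sketch of the dueling-bandit loop: a linear reward model is updated with Bradley-Terry (logistic) gradient steps from simulated pairwise preferences, with no LLM gradients involved. All names (`duel`, `pick_pair`, the feature dimension, the learning rate) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: each candidate response is summarized by a fixed feature
# vector (in practice this could be an embedding of the decoded text).
DIM = 8
true_pref = rng.normal(size=DIM)  # hidden user preference (simulation only)
weights = np.zeros(DIM)           # lightweight online reward model, starts uninformed

def reward(features, w):
    """Linear reward used to score candidates; no model parameters change."""
    return features @ w

def duel(a, b):
    """Simulated user feedback: Bradley-Terry preference under the true reward."""
    p = 1.0 / (1.0 + np.exp(-(a - b) @ true_pref))
    return rng.random() < p  # True -> user prefers candidate a

def update(w, a, b, a_wins, lr=0.5):
    """Logistic (Bradley-Terry) gradient step on one pairwise outcome."""
    diff = a - b
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    y = 1.0 if a_wins else 0.0
    return w + lr * (y - p) * diff

def pick_pair(cands, w):
    """Toy query strategy: duel the current best against a random explorer."""
    best = int(np.argmax(reward(cands, w)))
    other = int(rng.integers(len(cands)))
    while other == best:
        other = int(rng.integers(len(cands)))
    return best, other

# Online loop: each round the frozen model proposes candidates, the user
# answers one pairwise query, and only the reward weights are updated.
for _ in range(200):
    cands = rng.normal(size=(16, DIM))  # stand-in for sampled continuations
    i, j = pick_pair(cands, weights)
    weights = update(weights, cands[i], cands[j], duel(cands[i], cands[j]))

# The learned reward should now correlate with the hidden preference.
cos = weights @ true_pref / (np.linalg.norm(weights) * np.linalg.norm(true_pref))
print(f"alignment with true preference: {cos:.2f}")
```

The key property this illustrates is the one claimed above: personalization improves with each interaction, and the only state that changes is a small weight vector, not the LLM.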
📝 Abstract
Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.
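The other half of the method, steering a frozen LLM's decoding with the learned reward, can be sketched as reward-guided re-ranking: sample several candidate continuations, then draw from a softmax over LM log-probability plus a scaled reward. This is a generic illustration of reward-guided sampling, not the paper's exact decoding rule; the candidates, log-probabilities, reward function, and the trade-off weight `beta` are all made-up stand-ins.

```python
import math
import random

random.seed(0)

def reward_guided_sample(candidates, logprobs, reward_fn, beta=2.0):
    """Sample one candidate from softmax(logprob + beta * reward).

    beta trades off fluency (the frozen LM's log-prob) against
    personalization (the online-learned reward).
    """
    scores = [lp + beta * reward_fn(c) for c, lp in zip(candidates, logprobs)]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    r = random.random() * sum(weights)
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]

# Toy usage: a stand-in reward that prefers short, informal replies.
cands = [
    "Certainly! Here is a detailed explanation...",
    "Sure, quick version:",
    "Hi!",
]
lps = [-1.0, -1.2, -2.5]                   # stand-in LM log-probs
informal_reward = lambda s: -0.1 * len(s)  # stand-in learned reward

picks = [reward_guided_sample(cands, lps, informal_reward) for _ in range(100)]
print(max(set(picks), key=picks.count))
```

Because the reward only re-weights candidates the frozen LM already proposed, decoding stays fluent while the output distribution shifts toward the user's preferences, which is the mechanism behind the low-latency, gradient-free personalization described above.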