Latent Preference Bandits

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
In personalized multi-armed bandits, each individual offers only a limited number of decision points, and a small set of latent states often fails to characterize heterogeneous reward responses accurately. Method: We relax the standard latent-variable assumption: instead of estimating precise reward distributions, we model only the individual's preference ranking over actions within each latent state. This ranking-based formulation accommodates diverse reward scales and adapts better to individual heterogeneity. We propose a Bayesian online algorithm based on posterior sampling, in which preference rankings are embedded directly into the inference process to substantially reduce exploration overhead. Results: Empirical evaluation shows that our approach matches conventional latent-variable bandit methods under homogeneous reward distributions, while significantly outperforming baselines under heterogeneous reward scales. These results indicate that ordinal preference modeling reduces exploration cost and improves generalization across users with disparate reward interpretations.

📝 Abstract
Bandit algorithms are guaranteed to solve diverse sequential decision-making problems, provided that a sufficient exploration budget is available. However, learning from scratch is often too costly for personalization tasks where a single individual faces only a small number of decision points. Latent bandits offer substantially reduced exploration times for such problems, given that the joint distribution of a latent state and the rewards of actions is known and accurate. In practice, finding such a model is non-trivial, and there may not exist a small number of latent states that explain the responses of all individuals. For example, patients with similar latent conditions may have the same preference in treatments but rate their symptoms on different scales. With this in mind, we propose relaxing the assumptions of latent bandits to require only a model of the *preference ordering* of actions in each latent state. This allows problem instances with the same latent state to vary in their reward distributions, as long as their preference orderings are equal. We give a posterior-sampling algorithm for this problem and demonstrate that its empirical performance is competitive with latent bandits that have full knowledge of the reward distribution when this is well-specified, and outperforms them when reward scales differ between instances with the same latent state.
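The abstract's core idea — posterior sampling over latent states that are described only by a preference ordering of actions — can be illustrated with a toy sketch. Everything below (the two hard-coded latent states, the Gaussian reward model, and the crude ordinal likelihood based on agreement of empirical means) is an assumption of this illustration, not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 latent states, 3 arms. Each latent state is
# described only by a preference ordering over arms (best arm first),
# not by exact reward distributions.
ORDERINGS = {
    0: [2, 1, 0],  # state 0 prefers arm 2, then arm 1, then arm 0
    1: [0, 1, 2],  # state 1 prefers arm 0, then arm 1, then arm 2
}
N_ARMS = 3

def rank_of(state, arm):
    """Position of `arm` in a state's preference ordering (0 = best)."""
    return ORDERINGS[state].index(arm)

def run(true_means, horizon=200, noise=0.1):
    """Posterior sampling over latent states using only ordinal evidence."""
    belief = np.ones(len(ORDERINGS)) / len(ORDERINGS)  # uniform prior
    emp_mean = np.zeros(N_ARMS)
    counts = np.zeros(N_ARMS)
    for _ in range(horizon):
        state = rng.choice(len(ORDERINGS), p=belief)  # sample a latent state
        arm = ORDERINGS[state][0]                     # play its preferred arm
        reward = rng.normal(true_means[arm], noise)   # simulated environment
        counts[arm] += 1
        emp_mean[arm] += (reward - emp_mean[arm]) / counts[arm]
        # Crude ordinal likelihood: a state gains weight for every pair of
        # observed arms whose empirical ranking agrees with its ordering.
        seen = np.flatnonzero(counts)
        for s in ORDERINGS:
            agree = sum(
                1
                for i in seen for j in seen
                if rank_of(s, i) < rank_of(s, j) and emp_mean[i] >= emp_mean[j]
            )
            belief[s] *= np.exp(agree)
        belief /= belief.sum()
    return belief
```

With reward means that favor arm 2 (consistent with state 0's ordering), the belief concentrates on state 0 regardless of the absolute reward scale, since only the ordering of empirical means enters the update — the property the abstract highlights for instances that share a latent state but rate rewards on different scales.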
Problem

Research questions and friction points this paper is trying to address.

Reducing exploration costs in personalized sequential decision-making tasks
Relaxing strict latent state assumptions to require only action preference ordering
Handling varying reward scales among instances with identical latent states
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relaxing latent bandit assumptions to require only a preference ordering of actions per latent state
Posterior-sampling algorithm that tolerates varying reward distributions within a latent state
Performance competitive with latent bandits given a well-specified reward model, and superior when reward scales differ