🤖 AI Summary
In real-world recommendation systems, users often cannot provide explicit numerical feedback but only pairwise preference comparisons (dueling feedback), posing a significant challenge for sequential decision-making. Method: This paper proposes the first online clustering dueling bandits framework for multi-user settings, introducing clustering into dueling bandits to exploit user similarity. We design two provably optimal algorithms: COLDB, a linear model capturing contextual information, and CONDB, a neural-network-based model capturing nonlinear preference structures. Contribution/Results: We derive rigorous upper bounds on cumulative regret, proving that user collaboration substantially reduces regret. Experiments on synthetic and real-world datasets demonstrate that our methods outperform existing baselines by 12%–28% in recommendation accuracy. This work breaks the conventional multi-armed bandit (MAB) reliance on numerical rewards and establishes a new paradigm for preference-driven, collaborative sequential decision-making.
📝 Abstract
The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first"clustering of dueling bandit algorithms"to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.