AI Summary
This work proposes the Conditionally Coupled Contextual (C3) Thompson Sampling algorithm to address a non-stationary contextual bandit setting commonly encountered in recommendation systems, characterized by dense arm features, nonlinear reward functions, and time-varying contexts that preserve temporal correlations. C3 is the first method to jointly model these challenges within a unified framework, combining an enhanced Nadaraya-Watson estimator in the embedding space with Thompson sampling to enable efficient online learning without frequent model retraining. Empirical evaluations demonstrate that C3 reduces average cumulative regret by 5.7% across four OpenML tabular datasets and achieves a 12.4% improvement in click-through rate on the MIND news recommendation benchmark, highlighting its effectiveness in real-world dynamic environments.
Abstract
Contextual bandits are useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits in which reward distributions change over time while the degree of correlation is preserved. This formulation lends itself to a wider set of applications, such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, allowing online learning without retraining. Empirical results show that C3 outperforms the next best algorithm with 5.7% lower average cumulative regret on four OpenML tabular datasets, and demonstrates a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.
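The core mechanism described above, a kernel-smoothed reward estimate driving Thompson sampling for Bernoulli rewards, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the Gaussian kernel choice, and the conversion of kernel weights into Beta pseudo-counts are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def nw_estimate(x, X, r, h=0.5):
    """Nadaraya-Watson estimate of the Bernoulli mean at context embedding x.

    X: (n, d) past context embeddings for one arm; r: (n,) binary rewards.
    Returns (p_hat, n_eff): kernel-weighted mean and effective sample size.
    """
    if len(r) == 0:
        return 0.5, 0.0  # cold start: uninformative prior mean, no evidence
    # Gaussian kernel weights in the embedding space (bandwidth h is illustrative)
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
    s = w.sum()
    if s == 0.0:
        return 0.5, 0.0
    return float(w @ r / s), float(s)

def choose_arm(x, histories, h=0.5):
    """Thompson sampling over kernel-smoothed Beta posteriors, one per arm.

    histories: list of (X, r) pairs, one per arm. The smoothed estimate is
    turned into Beta(1 + p*n_eff, 1 + (1-p)*n_eff) pseudo-counts (an
    assumption of this sketch), sampled, and the arm with the largest
    sample is played -- no model retraining is needed between rounds.
    """
    samples = []
    for X, r in histories:
        p, n_eff = nw_estimate(x, X, r, h)
        samples.append(rng.beta(1.0 + p * n_eff, 1.0 + (1.0 - p) * n_eff))
    return int(np.argmax(samples))
```

Because the posterior is built from kernel weights rather than fitted parameters, appending each new (embedding, reward) pair to the arm's history is the entire online update, which is the property the abstract highlights.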