A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

๐Ÿ“… 2026-03-17
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work proposes the Conditionally Coupled Contextual (C3) Thompson Sampling algorithm to address a non-stationary contextual bandit setting commonly encountered in recommendation systems, characterized by dense arm features, nonlinear reward functions, and time-varying contexts that preserve temporal correlations. C3 is the first method to jointly model these challenges within a unified framework, combining an enhanced Nadarayaโ€“Watson estimator in the embedding space with Thompson sampling to enable efficient online learning without frequent model retraining. Empirical evaluations demonstrate that C3 reduces average cumulative regret by 5.7% across four OpenML tabular datasets and achieves a 12.4% improvement in click-through rate on the MIND news recommendation benchmark, highlighting its effectiveness in real-world dynamic environments.

๐Ÿ“ Abstract
Contextual bandits are useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce conditionally coupled contextual (C3) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, allowing online learning without retraining. Empirical results show that C3 outperforms the next best algorithm, achieving 5.7% lower average cumulative regret on four OpenML tabular datasets and a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.
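The abstract's core mechanism — a kernel estimator over an embedding space feeding a Thompson sampling decision rule for Bernoulli rewards — can be sketched as follows. This is a minimal illustration, not the paper's actual C3 algorithm: the kernel choice, the bandwidth, the way context and arm features are combined, and the function names are all assumptions made for the sketch.

```python
# Hedged sketch: Nadaraya-Watson-style kernel smoothing in an embedding
# space combined with Thompson sampling for Bernoulli contextual bandits.
# All design choices (Gaussian kernel, fixed bandwidth, concatenating
# context with arm embeddings) are illustrative assumptions, not the
# paper's improved C3 estimator.
import numpy as np

rng = np.random.default_rng(0)

def nw_beta_params(query, embeddings, rewards, bandwidth=0.5):
    """Kernel-weighted pseudo-counts for a Beta posterior at `query`.

    Weights past (embedding, binary reward) pairs by a Gaussian kernel on
    squared distance, then treats the weighted success/failure mass as
    Beta(alpha, beta) pseudo-counts on top of a uniform Beta(1, 1) prior.
    """
    if len(embeddings) == 0:
        return 1.0, 1.0  # no history: fall back to the uniform prior
    d2 = np.sum((np.asarray(embeddings) - query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    successes = float(w @ np.asarray(rewards))
    failures = float(w.sum() - successes)
    return 1.0 + successes, 1.0 + failures

def choose_arm(context, arm_embeds, history):
    """Thompson sampling: draw from each arm's kernel-smoothed Beta
    posterior and play the arm with the largest draw."""
    samples = []
    for arm, emb in enumerate(arm_embeds):
        query = np.concatenate([context, emb])  # join context + arm features
        past_embeds, past_rewards = history[arm]
        a, b = nw_beta_params(query, past_embeds, past_rewards)
        samples.append(rng.beta(a, b))
    return int(np.argmax(samples))
```

Because the posterior is re-smoothed from the stored history at decision time, new observations are incorporated by simply appending to `history` — which reflects the abstract's claim of online learning without model retraining.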
Problem

Research questions and friction points this paper is trying to address.

contextual bandits
non-stationary rewards
feature-rich arms
correlated bandits
non-linear reward functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextual bandits
non-stationary rewards
Thompson sampling
Nadaraya-Watson estimator
online learning
๐Ÿ”Ž Similar Papers
No similar papers found.