🤖 AI Summary
This work addresses reward bias and optimal arm selection in multi-armed bandits under unobserved confounding and time-varying latent states. The authors propose a bandit algorithm that sidesteps explicit modeling of latent states: lagged contextual features and a collaborative probing strategy implicitly track latent-state dynamics and disentangle their influence on rewards. Presented as the first approach to handle non-stationarity and confounding bias without explicit state models, the algorithm combines computational efficiency with strong adaptability. Empirical evaluations across diverse experimental settings show it significantly outperforms classical methods, offering a practical, theoretically grounded solution for real-world deployment.
📝 Abstract
The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies, which implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.
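The abstract does not spell out the algorithmic details, but the core idea of replacing an explicit latent-state model with lagged contextual features can be illustrated concretely. Below is a minimal sketch, assuming a LinUCB-style base learner: each arm's feature vector is the current context concatenated with a fixed window of recent (context, reward) pairs, and a small random-probe rate stands in crudely for the coordinated probing the abstract mentions. The class name `LaggedContextLinUCB` and all parameters (`n_lags`, `alpha`, `probe_rate`) are hypothetical illustrations, not the authors' actual method.

```python
import numpy as np

class LaggedContextLinUCB:
    """Hypothetical sketch: a LinUCB variant whose features include lagged
    (context, reward) pairs, so latent-state dynamics are tracked implicitly
    rather than modeled. Random probes crudely stand in for coordinated
    probing; this is not the paper's actual algorithm."""

    def __init__(self, n_arms, ctx_dim, n_lags=2, alpha=1.0, probe_rate=0.05):
        self.n_arms, self.n_lags = n_arms, n_lags
        self.alpha, self.probe_rate = alpha, probe_rate
        # Feature vector: current context + n_lags past contexts + n_lags past rewards.
        self.dim = ctx_dim * (1 + n_lags) + n_lags
        # Per-arm ridge-regression statistics: A = I + sum(x x^T), b = sum(r x).
        self.A = [np.eye(self.dim) for _ in range(n_arms)]
        self.b = [np.zeros(self.dim) for _ in range(n_arms)]
        self.history = []  # recent (context, reward) pairs, newest last
        self.rng = np.random.default_rng(0)

    def _features(self, context):
        # Concatenate the current context with the lag window, zero-padded
        # until enough history has accumulated.
        lagged = self.history[-self.n_lags:]
        pad = self.n_lags - len(lagged)
        past_ctx = [c for c, _ in lagged] + [np.zeros_like(context)] * pad
        past_rew = [r for _, r in lagged] + [0.0] * pad
        return np.concatenate([context, *past_ctx, past_rew])

    def select(self, context):
        # Occasional forced exploration: a crude stand-in for the
        # coordinated probing strategy described in the abstract.
        if self.rng.random() < self.probe_rate:
            return int(self.rng.integers(self.n_arms))
        x = self._features(context)
        scores = []
        for a in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            # Mean estimate plus an upper-confidence bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        x = self._features(context)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.history.append((context, reward))
```

Because the lag window slides with time, the learner can pick up reward shifts driven by a slowly varying hidden state without ever estimating that state, which is the property the abstract claims for the state-model-free family.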