🤖 AI Summary
This paper studies online sequential decision-making under stochastically missing reward signals, overcoming the limitation of classical UCB algorithms that require fully observable rewards. Addressing unknown and nonparametric missingness mechanisms, we propose the first Doubly Robust UCB (DR-UCB) algorithm: it explicitly models the missingness process by integrating doubly robust estimation with the upper confidence bound framework, achieving near-optimal $sqrt{T}$ worst-case regret without prior assumptions on the missingness mechanism. We theoretically establish high-probability convergence of DR-UCB under generalized dependence structures. Empirical simulations confirm that its empirical regret closely matches the theoretical bound. Our key contribution is the first incorporation of double robustness into online policy learning under nonparametric reward missingness, significantly enhancing both robustness and statistical efficiency of policy selection under incomplete feedback.
📝 Abstract
This paper investigates the challenges of optimal online policy learning under missing data. State-of-the-art algorithms implicitly assume that rewards are always observable. I show that when rewards are missing at random, the Upper Confidence Bound (UCB) algorithm maintains optimal regret bounds; however, it selects suboptimal policies with high probability as soon as this assumption is relaxed. To overcome this limitation, I introduce a fully nonparametric algorithm-Doubly-Robust Upper Confidence Bound (DR-UCB)-which explicitly models the form of missingness through observable covariates and achieves a nearly-optimal worst-case regret rate of $widetilde{O}(sqrt{T})$. To prove this result, I derive high-probability bounds for a class of doubly-robust estimators that hold under broad dependence structures. Simulation results closely match the theoretical predictions, validating the proposed framework.