Sequential Decision Problems with Missing Feedback

📅 2025-07-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies online sequential decision-making under stochastically missing reward signals, overcoming the limitation of classical UCB algorithms that require fully observable rewards. Addressing unknown and nonparametric missingness mechanisms, we propose the first Doubly Robust UCB (DR-UCB) algorithm: it explicitly models the missingness process by integrating doubly robust estimation with the upper confidence bound framework, achieving near-optimal $sqrt{T}$ worst-case regret without prior assumptions on the missingness mechanism. We theoretically establish high-probability convergence of DR-UCB under generalized dependence structures. Empirical simulations confirm that its empirical regret closely matches the theoretical bound. Our key contribution is the first incorporation of double robustness into online policy learning under nonparametric reward missingness, significantly enhancing both robustness and statistical efficiency of policy selection under incomplete feedback.

Technology Category

Application Category

📝 Abstract
This paper investigates the challenges of optimal online policy learning under missing data. State-of-the-art algorithms implicitly assume that rewards are always observable. I show that when rewards are missing at random, the Upper Confidence Bound (UCB) algorithm maintains optimal regret bounds; however, it selects suboptimal policies with high probability as soon as this assumption is relaxed. To overcome this limitation, I introduce a fully nonparametric algorithm-Doubly-Robust Upper Confidence Bound (DR-UCB)-which explicitly models the form of missingness through observable covariates and achieves a nearly-optimal worst-case regret rate of $widetilde{O}(sqrt{T})$. To prove this result, I derive high-probability bounds for a class of doubly-robust estimators that hold under broad dependence structures. Simulation results closely match the theoretical predictions, validating the proposed framework.
Problem

Research questions and friction points this paper is trying to address.

Optimal online policy learning with missing feedback
Overcoming suboptimal policies when rewards are missing
Nonparametric algorithm for missing data with optimal regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

DR-UCB algorithm handles missing rewards robustly
Models missingness via observable covariates explicitly
Achieves near-optimal regret rate under dependence
🔎 Similar Papers
No similar papers found.