🤖 AI Summary
This paper studies the nonparametric contextual bandit problem under batch constraints, where the reward function is an unknown smooth function of the covariates and the policy is updated only at the end of each batch. To address the exploration–exploitation trade-off inherent in batched learning, the authors propose a dynamic binning mechanism in which bin widths adaptively scale with the batch sizes, combined with nonparametric regression and a minimax analysis to achieve efficient estimation within the batched framework. They prove that a constant number of policy updates suffices to attain the optimal online regret bound, up to logarithmic factors, and provide a matching lower bound. This is the first work to rigorously establish a performance equivalence between batched and online learning in the nonparametric setting, substantially reducing the update frequency and improving practical deployability.
📝 Abstract
We study nonparametric contextual bandits under batch constraints, where the expected reward of each action is modeled as a smooth function of the covariates and policy updates are made only at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch sizes. Our theoretical results show that, within the contextual bandit framework, a nearly constant number of policy updates can attain the optimal regret of the fully online setting.
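To make the dynamic binning idea concrete, here is a minimal sketch of a batched bandit with batch-dependent bin widths. All names and parameters (`batched_nonparametric_bandit`, the Hölder smoothness `beta`, the specific bin-width schedule `T_m^{-1/(2*beta+1)}`, the noise level) are illustrative assumptions, not the paper's actual algorithm: within each batch the policy is frozen, and at each batch boundary the covariate space is re-partitioned into bins whose width shrinks with the upcoming batch size, with each bin playing the arm that looks best locally.

```python
import numpy as np

def batched_nonparametric_bandit(batch_sizes, n_arms, reward_fn, beta=1.0, seed=0):
    """Sketch: batched contextual bandit on [0, 1] with dynamic binning.

    Before batch m of size T_m, the covariate space is split into bins of
    width roughly T_m^{-1/(2*beta+1)} (the classical nonparametric rate for
    Holder smoothness beta -- an assumed schedule, not the paper's exact one).
    Within a batch the piecewise-constant policy is frozen; it is refit only
    at batch boundaries from all data collected so far.
    """
    rng = np.random.default_rng(seed)
    xs, arms, rs = [], [], []          # data pooled across finished batches
    total_reward = 0.0
    for T_m in batch_sizes:
        # Bin width shrinks as the batch size grows.
        h = max(T_m ** (-1.0 / (2 * beta + 1)), 1e-3)
        n_bins = int(np.ceil(1.0 / h))
        # Refit the policy: in each bin, pick the arm with the best local
        # mean estimate; bins where some arm is unseen keep exploring.
        policy = np.full(n_bins, -1)   # -1 means "explore uniformly"
        if xs:
            x_arr, a_arr, r_arr = map(np.asarray, (xs, arms, rs))
            bin_idx = np.minimum((x_arr / h).astype(int), n_bins - 1)
            for b in range(n_bins):
                means = np.full(n_arms, -np.inf)
                for a in range(n_arms):
                    mask = (bin_idx == b) & (a_arr == a)
                    if mask.any():
                        means[a] = r_arr[mask].mean()
                if np.isfinite(means).all():
                    policy[b] = int(np.argmax(means))
        # Play the frozen policy for the entire batch.
        for _ in range(T_m):
            x = rng.uniform()
            b = min(int(x / h), n_bins - 1)
            a = policy[b] if policy[b] >= 0 else int(rng.integers(n_arms))
            r = reward_fn(x, a) + rng.normal(scale=0.1)
            xs.append(x); arms.append(int(a)); rs.append(r)
            total_reward += r
    return total_reward
```

With only three batches (hence three policy updates), a learner of this shape can already concentrate play on the locally best arm, which is the qualitative point of the constant-update result; the exact widths and update schedule that achieve the minimax rate are those derived in the paper.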