Batched Nonparametric Contextual Bandits

📅 2024-02-27
🏛️ IEEE Transactions on Information Theory
📈 Citations: 1
Influential: 1
🤖 AI Summary
This paper studies the nonparametric contextual bandit problem under batch constraints, where the reward function is an unknown smooth function of covariates and the policy is updated only upon completion of each batch. To address the exploration-exploitation trade-off inherent in batched learning, the authors propose a dynamic binning mechanism: the covariate space is split into bins whose widths are aligned with the batch sizes, and rewards are estimated nonparametrically within each bin. They establish that a nearly constant number of policy updates suffices to attain the optimal online regret bound, up to logarithmic factors, and provide a matching minimax lower bound. The results indicate that batched learning can match fully online learning in the nonparametric setting while sharply reducing update frequency, which eases practical deployment.

📝 Abstract
We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that, for nonparametric contextual bandits, a nearly constant number of policy updates can attain the optimal regret of the fully online setting.
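
The binning idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `run_batched_binning`, the batch end points, the bin counts, the confidence radius, and the Lipschitz-style bias term are all assumptions chosen for illustration. It only shows the general pattern of a policy that stays fixed within a batch, eliminates arms per bin at the end of each batch, and refines the bins for the next batch.

```python
import math
import random


def run_batched_binning(T, batch_ends, bins_per_batch, reward_fns, noise=0.1, seed=0):
    """Illustrative batched arm elimination on [0, 1] with bins refined per batch.

    All tuning choices here (schedule, bin counts, radius) are hypothetical.
    Returns (cumulative regret, number of batches run, final active sets).
    """
    rng = random.Random(seed)
    n_arms = len(reward_fns)
    active = [set(range(n_arms))]  # one bin covering [0, 1], all arms active
    t, regret, n_batches = 0, 0.0, 0

    for end, n_bins in zip(batch_ends, bins_per_batch):
        old = len(active)
        assert n_bins % old == 0, "bins must be nested across batches"
        # Refine the partition: each new bin inherits its parent's active arms.
        active = [set(active[b * old // n_bins]) for b in range(n_bins)]
        sums = [[0.0] * n_arms for _ in range(n_bins)]
        counts = [[0] * n_arms for _ in range(n_bins)]

        # Within a batch the policy is fixed: sample uniformly from active arms.
        while t < min(end, T):
            x = rng.random()
            b = min(int(x * n_bins), n_bins - 1)
            arm = rng.choice(sorted(active[b]))
            means = [f(x) for f in reward_fns]
            y = means[arm] + rng.gauss(0.0, noise)
            sums[b][arm] += y
            counts[b][arm] += 1
            regret += max(means) - means[arm]
            t += 1

        # Policy update happens only here, at the end of the batch.
        for b in range(n_bins):
            if any(counts[b][a] == 0 for a in active[b]):
                continue  # not enough data in this bin to compare arms
            est = {a: sums[b][a] / counts[b][a] for a in active[b]}
            # Assumed radius: noise term plus a bias term for within-bin variation.
            rad = {
                a: noise * math.sqrt(2.0 * math.log(T) / counts[b][a]) + 1.0 / n_bins
                for a in active[b]
            }
            best_lower = max(est[a] - rad[a] for a in active[b])
            active[b] = {a for a in active[b] if est[a] + rad[a] >= best_lower}
        n_batches += 1

    return regret, n_batches, active
```

For example, `run_batched_binning(2000, [200, 800, 2000], [2, 4, 8], [lambda x: x, lambda x: 1.0 - x])` runs three batches with 2, 4, and 8 bins. Narrower bins in later batches reduce the within-bin bias, and the larger batches supply enough samples per bin to keep the variance in check, which is the alignment between bin width and batch size that the abstract describes.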
Problem

Research questions and friction points this paper is trying to address.

Study nonparametric contextual bandits with batch constraints
Establish minimax regret lower bound and optimal algorithm
Dynamic covariate space splitting for optimal regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic covariate space splitting into bins
Optimal regret with constant policy updates
Batched nonparametric contextual bandits algorithm
Rong Jiang
Committee on Computational and Applied Mathematics, University of Chicago
Cong Ma
Department of Statistics, University of Chicago