🤖 AI Summary
This paper studies the nonparametric contextual bandit problem under batch constraints, where the reward function is an unknown smooth function of the covariates and the policy is updated only at the end of each batch. To address the exploration–exploitation trade-off inherent in batched learning, the authors propose a dynamic binning mechanism in which bin widths adaptively scale with the batch sizes, combined with nonparametric regression and a minimax analysis to achieve efficient estimation within the batched framework. They prove that a constant number of policy updates suffices to attain the optimal online regret bound, up to logarithmic factors, and provide a matching lower bound. This is the first work to rigorously establish a performance equivalence between batched and online learning in the nonparametric setting, substantially reducing the update frequency and improving practical deployability.
📝 Abstract
We study nonparametric contextual bandits under batch constraints, where the expected reward of each action is modeled as a smooth function of the covariates and policy updates are made only at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch sizes. Our theoretical results show that, within the contextual bandit framework, a nearly constant number of policy updates can attain the optimal regret of the fully online setting.
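To make the dynamic binning idea concrete, here is a minimal sketch of a batched bandit with batch-dependent bin widths. All names and parameters (`batched_nonparametric_bandit`, the Hölder smoothness `beta`, the specific bin-width schedule `T_m^{-1/(2*beta+1)}`, the noise level) are illustrative assumptions, not the paper's actual algorithm: within each batch the policy is frozen, and at each batch boundary the covariate space is re-partitioned into bins whose width shrinks with the upcoming batch size, with each bin playing the arm that looks best locally.

```python
import numpy as np

def batched_nonparametric_bandit(batch_sizes, n_arms, reward_fn, beta=1.0, seed=0):
    """Sketch: batched contextual bandit on [0, 1] with dynamic binning.

    Before batch m of size T_m, the covariate space is split into bins of
    width roughly T_m^{-1/(2*beta+1)} (the classical nonparametric rate for
    Holder smoothness beta -- an assumed schedule, not the paper's exact one).
    Within a batch the piecewise-constant policy is frozen; it is refit only
    at batch boundaries from all data collected so far.
    """
    rng = np.random.default_rng(seed)
    xs, arms, rs = [], [], []          # data pooled across finished batches
    total_reward = 0.0
    for T_m in batch_sizes:
        # Bin width shrinks as the batch size grows.
        h = max(T_m ** (-1.0 / (2 * beta + 1)), 1e-3)
        n_bins = int(np.ceil(1.0 / h))
        # Refit the policy: in each bin, pick the arm with the best local
        # mean estimate; bins where some arm is unseen keep exploring.
        policy = np.full(n_bins, -1)   # -1 means "explore uniformly"
        if xs:
            x_arr, a_arr, r_arr = map(np.asarray, (xs, arms, rs))
            bin_idx = np.minimum((x_arr / h).astype(int), n_bins - 1)
            for b in range(n_bins):
                means = np.full(n_arms, -np.inf)
                for a in range(n_arms):
                    mask = (bin_idx == b) & (a_arr == a)
                    if mask.any():
                        means[a] = r_arr[mask].mean()
                if np.isfinite(means).all():
                    policy[b] = int(np.argmax(means))
        # Play the frozen policy for the entire batch.
        for _ in range(T_m):
            x = rng.uniform()
            b = min(int(x / h), n_bins - 1)
            a = policy[b] if policy[b] >= 0 else int(rng.integers(n_arms))
            r = reward_fn(x, a) + rng.normal(scale=0.1)
            xs.append(x); arms.append(int(a)); rs.append(r)
            total_reward += r
    return total_reward
```

With only three batches (hence three policy updates), a learner of this shape can already concentrate play on the locally best arm, which is the qualitative point of the constant-update result; the exact widths and update schedule that achieve the minimax rate are those derived in the paper.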