On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

📅 2026-05-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work studies offline multi-armed bandit learning under KL regularization and aims to characterize its optimal sample complexity. Combining a policy coverage coefficient, information-theoretic lower-bound constructions, and a sharper upper-bound analysis of the existing KL-PCB algorithm (Zhao et al., 2026), it establishes upper and lower sample complexity bounds that match, up to poly-logarithmic factors, across the full range of regularization strengths. KL-PCB is shown to achieve a sample complexity of $\widetilde{O}(\eta S A C^{\pi^*}/\varepsilon)$ under strong regularization and $\widetilde{\Omega}(S A C^{\pi^*}/\varepsilon^2)$ under weak regularization. Together, these results give a nearly complete characterization of the statistical limits of the problem.
📝 Abstract
Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains far from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(\eta S A C^{\pi^*}/\epsilon)$ under large regularization $\eta = \tilde{O}(\epsilon^{-1})$, and a sample complexity of $\tilde{\Omega}(S A C^{\pi^*}/\epsilon^2)$ under small regularization $\eta = \tilde{\Omega}(\epsilon^{-1})$, where $\eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{\pi^*}$ is the policy coverage coefficient at the optimal policy $\pi^*$, $\epsilon$ is the desired sub-optimality, and $\tilde{O}$ and $\tilde{\Omega}$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which match the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
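To make the two regimes quoted in the abstract concrete, the following minimal sketch evaluates the order-level rate with all constants and poly-logarithmic factors dropped. The function name `rate_up_to_logs`, the parameter name `coverage` (standing in for $C^{\pi^*}$), and the exact switch point $\eta = 1/\epsilon$ are assumptions made for illustration; the abstract only places the boundary at $\eta$ of order $\epsilon^{-1}$ up to logarithmic factors.

```python
def rate_up_to_logs(eta: float, S: int, A: int, coverage: float, eps: float) -> float:
    """Order-level sample complexity from the abstract.

    Constants and poly-logarithmic factors are dropped, so the value is
    only meaningful for comparing scalings, not as a sample count.
    The hard threshold at eta = 1/eps is an illustrative assumption.
    """
    if eta <= 1.0 / eps:
        # Large-regularization regime: eta * S * A * C / eps
        return eta * S * A * coverage / eps
    # Small-regularization regime: S * A * C / eps^2
    return S * A * coverage / eps**2


# Example with S=2 contexts, A=4 arms, coverage 2.0, target eps=0.25.
small_eta = rate_up_to_logs(1.0, 2, 4, 2.0, 0.25)  # large-regularization branch
large_eta = rate_up_to_logs(8.0, 2, 4, 2.0, 0.25)  # small-regularization branch
boundary = rate_up_to_logs(4.0, 2, 4, 2.0, 0.25)   # eta = 1/eps
```

At $\eta = 1/\epsilon$ the two expressions coincide, so under these simplifications the rate can equivalently be read as $S A C^{\pi^*} \cdot \min(\eta/\epsilon,\ 1/\epsilon^2)$.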
Problem

Research questions and friction points this paper is trying to address.

offline multi-armed bandits
KL regularization
sample complexity
optimal policy
policy coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL regularization
offline multi-armed bandits
sample complexity
policy coverage
minimax optimality