On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates offline multi-armed bandit learning under KL regularization, aiming to characterize its optimal sample complexity. By combining policy coverage coefficients, information-theoretic lower-bound constructions, and an upper-bound analysis of the algorithm, the study establishes matching minimax sample complexity bounds across the full range of regularization strengths. The KL-PCB algorithm (Zhao et al., 2026) achieves a sample complexity of $\tilde{O}(\eta S A C^{\pi^*}/\varepsilon)$ under strong regularization and $\tilde{\Omega}(S A C^{\pi^*}/\varepsilon^2)$ under weak regularization, and the upper and lower bounds match over the entire range of regularization strengths. Together, these results give a nearly complete characterization of the statistical limits of the problem.

📝 Abstract

Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains far from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(\eta S A C^{\pi^*}/\varepsilon)$ under large regularization $\eta = \tilde{O}(\varepsilon^{-1})$, and a sample complexity of $\tilde{\Omega}(S A C^{\pi^*}/\varepsilon^2)$ under small regularization $\eta = \tilde{\Omega}(\varepsilon^{-1})$, where $\eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{\pi^*}$ is the policy coverage coefficient at the optimal policy $\pi^*$, $\varepsilon$ is the desired sub-optimality, and $\tilde{O}$ and $\tilde{\Omega}$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which match the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
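
To make the two regimes concrete, the following is a minimal sketch that evaluates the two rates quoted in the abstract with all constants and poly-logarithmic factors dropped. The function names are hypothetical, and using $\eta \varepsilon \le 1$ as the exact boundary between the regimes is an assumption made for illustration; the abstract specifies the boundary only up to poly-logarithmic factors.

```python
def rate_large_regularization(eta, S, A, C, eps):
    """Rate quoted for eta = O~(1/eps): eta * S * A * C / eps (constants and logs dropped)."""
    return eta * S * A * C / eps


def rate_small_regularization(S, A, C, eps):
    """Rate quoted for eta = Omega~(1/eps): S * A * C / eps^2 (constants and logs dropped)."""
    return S * A * C / eps ** 2


def quoted_rate(eta, S, A, C, eps):
    """Select the rate for the regime that eta falls into.

    Assumption: the regimes are split exactly at eta * eps = 1. The abstract
    only gives this boundary up to poly-logarithmic factors.
    """
    if eta * eps <= 1.0:
        return rate_large_regularization(eta, S, A, C, eps)
    return rate_small_regularization(S, A, C, eps)
```

Under these assumptions the two expressions coincide at $\eta = \varepsilon^{-1}$, so the selected rate equals $\min(\eta, \varepsilon^{-1}) \cdot S A C^{\pi^*}/\varepsilon$ and stops growing with $\eta$ once $\eta$ exceeds $\varepsilon^{-1}$.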

Problem

Research questions and friction points this paper is trying to address.

offline multi-armed bandits

KL regularization

sample complexity

optimal policy

policy coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

KL regularization

offline multi-armed bandits

sample complexity