🤖 AI Summary
In contextual bandits, adaptive data collection invalidates classical least-squares inference: for general adaptive policies, valid confidence intervals for linear functionals of the parameter require an “adaptivity penalty”, an inflation of the interval width of order √(d log T). This work proposes a penalized EXP4 policy for linear contextual bandits and shows that it satisfies the Lai–Wei stability condition, under which the empirical feature covariance concentrates around a deterministic limit. As a consequence, the ordinary least-squares estimator is asymptotically normal and classical Wald-type confidence intervals attain asymptotic 1−α coverage without any inflation, despite the non-i.i.d. adaptive data. The same algorithm achieves regret that is minimax optimal up to logarithmic factors, so statistical validity (asymptotic normality and consistent variance estimation) and regret optimality hold simultaneously. Simulations illustrate the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals. The core contribution is reconciling a perceived trade-off in adaptive inference: a single contextual bandit method combines algorithmic stability (for valid, non-inflated inference) with near-optimal regret (for good decision-making).
📝 Abstract
Statistical inference in contextual bandits is complicated by the adaptive, non-i.i.d. nature of the data. A growing body of work has shown that classical least-squares inference may fail under adaptive sampling, and that constructing valid confidence intervals for linear functionals of the model parameter typically requires paying an unavoidable inflation of order $\sqrt{d \log T}$. This phenomenon, often referred to as the price of adaptivity, highlights the inherent difficulty of reliable inference under general contextual bandit policies.
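Concretely, the price shows up in interval width. As a schematic (in generic notation assumed here, not taken from the paper: $V_T = \sum_{t \le T} x_t x_t^\top$ is the empirical Gram matrix and $c$ the functional of interest), adaptivity-robust intervals built from self-normalized bounds compare to the classical Wald width as

$$\underbrace{z_{1-\alpha/2}\,\hat{\sigma}\sqrt{c^\top V_T^{-1} c}}_{\text{classical Wald width}} \qquad \text{vs.} \qquad \underbrace{C\sqrt{d \log T}\cdot\hat{\sigma}\sqrt{c^\top V_T^{-1} c}}_{\text{adaptivity-robust width}},$$

with $C$ an absolute constant, so the robust interval is wider by exactly the $\sqrt{d \log T}$ factor discussed above.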
A key structural property that circumvents this limitation is the *stability* condition of Lai and Wei, which requires the empirical feature covariance to concentrate around a deterministic limit. When stability holds, the ordinary least-squares estimator satisfies a central limit theorem, and classical Wald-type confidence intervals, designed for i.i.d. data, become asymptotically valid even under adaptation, *without* incurring the $\sqrt{d \log T}$ price of adaptivity.
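In symbols, one standard way to state the stability condition and its consequence (again in assumed generic notation, with $x_t$ the feature chosen at round $t$, $\theta^*$ the true parameter, and $\Sigma \succ 0$ deterministic) is

$$\frac{1}{T}\sum_{t=1}^{T} x_t x_t^\top \;\xrightarrow{p}\; \Sigma \qquad \Longrightarrow \qquad \sqrt{T}\,\big(\hat{\theta}_T - \theta^*\big) \;\xrightarrow{d}\; \mathcal{N}\big(0,\; \sigma^2 \Sigma^{-1}\big),$$

so the plug-in Wald interval $c^\top \hat{\theta}_T \pm z_{1-\alpha/2}\,\hat{\sigma}\sqrt{c^\top V_T^{-1} c}$ attains asymptotic $1-\alpha$ coverage at the i.i.d. rate, with no $\sqrt{d \log T}$ correction.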
In this paper, we propose and analyze a penalized EXP4 algorithm for linear contextual bandits. Our first main result shows that this procedure satisfies the Lai–Wei stability condition and therefore admits valid Wald-type confidence intervals for linear functionals. Our second result establishes that the same algorithm achieves regret guarantees that are minimax optimal up to logarithmic factors, demonstrating that stability and statistical efficiency can coexist within a single contextual bandit method. Finally, we complement our theory with simulations illustrating the empirical normality of the resulting estimators and the sharpness of the corresponding confidence intervals.
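The abstract does not describe the simulation design, so the following is only a minimal sketch of the phenomenon being tested: run a linear contextual bandit under a simple stabilizing policy (ε-greedy here, standing in for the paper's penalized EXP4, whose details are not given above), then check the empirical coverage of the *uncorrected* Wald interval for a linear functional $c^\top \theta^*$. All names and parameters below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sanity check (not the paper's experiment): empirical coverage
# of an uncorrected Wald interval under adaptively collected bandit data.
rng = np.random.default_rng(0)
d, K, T, eps, alpha = 3, 5, 2000, 0.1, 0.05
theta_star = rng.normal(size=d)
c = np.ones(d) / np.sqrt(d)   # linear functional of interest
sigma = 0.5                   # noise sd, treated as known for simplicity

def run_once():
    V = 1e-6 * np.eye(d)      # Gram matrix, tiny ridge for early invertibility
    b = np.zeros(d)           # running X^T y
    theta_hat = np.zeros(d)
    for _ in range(T):
        arms = rng.normal(size=(K, d))            # fresh contexts each round
        if rng.random() < eps:
            a = rng.integers(K)                   # forced exploration
        else:
            a = int(np.argmax(arms @ theta_hat))  # greedy on current estimate
        x = arms[a]
        r = x @ theta_star + sigma * rng.normal()
        V += np.outer(x, x)
        b += r * x
        theta_hat = np.linalg.solve(V, b)         # online OLS update
    # Uncorrected (i.i.d.-style) Wald interval for c^T theta*
    half_width = norm.ppf(1 - alpha / 2) * sigma * np.sqrt(c @ np.linalg.solve(V, c))
    return abs(c @ theta_hat - c @ theta_star) <= half_width

coverage = np.mean([run_once() for _ in range(200)])
print(f"empirical coverage of nominal {1 - alpha:.0%} interval: {coverage:.3f}")
```

If the data-collecting policy is stable, the printed coverage should sit near the nominal 95%; strongly adaptive policies (e.g., setting eps = 0 for pure greedy) are the regime in which classical Wald coverage is known to degrade, which is what the paper's stability analysis is designed to rule out.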