🤖 AI Summary
This paper investigates the asymptotic behavior and regret performance of the variance-aware UCB-V algorithm for the multi-armed bandit (MAB) problem. Addressing the lack of a precise characterization of arm-selection frequencies and the uncertainty regarding convergence, we establish, for the first time, an exact asymptotic characterization of UCB-V's arm-selection rates—revealing potential non-deterministic convergence phenomena. Building upon this, we derive the first high-probability non-asymptotic bound on arm-selection rates. Leveraging this bound, we obtain a refined regret upper bound of order $O(\sqrt{T \log T})$, which improves upon the guarantees of classical UCB in heterogeneous-variance settings. Notably, this is the first non-asymptotic regret bound of this order achieved by any variance-aware algorithm, substantially advancing the theoretical understanding and performance guarantees for UCB-V.
📝 Abstract
In this paper, we study the behavior of the Upper Confidence Bound-Variance (UCB-V) algorithm for the Multi-Armed Bandit (MAB) problem, a variant of the canonical Upper Confidence Bound (UCB) algorithm that incorporates variance estimates into its decision-making process. More precisely, we provide an asymptotic characterization of the arm-pulling rates for UCB-V, extending recent results for the canonical UCB in Kalvit and Zeevi (2021) and Khamaru and Zhang (2024). In an interesting contrast to the canonical UCB, our analysis reveals that the behavior of UCB-V can exhibit instability, meaning that the arm-pulling rates may not always be asymptotically deterministic. Beyond the asymptotic characterization, we also provide non-asymptotic bounds for the arm-pulling rates in the high-probability regime, offering insights into the regret analysis. As an application of this high-probability result, we establish that UCB-V can achieve a more refined regret bound, previously unknown even for more complicated and advanced variance-aware online decision-making algorithms.
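To make the variance-aware index concrete, the following is a minimal simulation sketch of UCB-V on Bernoulli arms. It assumes the standard empirical-Bernstein form of the UCB-V index from Audibert, Munos, and Szepesvári (2009), with exploration function $\mathcal{E}(t) = \zeta \log t$; the parameter names (`b` for the reward range, `zeta` for the exploration constant) and the specific values used are illustrative assumptions, not choices taken from this paper.

```python
import math
import random

def ucbv_index(mean, var, count, t, b=1.0, zeta=1.2):
    """UCB-V index: empirical mean plus a variance-aware (empirical-
    Bernstein) bonus, following Audibert et al. (2009)."""
    e = zeta * math.log(t)  # exploration function E(t) = zeta * log t
    # First bonus term scales with the empirical variance; the second
    # corrects for the reward range b.
    return mean + math.sqrt(2.0 * var * e / count) + 3.0 * b * e / count

def run_ucbv(arm_probs, horizon, seed=0):
    """Simulate UCB-V on Bernoulli arms; returns per-arm pull counts."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k
    sums = [0.0] * k
    sqsums = [0.0] * k

    def pull(i):
        r = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        sums[i] += r
        sqsums[i] += r * r

    for i in range(k):  # initialize by pulling each arm once
        pull(i)
    for t in range(k + 1, horizon + 1):
        indices = []
        for i in range(k):
            mean = sums[i] / counts[i]
            # plug-in empirical variance (clipped at 0 for safety)
            var = max(sqsums[i] / counts[i] - mean * mean, 0.0)
            indices.append(ucbv_index(mean, var, counts[i], t))
        pull(max(range(k), key=indices.__getitem__))
    return counts

counts = run_ucbv([0.9, 0.1], horizon=2000)
```

In such a run, the arm-pulling counts `counts` are the quantities whose asymptotic rates the paper characterizes; tracking them across many seeds is one way to observe the non-deterministic limiting behavior the abstract describes.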