Achieving adaptivity and optimality for multi-armed bandits using Exponential Kullback-Leibler Maillard Sampling

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the multi-armed bandit problem with rewards drawn from a One-Parameter Exponential Distribution (OPED) family and proposes expklms (Exponential Kullback-Leibler Maillard Sampling), the first algorithm to achieve several theoretical optimality criteria simultaneously: asymptotic optimality (A.O.), minimax optimality up to a logarithmic factor (M.O.), the Sub-UCB property, and a variance-adaptive worst-case regret bound of $O(\sqrt{T \cdot \mathrm{Var}})$. The algorithm combines exact KL-divergence modeling with Maillard-type sampling probabilities and variance-aware exploration. Whereas existing Thompson Sampling (TS)- and Upper Confidence Bound (UCB)-based algorithms each satisfy only a subset of these criteria, expklms reconciles previously conflicting objectives, namely asymptotic efficiency, finite-time robustness, and variance adaptation, within a single method.

📝 Abstract
We study the problem of Multi-Armed Bandits (MAB) with reward distributions belonging to a One-Parameter Exponential Distribution (OPED) family. In the literature, several criteria have been proposed to evaluate the performance of such algorithms, including Asymptotic Optimality (A.O.), Minimax Optimality (M.O.), Sub-UCB, and variance-adaptive worst-case regret bound. Thompson Sampling (TS)-based and Upper Confidence Bound (UCB)-based algorithms have been employed to achieve some of these criteria. However, none of these algorithms simultaneously satisfy all the aforementioned criteria. In this paper, we design an algorithm, Exponential Kullback-Leibler Maillard Sampling (abbrev. expklms), that can achieve multiple optimality criteria simultaneously, including A.O., M.O. with a logarithmic factor, Sub-UCB, and variance-adaptive worst-case regret bound.
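The paper does not spell out the sampling rule here, but Maillard-style sampling generally chooses each arm with probability proportional to $\exp(-N_a \cdot \mathrm{KL}(\hat\mu_a, \hat\mu_{\max}))$, where $N_a$ is the arm's pull count and $\hat\mu_a$ its empirical mean. A minimal sketch of that rule for the Bernoulli special case (function names `bern_kl` and `maillard_probs` are illustrative, not from the paper, and this is the generic KL Maillard rule rather than the paper's expklms):

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def maillard_probs(means, counts):
    """Sampling weights: arm a gets weight exp(-N_a * KL(mu_hat_a, mu_hat_max)).

    The empirically best arm has KL = 0 and hence weight 1; suboptimal arms
    are exponentially discounted by how distinguishable they are from it.
    """
    mu_max = max(means)
    weights = [math.exp(-n * bern_kl(m, mu_max)) for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]

# Example: after 100 pulls per arm, the 0.7 arm dominates, the 0.65 arm
# still gets meaningful probability, and the 0.5 arm is nearly ruled out.
probs = maillard_probs([0.5, 0.7, 0.65], [100, 100, 100])
```

Unlike UCB, this yields a fully randomized policy whose action probabilities are available in closed form, which is one reason Maillard-type schemes are attractive for off-policy evaluation.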
Problem

Research questions and friction points this paper is trying to address.

Multi-Armed Bandits optimization
Exponential Kullback-Leibler Maillard Sampling
Simultaneous optimality criteria achievement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exponential Kullback-Leibler Maillard Sampling
Simultaneous optimality criteria achievement
Variance-adaptive worst-case regret bound