AI Summary
This work addresses the tension between replicability and exploration efficiency in stochastic multi-armed and linear bandits by proposing the first optimism-based replicable algorithmic framework. Using a replicable ridge regression estimator (RepRidge) and a batched UCB strategy, the authors design RepUCB for the multi-armed setting and RepLinUCB for the linear setting. When two runs share internal randomness but observe independent rewards, both algorithms produce identical action sequences with probability at least 1 − ρ, and RepLinUCB does so without discretizing the action set. RepUCB attains regret O((K² log² T / ρ²) Σ_{a: Δ_a > 0} (Δ_a + log(KT log T)/Δ_a)), while RepLinUCB achieves regret Õ((d + d³/ρ)√T), improving on the best previously known bound by a factor of O(d/ρ) and reducing the dependence on both the dimension d and the replicability parameter ρ.
Abstract
We study replicable algorithms for stochastic multi-armed bandits (MAB) and linear bandits with exploration based on upper confidence bounds (UCB). A bandit algorithm is $\rho$-replicable if two executions that use shared internal randomness but independent reward realizations produce the same action sequence with probability at least $1-\rho$. Prior work is primarily elimination-based and, in linear bandits with infinitely many actions, relies on discretization, leading to suboptimal dependence on the dimension $d$ and on $\rho$. We develop optimistic alternatives for both settings. For stochastic multi-armed bandits, we propose RepUCB, a replicable batched UCB algorithm, and show that it attains regret $O\!\left(\frac{K^2\log^2 T}{\rho^2}\sum_{a:\Delta_a>0}\left(\Delta_a+\frac{\log(KT\log T)}{\Delta_a}\right)\right)$. For stochastic linear bandits, we first introduce RepRidge, a replicable ridge regression estimator that satisfies both a confidence guarantee and a $\rho$-replicability guarantee. Beyond its role in our bandit algorithm, this estimator and its guarantees may be of independent interest in other statistical estimation settings. We then use RepRidge to design RepLinUCB, a replicable optimistic algorithm for stochastic linear bandits, and show that its regret is bounded by $\widetilde{O}\!\big(\big(d+\frac{d^3}{\rho}\big)\sqrt{T}\big)$. This improves the best prior regret guarantee by a factor of $O(d/\rho)$, showing that our optimistic algorithm can substantially reduce the price of replicability.
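The replicability definition above can be made concrete with a small experiment. The Python sketch below is not RepUCB or RepRidge. It is a toy, assuming a single decision (a rounded mean estimate) stands in for an action, and assuming randomized rounding to a randomly shifted grid as the source of shared internal randomness. All names (`rounded_mean`, `run_once`, `agreement_rate`) and all parameter values are hypothetical and chosen only for illustration. Two runs share the grid offset but draw independent rewards, and the script estimates how often their outputs coincide, which is the quantity that $\rho$-replicability lower-bounds by $1-\rho$.

```python
import random


def rounded_mean(samples, grid_width, offset):
    """Snap the sample mean to the nearest point of the grid {offset + k * grid_width}."""
    mean = sum(samples) / len(samples)
    k = round((mean - offset) / grid_width)
    return offset + k * grid_width


def run_once(shared_seed, reward_seed, n=2000, true_mean=0.5, grid_width=0.2):
    """One execution: internal randomness comes from shared_seed, rewards from reward_seed."""
    shared = random.Random(shared_seed)    # internal randomness, shared across the two runs
    rewards = random.Random(reward_seed)   # reward noise, independent across the two runs
    offset = shared.uniform(0.0, grid_width)
    samples = [true_mean + rewards.gauss(0.0, 1.0) for _ in range(n)]
    return rounded_mean(samples, grid_width, offset)


def agreement_rate(trials=200):
    """Fraction of trials in which two runs with shared internal randomness agree exactly."""
    agree = 0
    for t in range(trials):
        a = run_once(shared_seed=t, reward_seed=2 * t + 10_000)
        b = run_once(shared_seed=t, reward_seed=2 * t + 10_001)
        agree += (a == b)
    return agree / trials


if __name__ == "__main__":
    print(f"empirical agreement rate: {agreement_rate():.3f}")
```

In this toy, widening the grid or increasing `n` raises the agreement rate at the cost of a coarser or more expensive estimate, which loosely mirrors the trade-off between replicability and regret that the abstract describes.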