AI Summary
This paper studies gap-dependent regret minimization for single-pass streaming stochastic multi-armed bandits (MAB) under memory constraints: $n$ arms arrive sequentially, and only $m < n$ arms' statistics can be stored. Addressing the challenges of unknown gaps $\Delta_i$ and the lack of non-asymptotic analysis, we establish the first tight non-asymptotic upper and matching lower bounds that fully characterize the joint dependence on $n$, $m$, the horizon $T$, and $\Delta_i$. Our method introduces adaptive arm elimination and gap-aware sampling, coupled with a piecewise analysis and a carefully constructed hard instance. We derive optimal regret bounds for two memory regimes: when $m \geq 2n/3$, the bound is $\tilde{O}\big((n-m)\, T^{1/(\alpha+1)} \sum_i \Delta_i^{1-2\alpha} / n^{1+1/(\alpha+1)}\big)$; otherwise, it is $\tilde{O}\big(T^{1/(\alpha+1)} \sum_i \Delta_i^{1-2\alpha} / m^{1/(\alpha+1)}\big)$. These results correct the erroneous order dependencies of prior work.
Abstract
We study the problem of minimizing gap-dependent regret for single-pass streaming stochastic multi-armed bandits (MAB). In this problem, the $n$ arms arrive in a stream, and at most $m<n$ arms and their statistics can be stored in memory. We establish tight non-asymptotic regret bounds in terms of all relevant parameters, including the number of arms $n$, the memory size $m$, the number of rounds $T$, and $(\Delta_i)_{i\in[n]}$, where $\Delta_i$ is the gap between the reward mean of the best arm and that of the $i$-th arm. These gaps are not known to the player in advance. Specifically, for any constant $\alpha \ge 1$, we present two algorithms: one applicable for $m \ge \frac{2}{3}n$ with regret at most $O_\alpha\Big(\frac{(n-m)T^{\frac{1}{\alpha+1}}}{n^{1+\frac{1}{\alpha+1}}}\displaystyle\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$, and another applicable for $m<\frac{2}{3}n$ with regret at most $O_\alpha\Big(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}}\displaystyle\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$. We also prove matching lower bounds for both cases by showing that for any constant $\alpha\ge 1$ and any $m\le k<n$, there exists a set of hard instances on which the regret of any algorithm is $\Omega_\alpha\Big(\frac{(k-m+1)T^{\frac{1}{\alpha+1}}}{k^{1+\frac{1}{\alpha+1}}}\sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)$. This is the first tight gap-dependent regret bound for streaming MAB. Prior to our work, an $O\Big(\sum_{i:\Delta_i>0}\frac{\sqrt{T}\log T}{\Delta_i}\Big)$ upper bound for the special case of $\alpha=1$ and $m=O(1)$ was established by Agarwal, Khanna, and Patil (COLT'22). In contrast, our results give the correct order of regret, $\Theta\Big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\Big)$.
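To see how the stated $\Theta\big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\big)$ rate follows, one can substitute $\alpha=1$ into the small-memory upper bound; the sketch below is a worked substitution of the exponents only, not an additional result:

```latex
% With \alpha = 1, the exponents become 1/(\alpha+1) = 1/2 and 1 - 2\alpha = -1, so
\[
O_\alpha\Big(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}}
  \sum_{i:\Delta_i>0}\Delta_i^{1-2\alpha}\Big)
\;\Big|_{\alpha=1}
= O\Big(\frac{\sqrt{T}}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{1}{\Delta_i}\Big)
= O\Big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\Big).
\]
```

Together with the matching lower bound, this pins down the $\alpha=1$ regret at $\Theta\big(\frac{1}{\sqrt{m}}\sum_{i:\Delta_i>0}\frac{\sqrt{T}}{\Delta_i}\big)$, in contrast to the earlier $m$-independent $O\big(\sum_{i:\Delta_i>0}\frac{\sqrt{T}\log T}{\Delta_i}\big)$ bound.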