🤖 AI Summary
This paper studies the stochastic linear multi-armed bandit problem under heavy-tailed noise. Existing approaches, such as truncation and median-of-means methods, rely on strong noise assumptions (e.g., bounded moments) or restrictive structural constraints (e.g., sparsity), while recent adaptive Huber regression achieves broader applicability but incurs high computational cost due to full historical data storage and per-round passes over all past data. To address these limitations, we propose the first adaptive Huber algorithm embedded within an online mirror descent framework, enabling single-pass, memory-efficient updates: no historical data storage is required, and the per-round computation is only $\widetilde{\mathcal{O}}(1)$. Under minimal noise assumptions, requiring only finite $(1+\epsilon)$-th moments, we establish a variance-aware, near-optimal regret bound of $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, eliminating the structural dependencies inherent in prior methods.
📝 Abstract
We study stochastic linear bandits with heavy-tailed noise. Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits. Nonetheless, these methods rely on specific noise assumptions or bandit structures, limiting their applicability to general settings. The recent work [Huang et al., 2024] develops a soft truncation method via adaptive Huber regression to address these limitations. However, their method suffers from an undesirable computational cost: it requires storing all historical data and performing a full pass over these data at each round. In this paper, we propose a *one-pass* algorithm based on the online mirror descent framework. Our method updates using only the current data at each round, reducing the per-round computational cost from $\widetilde{\mathcal{O}}(t \log T)$ to $\widetilde{\mathcal{O}}(1)$ with respect to the current round $t$ and the time horizon $T$, and achieves a near-optimal and variance-aware regret of order $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, where $d$ is the dimension and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at round $t$.
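To make the one-pass idea concrete, the sketch below performs a single mirror-descent step per round on the Huber loss of the current sample only, with a plain Euclidean mirror map (so the step reduces to gradient descent). This is a minimal illustration, not the paper's algorithm: the step size `eta`, the truncation level `tau`, and the toy heavy-tailed data model are hand-picked assumptions rather than the paper's adaptive choices.

```python
import numpy as np

def huber_grad(theta, x, y, tau):
    """Gradient of the Huber loss at sample (x, y); residuals beyond tau
    are soft-truncated, which limits the influence of heavy-tailed noise."""
    r = y - x @ theta
    return -np.clip(r, -tau, tau) * x

def omd_step(theta, x, y, tau, eta):
    """One-pass update: a single mirror-descent step on the current sample
    only (Euclidean mirror map, i.e. plain gradient descent). No history
    is stored, so the per-round cost is O(d)."""
    return theta - eta * huber_grad(theta, x, y, tau)

# Toy usage: rewards corrupted by Student-t noise with 2 degrees of freedom,
# which has infinite variance but finite (1+eps)-th moment for eps < 1.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5])   # hypothetical ground-truth parameter
theta = np.zeros(2)
for t in range(1, 5001):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)
    y = x @ theta_star + rng.standard_t(df=2)
    eta = 0.5 / np.sqrt(t)           # hand-picked decaying step size
    tau = 2.0 * t ** 0.25            # hand-picked growing truncation level
    theta = omd_step(theta, x, y, tau, eta)
```

Each round touches only the current sample `(x, y)`, in contrast to refitting an adaptive Huber regression on all past data.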