🤖 AI Summary
Classical UCB algorithms require strong moment conditions (e.g., bounded fourth moments) and thus fail under heavy-tailed reward distributions commonly encountered in finance and recommendation systems.
Method: We propose the Generalized Robust UCB (GR-UCB) framework, the first to extend robust UCB to settings where only the $p$-th and $q$-th moments are assumed to exist for arbitrary $1 < q < p$, removing the earlier restriction to $p=4$, $q=2$. GR-UCB does not need a known upper bound on any high-order moment; it assumes only that the two moments exist and have a known controlled relationship. It combines truncated mean estimation, adaptive moment control, and a refined confidence interval construction.
Results: Using sharp concentration inequalities, we establish an $O(\log T)$ cumulative regret bound, which is the asymptotically optimal growth order. GR-UCB thereby broadens the range of heavy-tailed environments in which UCB-type algorithms carry regret guarantees, covering moment conditions that existing robust UCB methods do not handle.
📝 Abstract
The multi-armed bandit (MAB) problem is a widely studied model in operations research for sequential decision making and reinforcement learning. This paper considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, which extends the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [21]. These earlier policies require either a known upper bound on specific moments of the reward distributions or the existence of a particular moment, both of which can be hard to obtain or guarantee in practice. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p$ and $q$, as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$. This broadens the range of heavy-tailed reward distributions to which UCB policies apply.
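To make the idea of truncated mean estimation concrete, here is a minimal Python sketch of a truncated-mean UCB index in the style of the earlier robust UCB of Bubeck et al. [5]. It is not the extended policy introduced in this paper. It assumes a known bound `u` on the $(1+\epsilon)$-th raw moment of every arm, and the constant 4 and the confidence schedule $\delta = t^{-2}$ are taken from that earlier line of work and should be treated as assumptions. All function names are hypothetical.

```python
import math
import random


def truncated_mean(samples, u, eps, delta):
    """Truncated empirical mean (illustrative, after Bubeck et al.).

    The s-th sample (1-indexed) is kept only if
    |x_s| <= (u * s / log(1/delta)) ** (1 / (1 + eps)); otherwise it counts as 0.
    Requires 0 < delta < 1.
    """
    log_term = math.log(1.0 / delta)
    total = 0.0
    for s, x in enumerate(samples, start=1):
        threshold = (u * s / log_term) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            total += x
    return total / len(samples)


def robust_ucb_index(samples, t, u, eps):
    """Truncated mean plus an exploration bonus, with delta = t**-2 (assumed).

    Requires t >= 2 so that log(1/delta) is positive.
    """
    n = len(samples)
    delta = t ** -2.0
    exponent = eps / (1.0 + eps)
    bonus = 4.0 * u ** (1.0 / (1.0 + eps)) * (math.log(1.0 / delta) / n) ** exponent
    return truncated_mean(samples, u, eps, delta) + bonus


def run_robust_ucb(arms, horizon, u, eps, seed=0):
    """Play `horizon` rounds; each arm is a function rng -> reward.

    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    history = [[] for _ in arms]
    pulls = [0] * len(arms)
    for t in range(1, horizon + 1):
        if t <= len(arms):
            choice = t - 1  # initial pass: pull every arm once
        else:
            choice = max(
                range(len(arms)),
                key=lambda i: robust_ucb_index(history[i], t, u, eps),
            )
        history[choice].append(arms[choice](rng))
        pulls[choice] += 1
    return pulls


if __name__ == "__main__":
    # Pareto(shape=3, scale=1) has mean 1.5 and second raw moment 3,
    # while its third moment is infinite. With eps = 1, u = 3 bounds the
    # second raw moment of both arms below.
    arms = [
        lambda rng: rng.paretovariate(3.0),        # mean 1.5
        lambda rng: 0.5 * rng.paretovariate(3.0),  # mean 0.75
    ]
    print(run_robust_ucb(arms, horizon=500, u=3.0, eps=1.0))
```

The sketch recomputes the truncated mean from the full reward history each round, because the truncation threshold depends on the current round through $\delta$. That keeps the code short at the cost of quadratic running time in the horizon.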