The Extended UCB Policies for Frequentist Multi-armed Bandit Problems

📅 2011-12-08

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Classical UCB algorithms require strong moment conditions (e.g., bounded fourth moments) and thus lose their guarantees under the heavy-tailed reward distributions commonly encountered in finance and recommendation systems. Method: We propose the Generalized Robust UCB (GR-UCB) framework, which extends robust UCB to settings where only the *p*-th and *q*-th moments exist for arbitrary $1 < q < p$, removing the restrictive prior assumption $p=4$, $q=2$. GR-UCB does not need a known upper bound on high-order moments; it requires only that the two moments exist and stand in a known controlled relationship. It combines truncated mean estimation, adaptive moment control, and a refined confidence interval construction. Results: Using sharp concentration inequalities, we establish an $O(\log T)$ cumulative regret bound, which matches the optimal growth order. GR-UCB thereby broadens the range of heavy-tailed environments in which UCB-type algorithms carry regret guarantees.

📝 Abstract

The multi-armed bandit (MAB) problem is a widely studied model in operations research for sequential decision making and reinforcement learning. This paper considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [21]. These earlier policies require either a known upper bound on specific moments of the reward distributions or the existence of a particular moment, both of which can be hard to obtain or guarantee in practice. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p$ and $q$, as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$. This broadens the applicability of UCB policies to heavy-tailed reward distributions.
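
For orientation, here is a minimal Python sketch of the kind of policy the abstract builds on: a truncated-mean robust UCB in the style of Bubeck et al. [5]. It is not the extended policy of this paper. It assumes a known bound `u` on the $(1+\varepsilon)$-th raw moment of every arm, which is exactly the requirement the abstract describes as hard to meet in practice. The function names, the parameters `u` and `eps`, the constant 4 in the confidence width, and the choice $\delta = t^{-2}$ are illustrative assumptions based on the standard presentation of that estimator.

```python
import math
import random


def truncated_mean(samples, u, eps, delta):
    """Truncated empirical mean (assumed form): the s-th sample is kept
    only if |x_s| <= (u * s / log(1/delta)) ** (1 / (1 + eps))."""
    log_term = math.log(1.0 / delta)
    total = 0.0
    for s, x in enumerate(samples, start=1):
        threshold = (u * s / log_term) ** (1.0 / (1.0 + eps))
        if abs(x) <= threshold:
            total += x
    return total / len(samples)


def robust_ucb_index(samples, u, eps, t):
    """Upper confidence index at round t (t >= 2), using delta = t ** -2."""
    n = len(samples)
    delta = t ** -2.0
    exponent = eps / (1.0 + eps)
    width = 4.0 * u ** (1.0 / (1.0 + eps)) * (math.log(1.0 / delta) / n) ** exponent
    return truncated_mean(samples, u, eps, delta) + width


def run_robust_ucb(arms, horizon, u, eps):
    """Play `horizon` rounds; `arms` is a list of zero-argument callables,
    each returning one reward. Returns the number of pulls per arm."""
    history = [[arm()] for arm in arms]  # pull every arm once
    for t in range(len(arms) + 1, horizon + 1):
        indices = [robust_ucb_index(h, u, eps, t) for h in history]
        best = max(range(len(arms)), key=indices.__getitem__)
        history[best].append(arms[best]())
    return [len(h) for h in history]


if __name__ == "__main__":
    random.seed(0)
    # Two hypothetical Pareto arms with shape 2 (infinite variance).
    # Their 1.5-th raw moments are 4 and about 1.41, so u = 4 bounds both.
    arms = [
        lambda: random.paretovariate(2.0),
        lambda: 0.5 * random.paretovariate(2.0),
    ]
    print(run_robust_ucb(arms, horizon=2000, u=4.0, eps=0.5))
```

The confidence width shrinks at rate $n^{-\varepsilon/(1+\varepsilon)}$ rather than $n^{-1/2}$, so weaker moment assumptions cost more exploration per arm. With the loose bound `u = 4` used above, a horizon of a few thousand rounds is too short to separate the two arms reliably.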

Problem

Research questions and friction points this paper is trying to address.

Extends UCB policies for heavy-tailed reward distributions

Generalizes moment conditions to arbitrary p>q>1

Achieves near-optimal regret without distribution knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extended robust UCB for heavy-tailed rewards

Generalizes UCB to arbitrary moment orders

Achieves near-optimal regret without distribution knowledge

🔎 Similar Papers

Exploiting Adjacent Similarity in Multi-Armed Bandit Tasks via Transfer of Reward Samples