Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the multi-armed bandit problem under both adversarial and stochastic reward models. The authors propose a novel Follow-The-Perturbed-Leader (FTPL) algorithm in which the perturbations are drawn from an ambiguous distribution that is only known to belong to a given set. Methodologically, they introduce a "principle of optimism in the face of ambiguity," enabling robust decision-making against the unknown perturbation distribution; the framework combines the streamlined analysis of Follow-The-Regularized-Leader (FTRL) with the computational efficiency of FTPL, and recovers a broad range of optimal FTRL algorithms as special cases within a single FTPL framework. The main contributions are: (i) an ambiguity-based robustness model attaining the optimal regret bound $O(\sqrt{KT})$; (ii) a substantial improvement in computational efficiency, up to $10^4\times$ faster than standard FTRL; and (iii) a mapping from FTRL to FTPL that settles existing conjectures on the relationship between the two families of algorithms.
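To make the FTPL mechanics concrete, the sketch below shows a generic perturbed-leader loop for a K-armed bandit: perturb the cumulative loss estimates, play the perturbed leader, and update the played arm with an importance-weighted loss. This is a minimal illustration, not the paper's algorithm; the exponential perturbation distribution, the step size `eta`, and the Monte Carlo estimate of the leader probability are assumptions of the sketch.

```python
# Generic FTPL bandit loop (illustrative sketch, not the paper's algorithm).
import numpy as np

rng = np.random.default_rng(0)

def ftpl_bandit(get_loss, K=5, T=5_000, eta=0.05, mc_samples=200):
    """Run FTPL with exponential perturbations on a K-armed bandit."""
    loss_est = np.zeros(K)  # cumulative importance-weighted loss estimates
    for t in range(T):
        # Perturb the cumulative estimates and follow the perturbed leader.
        z = rng.exponential(size=K)
        arm = int(np.argmin(eta * loss_est - z))

        # Monte Carlo estimate of the probability that `arm` is the leader,
        # needed for the importance-weighted loss estimate below.
        z_mc = rng.exponential(size=(mc_samples, K))
        leaders = np.argmin(eta * loss_est - z_mc, axis=1)
        p_arm = max((leaders == arm).mean(), 1.0 / mc_samples)

        # Bandit feedback: observe only the played arm's loss.
        loss = get_loss(t, arm)
        loss_est[arm] += loss / p_arm
    return loss_est

# Example: stochastic Bernoulli losses; arm 0 has the smallest mean loss.
means = np.array([0.3, 0.5, 0.5, 0.6, 0.7])
final_estimates = ftpl_bandit(lambda t, a: rng.binomial(1, means[a]), K=5)
```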

📝 Abstract
Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as stochastic bandit problems and allow for a streamlined analysis. Nonetheless, FTRL algorithms require the solution of an optimization problem in every iteration and are thus computationally challenging. In contrast, Follow-The-Perturbed-Leader (FTPL) algorithms achieve computational efficiency by perturbing the estimates of the rewards of the arms, but their regret analysis is cumbersome. We propose a new FTPL algorithm that generates optimal policies for both adversarial and stochastic multi-armed bandits. Like FTRL, our algorithm admits a unified regret analysis, and similar to FTPL, it offers low computational costs. Unlike existing FTPL algorithms that rely on independent additive disturbances governed by a \textit{known} distribution, we allow for disturbances governed by an \textit{ambiguous} distribution that is only known to belong to a given set and propose a principle of optimism in the face of ambiguity. Consequently, our framework generalizes existing FTPL algorithms. It also encapsulates a broad range of FTRL methods as special cases, including several optimal ones, which appears to be impossible with current FTPL methods. Finally, we use techniques from discrete choice theory to devise an efficient bisection algorithm for computing the optimistic arm sampling probabilities. This algorithm is up to $10^4$ times faster than standard FTRL algorithms that solve an optimization problem in every iteration. Our results not only settle existing conjectures but also provide new insights into the impact of perturbations by mapping FTRL to FTPL.
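The abstract refers to a bisection routine for computing arm sampling probabilities. The paper's discrete-choice-based routine is not reproduced here; as a stand-in illustration of the same idea, the sketch below applies one-dimensional bisection to the well-known 1/2-Tsallis-entropy FTRL rule (Tsallis-INF), where every sampling probability is a closed-form function of a single normalization variable $x$ and bisection finds the $x$ for which the probabilities sum to one. The probability sum is monotone in $x$, which is what makes a simple bracketing search sufficient.

```python
# Bisection for FTRL-style sampling probabilities (Tsallis-INF stand-in,
# not the paper's optimistic routine): p_i = 4 / (eta * (Lhat_i - x))^2.
import numpy as np

def tsallis_probs(loss_est, eta, tol=1e-10, max_iter=200):
    """Find x < min(loss_est) such that the probabilities sum to one."""
    loss_est = np.asarray(loss_est, dtype=float)
    K = loss_est.size
    l_min = loss_est.min()

    def prob_sum(x):
        return np.sum(4.0 / (eta * (loss_est - x)) ** 2)

    # Bracket the root: at `hi` the sum exceeds 1, at `lo` it falls below 1.
    # (Assumes loss estimates of moderate magnitude for the 1e-9 offset.)
    hi = l_min - 1e-9
    lo = l_min - 2.0 * np.sqrt(K) / eta - 1.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if prob_sum(mid) > 1.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    x = 0.5 * (lo + hi)
    p = 4.0 / (eta * (loss_est - x)) ** 2
    return p / p.sum()  # renormalize away residual bisection error

# Example: five arms; larger cumulative loss -> smaller sampling probability.
print(tsallis_probs([0.0, 1.0, 2.0, 5.0, 10.0], eta=0.5))
```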
Problem

Research questions and friction points this paper is trying to address.

Achieve optimal regret for both adversarial and stochastic multi-armed bandits.
Unify the regret analysis across adversarial and stochastic settings.
Reduce the per-round computational cost of bandit algorithms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

FTPL algorithm guided by a principle of optimism in the face of ambiguity
Handles perturbations drawn from an ambiguous distribution, generalizing existing FTPL and recovering optimal FTRL methods as special cases
Efficient bisection routine for computing the optimistic arm sampling probabilities (a related standard FTPL estimation step is sketched after this list)
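Because FTPL does not yield its arm sampling probabilities in closed form, a common estimation device in the FTPL literature is geometric resampling: redraw fresh perturbations until the played arm would be chosen again, and use the number of redraws as a (truncated) estimate of the importance weight $1/p(\text{arm})$. The sketch below illustrates this standard technique under the same assumptions as the earlier FTPL loop; it is not claimed to be the paper's optimistic sampling procedure.

```python
# Geometric resampling for estimating 1/p(arm) under FTPL (standard technique
# from the FTPL literature, shown as an illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def geometric_resampling_weight(loss_est, eta, arm, cap=1000):
    """Return an estimate of 1/p(arm), truncated at `cap` redraws."""
    for m in range(1, cap + 1):
        z = rng.exponential(size=loss_est.size)
        if int(np.argmin(eta * loss_est - z)) == arm:
            return m      # number of redraws is Geometric(p(arm)), mean 1/p
    return cap            # truncation keeps the variance bounded

# Example: importance weight for arm 3 given cumulative loss estimates.
w = geometric_resampling_weight(np.array([1.0, 2.0, 3.0, 0.5]), eta=0.1, arm=3)
```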