High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions

📅 2024-10-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the contextual multi-armed bandit problem with cross-learning in an adversarial environment, where contexts are drawn i.i.d. from an unknown distribution. The adversary may impose arbitrary losses, and no prior knowledge of the context distribution is available. The authors give a new analysis of the algorithm of Schneider and Zimmert (2023) for this setting. The key contribution is the first high-probability near-optimality guarantee for that algorithm: by characterizing the weak dependency structure between epochs, the authors refine classical martingale inequalities, overcoming the limitation of prior analyses, which yield only expected-regret bounds. The result is a high-probability regret bound of $O(\sqrt{T}\log T)$, which strengthens existing expected-regret guarantees and requires no assumptions or prior information about the context distribution.
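The high-probability analysis rests on martingale concentration. A standard starting point in the bandit literature is Freedman's inequality, stated below for orientation; the paper notes that inequalities of this standard form are not directly applicable to its setting and must be refined.

```latex
% Freedman's inequality (standard form). Let $X_1,\dots,X_T$ be a
% martingale difference sequence with respect to a filtration
% $(\mathcal{F}_t)_t$, with $|X_t| \le b$ almost surely. Then for any
% $a > 0$ and $V > 0$,
\[
  \Pr\!\left[\,\sum_{t=1}^{T} X_t \ge a
     \ \text{and}\ \sum_{t=1}^{T} \mathbb{E}\!\left[X_t^2 \mid \mathcal{F}_{t-1}\right] \le V
  \right]
  \le \exp\!\left(-\frac{a^2}{2\left(V + ba/3\right)}\right).
\]
```

Bounds of this type convert a control on the predictable quadratic variation into a high-probability deviation bound, which is the general mechanism behind upgrading an expected-regret guarantee to one that holds with high probability.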

📝 Abstract
Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross-learning, where the learner observes the loss associated with the chosen action across all possible contexts, not just the current round's context. Our focus is on a setting where losses are chosen adversarially and contexts are sampled i.i.d. from a fixed distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider and Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well known that expected-regret bounds can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. Several steps in the original analysis of Schneider and Zimmert (2023) inherently yield only an expected bound. Our analysis introduces several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine them to complete our analysis.
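To make the cross-learning setting concrete, here is a minimal, illustrative sketch of an exponential-weights learner that exploits cross-learned losses while estimating the unknown context distribution from empirical counts. This is a simplified toy, not the paper's algorithm: the epoch structure, the exact estimators, and the constants are omitted, and the random losses stand in for the adversarial ones analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, T, eta = 4, 3, 2000, 0.05  # arms, contexts, horizon, learning rate

# One exponential-weights instance per context (illustrative simplification).
weights = np.ones((C, K))
# Empirical context counts: a plug-in estimate of the unknown distribution.
mu_counts = np.ones(C)

true_mu = np.array([0.5, 0.3, 0.2])  # hidden from the learner
for t in range(T):
    c_t = rng.choice(C, p=true_mu)   # context drawn i.i.d., distribution unknown
    probs = weights / weights.sum(axis=1, keepdims=True)
    a_t = rng.choice(K, p=probs[c_t])

    losses = rng.random((C, K))      # adversarial in the paper; random in this demo
    # Cross-learning: the loss of the played arm is observed for *every* context.
    observed = losses[:, a_t]

    # Marginal play probability of a_t under the *estimated* context distribution.
    mu_hat = mu_counts / mu_counts.sum()
    pi_hat = float(mu_hat @ probs[:, a_t])

    # Importance-weighted loss estimate shared across all context instances.
    est = observed / max(pi_hat, 1e-12)
    weights[:, a_t] *= np.exp(-eta * est)
    weights /= weights.sum(axis=1, keepdims=True)  # renormalize for stability
    mu_counts[c_t] += 1
```

The point of the sketch is the importance weight: with cross-learning, the estimator divides by the marginal probability of the played arm under the context distribution, which is exactly the quantity that must be estimated when that distribution is unknown; errors in `mu_hat` are what couple the epochs and complicate the high-probability analysis.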
Problem

Research questions and friction points this paper is trying to address.

Multi-Armed Bandit Problems
Adversarial Settings
Optimal Strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Armed Bandit Problems
Cross-Learning Context
Near-Optimal Strategy