Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the multi-armed bandit (MAB) problem under delayed feedback, aiming to unify the treatment of stochastic and adversarial environments within a Best-of-Both-Worlds framework and to achieve near-optimal regret bounds. The authors propose a novel algorithm that, for the first time in the delayed setting, simultaneously matches the known lower bounds in both regimes: it attains the optimal adversarial regret $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, where $K$ is the number of arms, $T$ the horizon, and $D$ the total delay; in the stochastic setting, it achieves the tight bound $\sum_i \frac{\log T}{\Delta_i} + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, improving the second term by a factor of $K$ over prior work. Key technical innovations include adaptive confidence-interval calibration, a delay-aware exploration-exploitation trade-off, and a dual-mode regret analysis. This work closes a long-standing gap between existing stochastic delayed MAB results and their theoretical lower bounds.

📝 Abstract
We study the multi-armed bandit problem with adversarially chosen delays in the Best-of-Both-Worlds (BoBW) framework, which aims to achieve near-optimal performance in both stochastic and adversarial environments. While prior work has made progress toward this goal, existing algorithms suffer from significant gaps to the known lower bounds, especially in the stochastic setting. Our main contribution is a new algorithm that, up to logarithmic factors, matches the known lower bounds in each setting individually. In the adversarial case, our algorithm achieves regret of $\widetilde{O}(\sqrt{KT} + \sqrt{D})$, which is optimal up to logarithmic terms, where $T$ is the number of rounds, $K$ is the number of arms, and $D$ is the cumulative delay. In the stochastic case, we provide a regret bound which scales as $\sum_{i:\Delta_i>0}\left(\log T/\Delta_i\right) + \frac{1}{K}\sum_i \Delta_i \sigma_{\max}$, where $\Delta_i$ is the sub-optimality gap of arm $i$ and $\sigma_{\max}$ is the maximum number of missing observations. To the best of our knowledge, this is the first BoBW algorithm to simultaneously match the lower bounds in both the stochastic and adversarial regimes in the delayed environment. Moreover, even beyond the BoBW setting, our stochastic regret bound is the first to match the known lower bound under adversarial delays, improving the second term over the best known result by a factor of $K$.
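For reference, the two headline bounds from the abstract can be written out in display math. The $R_T$ symbols and the explicit $O(\cdot)$ wrapper on the stochastic bound are notational choices made here for readability, not notation taken from the paper:

```latex
% Adversarial regime: K arms, horizon T, cumulative delay D
\[
  R_T^{\mathrm{adv}} \;=\; \widetilde{O}\!\left(\sqrt{KT} + \sqrt{D}\right)
\]
% Stochastic regime: gaps \Delta_i, maximum number of
% missing observations \sigma_{\max}
\[
  R_T^{\mathrm{sto}} \;=\; O\!\left(\sum_{i:\Delta_i>0}\frac{\log T}{\Delta_i}
    \;+\; \frac{1}{K}\sum_{i}\Delta_i\,\sigma_{\max}\right)
\]
```

The paper's claimed improvement is in the second stochastic term: prior work's bound carried an extra factor of $K$ there.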
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-armed bandit regret with adversarial delays
Bridging performance gaps in stochastic and adversarial settings
Matching lower bounds for delayed feedback in both environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal regret in adversarial delayed feedback
Matching lower bounds in stochastic settings
First BoBW algorithm for delayed environments