🤖 AI Summary
This work addresses the challenge of simultaneously achieving minimax-optimal worst-case regret ($\widetilde{O}(\sqrt{T} + \sqrt{D})$) and near-constant regret ($\widetilde{O}(1)$) relative to a prescribed “safe” baseline policy in adversarial multi-armed bandits, with or without feedback delays. To this end, the authors propose the Prudent-Banker algorithm, which integrates a delay-adapted variant of Online Mirror Descent, a modified phased-aggression mechanism, and a delay-calibrated restart threshold. To the authors' knowledge, it is the first algorithm to attain the optimal trade-off between safety and robustness in delayed adversarial environments. The guarantees are complemented by matching lower bounds (up to logarithmic factors) and by experiments across diverse delay distributions.
📝 Abstract
We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety-robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.
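To make the delayed-feedback setting concrete, the sketch below runs plain EXP3 (Online Mirror Descent with a negative-entropy regulariser) when the loss of round t only becomes available at the end of round t + d_t. It is not Prudent-Banker: the phased-aggression mechanism and the delay-calibrated restart threshold are omitted, since the abstract does not specify them. The function name `delayed_exp3`, its parameters, and the fixed learning rate `eta` are assumptions made for illustration only.

```python
import math
import random


def delayed_exp3(losses, delays, eta, seed=0):
    """Illustrative EXP3 with delayed bandit feedback (not Prudent-Banker).

    losses: list of per-round loss vectors with entries in [0, 1].
    delays: list of non-negative integer delays; the loss observed for
            round t is processed at the end of round t + delays[t].
    eta:    fixed learning rate (an assumption; no tuning is claimed here).
    Returns (list of arms played, total loss incurred).
    """
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K   # cumulative importance-weighted loss estimates
    pending = {}          # arrival round -> list of (arm, prob, loss)
    played, total = [], 0.0

    for t in range(T):
        # Exponential-weights distribution; subtracting the minimum
        # keeps the exponentials numerically stable.
        base = min(cum_est)
        weights = [math.exp(-eta * (c - base)) for c in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]

        arm = rng.choices(range(K), weights=probs)[0]
        played.append(arm)
        total += losses[t][arm]

        # The observation for round t is queued until round t + delays[t].
        arrival = t + delays[t]
        pending.setdefault(arrival, []).append((arm, probs[arm], losses[t][arm]))

        # Apply every observation that arrives at the end of this round,
        # weighted by the probability used when the arm was actually played.
        for a, p_a, loss in pending.pop(t, []):
            cum_est[a] += loss / p_a

    return played, total
```

A possible usage, with arm 0 always incurring loss 0 and arm 1 always incurring loss 1 under a constant delay of 5 rounds: `played, total = delayed_exp3([[0.0, 1.0]] * 2000, [5] * 2000, eta=0.05)`. Feedback whose arrival round falls beyond the horizon is simply never used, which mirrors the unobserved feedback the abstract refers to.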