Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses adversarial multi-armed bandits, with or without feedback delays, under two simultaneous goals: minimax-optimal worst-case regret ($\widetilde{O}(\sqrt{T} + \sqrt{D})$) and near-constant regret ($\widetilde{O}(1)$) relative to a prescribed “safe” baseline policy. The authors propose the Prudent-Banker algorithm, which combines a delay-adapted variant of Online Mirror Descent, a modified phased-aggression mechanism, and a delay-calibrated restart threshold. According to the authors, it is the first algorithm to attain the optimal safety-robustness trade-off in both the delayed and non-delayed settings. The guarantees are supported by matching lower bounds (up to logarithmic factors) and by experiments across diverse delay distributions.

📝 Abstract

We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.
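To make the two regret notions in the abstract concrete, the following is a minimal sketch of the delayed-feedback setting only. It is not Prudent-Banker: the learner is plain Exp3 (exponential weights with importance-weighted loss estimates) that applies each estimate when its feedback arrives, which is the kind of standard delay-robust baseline the abstract contrasts against. The function name `run_delayed_exp3`, the input format, the step size `eta`, and the choice of a single fixed arm as the "safe" baseline are all assumptions made for illustration.

```python
import math
import random


def run_delayed_exp3(losses, delays, eta, safe_arm, seed=0):
    """Run Exp3 where the loss of round t is observed only at round t + delays[t].

    losses[t][a] is the loss in [0, 1] of arm a at round t (hypothetical format).
    Returns (learner_loss, regret_vs_best_arm, regret_vs_safe_arm).
    """
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    cum_est = [0.0] * K   # cumulative importance-weighted loss estimates
    pending = {}          # arrival round -> list of (arm, probability, loss)
    learner_loss = 0.0

    for t in range(T):
        # Exponential weights over the estimates received so far
        # (shifted by the minimum for numerical stability).
        m = min(cum_est)
        weights = [math.exp(-eta * (c - m)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]

        arm = rng.choices(range(K), weights=probs)[0]
        loss = losses[t][arm]
        learner_loss += loss
        pending.setdefault(t + delays[t], []).append((arm, probs[arm], loss))

        # Apply all feedback arriving at the end of round t.
        # Feedback scheduled for a round >= T is never observed.
        for a, pa, l in pending.pop(t, []):
            cum_est[a] += l / pa

    arm_totals = [sum(losses[t][a] for t in range(T)) for a in range(K)]
    regret_best = learner_loss - min(arm_totals)
    regret_safe = learner_loss - arm_totals[safe_arm]
    return learner_loss, regret_best, regret_safe
```

The third return value is the quantity a safety-aware algorithm aims to keep nearly constant, while the second is the usual worst-case regret. Nothing in this plain Exp3 learner controls the regret against `safe_arm`; that is the gap the abstract describes.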

Problem

Research questions and friction points this paper is trying to address.

adversarial bandits

delayed feedback

safety constraint

baseline regret

minimax regret

Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial bandits

delayed feedback

safety-constrained learning