🤖 AI Summary
This work investigates best-arm identification in multi-armed bandits when the learner does not know whether the rewards are stochastic or adversarial. The authors first show that, in general, no learner can be optimal in both the stochastic and the adversarial setting at once: robustness to adversarial rewards limits the set of stochastic problems on which optimal error rates can be guaranteed. They then give a lower bound on the error probability in stochastic problems for any strategy constrained to be adversarially robust. Building on this, they propose a simple parameter-free algorithm that requires no prior knowledge of the nature of the rewards: in stochastic problems its error probability matches the lower bound up to logarithmic factors, and it remains robust in adversarial ones. The contribution is theoretical, combining an impossibility result, a lower bound for robust strategies, and a matching upper bound for the proposed algorithm.
📝 Abstract
We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
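To make the uniform baseline mentioned in the abstract concrete, here is a minimal Python sketch of a learner that pulls arms uniformly at random and recommends the arm with the largest estimated cumulative reward. The function name `uniform_learner`, the `rewards[t][k]` input layout, and the importance-weighted estimate are assumptions made for illustration; the abstract does not specify these details, and this is not the paper's parameter-free algorithm.

```python
import random


def uniform_learner(rewards, seed=0):
    """Recommend an arm after pulling arms uniformly at random.

    rewards[t][k] is the reward of arm k at round t (hypothetical layout).
    Only the reward of the pulled arm is used in each round, as in the
    bandit setting.
    """
    rng = random.Random(seed)
    n_arms = len(rewards[0])
    estimates = [0.0] * n_arms
    for round_rewards in rewards:
        arm = rng.randrange(n_arms)  # each arm has probability 1 / n_arms
        # Importance weighting (an assumption of this sketch): dividing the
        # observed reward by the pull probability 1 / n_arms keeps the
        # estimate of each arm's cumulative reward unbiased.
        estimates[arm] += n_arms * round_rewards[arm]
    # Recommend the arm with the largest estimated cumulative reward.
    return max(range(n_arms), key=lambda k: estimates[k])


# Toy problem: arm 2 always pays 1, the other arms always pay 0.
toy_rewards = [[0.0, 0.0, 1.0] for _ in range(200)]
recommended = uniform_learner(toy_rewards)
```

Because the sampling distribution never depends on past observations, an adversary cannot exploit it, which fits the abstract's claim about the adversarial setting. The same property means the learner cannot concentrate pulls on promising arms, which is the source of its suboptimality in stochastic problems.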