AI Summary
This paper studies covert optimal-arm identification in multi-armed bandits under observer surveillance: each arm has both a public and a private reward distribution; the agent observes both, whereas the observer sees only the public rewards and the pull sequence, and expects the agent to follow Thompson sampling. The agent aims to rapidly identify the privately optimal arm while minimizing detectability. We model detectability via KL divergence, proving that the number of pulls of publicly suboptimal arms is bounded by Θ(√T). We derive a mean-driven maximin error bound, incorporating both the public and private means, that characterizes the optimal identification exponent. Inspired by top-two sampling, we design an adaptive covert exploration algorithm whose exploration intensity scales dynamically with the public suboptimality gap. We prove the algorithm achieves the optimal covert rate, and empirical results confirm its Θ(√T) pull behavior and robustness against detection.
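The detectability model above compares, at each step, the agent's actual pull distribution with the one the observer anticipates under Thompson sampling. A minimal sketch of that stepwise KL quantity, with illustrative pull distributions and perturbation sizes of my own choosing (not the paper's):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between two discrete pull distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical pull probabilities the observer expects from Thompson
# sampling over 3 arms.
q = [0.7, 0.2, 0.1]

# The covert agent shifts mass eps from the publicly best arm onto a
# privately interesting arm and pays a KL (detectability) cost for it.
for eps in (0.01, 0.05, 0.1):
    p = [0.7 - eps, 0.2, 0.1 + eps]
    print(f"eps={eps}: KL={kl(p, q):.6f}")
```

The cost grows roughly quadratically in the perturbation, which is the intuition behind why a fixed detectability budget only affords on the order of √T extra pulls of publicly suboptimal arms.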
Abstract
We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards; however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer sees the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the pull probabilities anticipated by the observer. We model successful pulls of publicly suboptimal arms as a Bernoulli process in which the success probability decreases with each successful pull, and show these pulls can occur at most at a $\Theta(\sqrt{T})$ rate under the KL constraint. We then formulate a maximin problem based on the public and private means, whose solution characterizes the optimal error exponent for best private arm identification. Finally, we propose an algorithm inspired by top-two algorithms; it naturally adapts its exploration to the hardness of pulling each arm, as measured by the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T})$ rate and the behavior of the proposed algorithm.
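The $\Theta(\sqrt{T})$ rate can be seen in a toy simulation of the Bernoulli model described above: if the success probability of a suboptimal pull decays inversely with the number of successes so far, the success count grows like $\sqrt{T}$. The decay schedule $c/(k+1)$ here is an illustrative choice, not the paper's exact model:

```python
import math
import random

def simulate_suboptimal_pulls(T, c=1.0, seed=0):
    """Simulate T steps of a Bernoulli process whose success probability
    c/(k+1) shrinks with the number of successes k so far; returns the
    final success count (successful pulls of publicly suboptimal arms)."""
    rng = random.Random(seed)
    k = 0
    for _ in range(T):
        if rng.random() < c / (k + 1):
            k += 1
    return k

T = 200_000
k = simulate_suboptimal_pulls(T)
# Heuristically, dk/dt ~ c/k gives k ~ sqrt(2cT).
print(k, math.sqrt(2 * T))
```

Doubling $T$ should multiply the count by roughly $\sqrt{2}$, matching the sublinear pull rate the paper's numerical examples illustrate.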