🤖 AI Summary
This paper studies a multi-armed bandit (MAB) problem where strategic agents serve as "arms": each agent can manipulate its reported rewards and incurred costs, necessitating an incentive-compatible mechanism that elicits truthful, high-performance behavior while ensuring robust performance under non-equilibrium behavior (e.g., irrationality or deviations). To this end, we propose the first MAB framework jointly guaranteeing incentive compatibility and non-equilibrium robustness. We identify a key structural property enabling both objectives to be achieved synergistically, and integrate insights from second-price auctions to handle settings with no prior knowledge of arm qualities. Theoretically, our algorithm yields a non-vacuous lower bound on cumulative reward under arbitrary agent behavior; moreover, even without knowledge of the true arm performances, it achieves an $O(\sqrt{T})$ regret upper bound, substantially improving upon conventional approaches that either ignore incentives or lack robustness guarantees.
📝 Abstract
Motivated by applications such as online labor markets, we consider a variant of the stochastic multi-armed bandit problem where we have a collection of arms representing strategic agents with different performance characteristics. The platform (principal) chooses an agent in each round to complete a task. Unlike the standard setting, when an arm is pulled it can modify its reward, either absorbing it or improving it at the expense of a higher cost. The principal has to solve a mechanism design problem to incentivize the arms to give their best performance. However, since even under an effective mechanism agents may still deviate from rational behavior, the principal wants a robust algorithm that also gives a non-vacuous guarantee on the total accumulated reward under non-equilibrium behavior. In this paper, we introduce a class of bandit algorithms that meet the two objectives of performance incentivization and robustness simultaneously. We do this by identifying a collection of intuitive properties that a bandit algorithm has to satisfy to achieve these objectives. Finally, we show that settings where the principal has no information about the arms' performance characteristics can be handled by combining ideas from second-price auctions with our algorithms.
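To make the interaction model concrete, the following is a minimal sketch of one round of the strategic-arm setting described above: the principal pulls one arm, and that arm alone chooses how much of its raw reward to pass on, trading the delivered reward against its own cost. All class and parameter names here (`StrategicArm`, `effort`, `cost_per_unit`) are illustrative assumptions, not notation from the paper.

```python
import random

class StrategicArm:
    """Hypothetical strategic arm: holds a true mean reward and a cost rate.
    These are assumed, illustrative performance characteristics."""

    def __init__(self, mean_reward, cost_per_unit):
        self.mean_reward = mean_reward      # true (private) performance level
        self.cost_per_unit = cost_per_unit  # cost of exerting effort

    def respond(self, effort):
        """Deliver a fraction `effort` in [0, 1] of the raw reward.
        The arm keeps the absorbed remainder but pays an effort cost,
        so its utility is: absorbed reward minus effort cost."""
        raw = random.gauss(self.mean_reward, 0.1)
        delivered = effort * raw
        utility = (raw - delivered) - self.cost_per_unit * effort
        return delivered, utility

def run_round(arms, chosen_index, efforts):
    """One round of the bandit interaction: the principal pulls
    `chosen_index`; only that arm acts, and the principal observes
    the (possibly manipulated) delivered reward."""
    reward, _utility = arms[chosen_index].respond(efforts[chosen_index])
    return reward
```

The key point the sketch captures is that the principal only observes `delivered`, not `raw`, which is why incentives are needed to make high `effort` the arms' preferred strategy.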