🤖 AI Summary
This paper introduces a novel problem in strategic multi-armed bandits: *identifying the arm with the second-highest true mean reward*. Arms are self-interested agents that may misreport rewards to maximize their utility, while the learner aims to estimate the second-highest true mean over $T$ rounds while minimizing regret. The authors propose the first **incentive-compatible mechanism without liability constraints**, compelling all arms to truthfully report and fully disclose their realized rewards—eliminating reliance on debt guarantees required by prior mechanisms. Theoretical analysis establishes a problem-dependent regret bound of $O(\log T / \Delta)$ and a worst-case bound of $O(\sqrt{T \log T})$; asymptotically, the learner almost surely converges to the second-highest true mean reward. This work pioneers the integration of mechanism design with suboptimal (ordinal) objective identification, establishing a new paradigm for robust ordinal learning in strategic environments.
📝 Abstract
We consider the classical multi-armed bandit problem, but with strategic arms. In this context, each arm is characterized by a bounded-support reward distribution and strategically aims to maximize its own utility by potentially retaining a portion of its reward and disclosing only a fraction of it to the learning agent. This scenario unfolds as a game over $T$ rounds, leading to a competition of objectives between the learning agent, aiming to minimize its regret, and the arms, motivated by the desire to maximize their individual utilities. To address these dynamics, we introduce a new mechanism that establishes an equilibrium wherein each arm behaves truthfully and discloses as much of its rewards as possible. With this mechanism, the agent can attain the second-highest average (true) reward among arms, with a cumulative regret bounded by $O(\log(T)/\Delta)$ (problem-dependent) or $O(\sqrt{T\log(T)})$ (worst-case).
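To make the strategic-reporting setup concrete, here is a minimal simulation sketch of the problem setting only, not the paper's mechanism. The arm means, retention fractions, reward noise, and sample count below are all invented for illustration; it shows why naive empirical means of disclosed rewards are biased when arms retain part of their reward.

```python
import random

# Illustrative sketch of the strategic-bandit setting -- NOT the
# paper's mechanism. All numbers below are hypothetical.
TRUE_MEANS = [0.9, 0.7, 0.5]   # hypothetical true mean rewards
RETAIN = [0.3, 0.0, 0.1]       # fraction each strategic arm keeps

def pull(arm: int) -> float:
    """Draw a bounded reward in [0, 1] around the arm's true mean."""
    return min(1.0, max(0.0, random.gauss(TRUE_MEANS[arm], 0.05)))

def observed_reward(arm: int) -> float:
    """What the learner sees: the arm discloses only a fraction."""
    return (1 - RETAIN[arm]) * pull(arm)

random.seed(0)
# Empirical means of disclosed rewards are biased by retention:
# arm 0 (true mean 0.9) looks worse than arm 1 (true mean 0.7).
est = [sum(observed_reward(a) for _ in range(2000)) / 2000
       for a in range(3)]
second_best_true = sorted(TRUE_MEANS)[-2]
print(est, second_best_true)
```

Under these assumed retention rates, the best arm's disclosed average (about $0.63$) falls below the second-best arm's (about $0.70$), which is exactly the distortion an incentive-compatible mechanism must eliminate so the learner can target the second-highest *true* mean.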