🤖 AI Summary
This paper studies the distributed multi-player multi-armed bandit (MMAB) problem, where decentralized players independently select arms; collisions yield zero reward for all colliding players, and each observes only its own action and a collision indicator. To address this challenge, we propose the first distributed algorithm achieving near-optimal group regret and individual regret simultaneously. Our approach introduces an adaptive, lightweight communication protocol with only $\mathcal{O}(\log\log T)$ communication overhead. We further extend theoretical guarantees to periodic asynchronous settings—establishing the first such result—and provide a matching lower bound. The algorithm employs a decentralized exploration-exploitation trade-off mechanism driven solely by collision feedback, requiring no global information or explicit coordination. Experiments demonstrate logarithmic regret in both synchronous and asynchronous regimes, with individual regret substantially outperforming existing state-of-the-art methods.
📝 Abstract
We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. A collision occurs when two or more players select the same arm, resulting in zero reward for each of them; collisions are observed only by the players involved. We consider a distributed setting without central coordination, where each player observes only its own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only $\mathcal{O}(\log\log T)$. Our experiments demonstrate significant performance improvements over existing baselines; compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving a lower bound for this setting and presenting an algorithm that achieves logarithmic regret.
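The collision model above can be made concrete with a minimal sketch of one round of the environment. This is an illustrative simulation, not the paper's algorithm: the Bernoulli reward model and the function name `play_round` are assumptions for exposition. It shows the key feedback structure: colliding players get zero reward and see only a collision flag, while a player alone on an arm draws a stochastic reward.

```python
import random

def play_round(arm_means, choices, rng=random):
    """One round of the stochastic MMAB collision model (illustrative sketch).

    arm_means: list of Bernoulli reward means, one per arm (assumed model).
    choices:   choices[i] is the arm index selected by player i.
    Returns (rewards, collisions): per-player reward and collision indicator.
    """
    # Count how many players landed on each arm.
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1

    rewards, collisions = [], []
    for arm in choices:
        if counts[arm] > 1:
            # Two or more players on the same arm: collision, zero reward,
            # and each colliding player observes only this indicator.
            rewards.append(0.0)
            collisions.append(True)
        else:
            # Sole occupant draws a Bernoulli reward from the arm's mean.
            rewards.append(1.0 if rng.random() < arm_means[arm] else 0.0)
            collisions.append(False)
    return rewards, collisions
```

For example, with three players choosing arms `[0, 0, 2]`, players 0 and 1 collide and receive zero reward, while player 2 draws a reward from arm 2 alone.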