Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the distributed multi-player multi-armed bandit (MMAB) problem, where decentralized players independently select arms; collisions yield zero reward for all colliding players, and each player observes only its own action and a collision indicator. To address this challenge, we propose the first distributed algorithm achieving near-optimal group regret and individual regret simultaneously. Our approach introduces an adaptive, lightweight communication protocol with only $\mathcal{O}(\log\log T)$ communication overhead. We further extend the theoretical guarantees to periodic asynchronous settings, establishing the first such result, and provide a matching lower bound. The algorithm employs a decentralized exploration-exploitation trade-off mechanism driven solely by collision feedback, requiring no global information or explicit coordination. Experiments demonstrate logarithmic regret in both synchronous and asynchronous regimes, with individual regret substantially outperforming existing state-of-the-art methods.
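The collision model described above is simple to state concretely: every player picks an arm, players who share an arm all get zero reward, and each player sees only its own reward and a collision bit. A minimal simulation sketch (function and variable names here are illustrative, not from the paper):

```python
import random

def mmab_step(arm_means, choices, rng=random):
    """One round of the MMAB collision model: players choosing the same
    arm all receive zero reward; each player observes only its own
    reward and a collision indicator."""
    counts = {}
    for a in choices:
        counts[a] = counts.get(a, 0) + 1
    rewards, collisions = [], []
    for a in choices:
        collided = counts[a] > 1
        collisions.append(collided)
        if collided:
            rewards.append(0.0)  # collision: no reward for anyone on this arm
        else:
            # collision-free pull: Bernoulli reward with the arm's mean
            rewards.append(1.0 if rng.random() < arm_means[a] else 0.0)
    return rewards, collisions

# Players 0 and 1 collide on arm 0; player 2 pulls arm 1 alone.
r, c = mmab_step([0.9, 0.5], [0, 0, 1], rng=random.Random(0))
```

Note that the collision indicator is the only coordination signal available: no player ever sees another player's choice or reward.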

📝 Abstract
We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only $\mathcal{O}(\log\log T)$. Our experiments demonstrate significant performance improvements over existing baselines. Compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving a lower bound for this problem and presenting an algorithm that achieves logarithmic regret.
Problem

Research questions and friction points this paper is trying to address.

Distributed multi-agent bandit learning without central coordination
Addressing collision avoidance in multiplayer multi-armed bandits
Achieving near-optimal regret with minimal communication overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed algorithm with adaptive communication protocol
Achieves near-optimal regret with only $\mathcal{O}(\log\log T)$ communication cost
Extends to asynchronous setting with logarithmic regret guarantee
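The paper's specific protocol is not detailed on this page, but the innovation list highlights coordination driven purely by collision feedback. As a rough, generic illustration of how collision feedback alone can orthogonalize players onto distinct arms, here is a musical-chairs style sketch (this is a standard MMAB warm-up idea, not necessarily the paper's mechanism; all names are illustrative):

```python
import random

def orthogonalize(num_players, num_arms, max_rounds=1000, seed=0):
    """Musical-chairs style coordination using only collision feedback:
    each unsettled player samples a uniform random arm; a collision-free
    pull 'settles' the player on that arm. Assumes num_players <= num_arms."""
    rng = random.Random(seed)
    settled = [None] * num_players  # arm each player has claimed, if any
    for t in range(max_rounds):
        choices = [settled[i] if settled[i] is not None
                   else rng.randrange(num_arms) for i in range(num_players)]
        counts = {}
        for a in choices:
            counts[a] = counts.get(a, 0) + 1
        for i in range(num_players):
            # A collision-free pull is the only signal a player needs
            # to know its arm is currently unclaimed.
            if settled[i] is None and counts[choices[i]] == 1:
                settled[i] = choices[i]
        if all(s is not None for s in settled):
            return settled, t + 1
    return settled, max_rounds

assignment, rounds = orthogonalize(num_players=3, num_arms=5)
```

Each player runs this loop independently, never observing the others' actions, which matches the "no global information or explicit coordination" setting described above.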
Authors
Daoyuan Zhou (School of Artificial Intelligence, Nanjing University, China)
Xuchuang Wang (UMass Amherst)
Lin Yang (School of Artificial Intelligence, Nanjing University, China)
Yang Gao (School of Artificial Intelligence, Nanjing University, China)