Decentralized Asynchronous Multi-player Bandits

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies the decentralized asynchronous multi-player multi-armed bandit (MP-MAB) problem, in which players dynamically join and leave, operate without a global clock, and cannot directly observe the total number of players, making collision avoidance and state tracking highly challenging. To address this, we propose the first fully asynchronous, synchronization-signal-free decentralized algorithm: it employs uniform exploration to mitigate collisions and persistently probes occupied arms with small probability to implicitly detect player departures. We establish a regret bound of $O(\sqrt{T \log T} + \log T / \Delta^2)$, which is sublinear in the horizon $T$. Experiments confirm its robustness and efficiency in dynamic environments. Our key contributions are: (i) the first asynchronous solution enabling adaptive estimation of the unknown player count, and (ii) conflict-free learning without requiring explicit coordination or global knowledge, yielding a deployable distributed decision-making framework for real-world applications such as cognitive radio and IoT networks.
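The summary describes two mechanisms: uniform random pulls while exploring, and low-probability probes of arms held by other players to notice when one of them leaves. A minimal single-player sketch of one round of this behaviour is below; the function name, the `probe_prob` parameter, and the explore/exploit flag are illustrative assumptions rather than the paper's exact procedure.

```python
import random

def choose_arm(rng: random.Random, K: int, exploring: bool,
               my_arm: int, occupied: set[int], probe_prob: float = 0.05) -> int:
    """One round of a single player's arm choice; a hedged sketch, not the paper's algorithm.

    K          : number of arms
    exploring  : True while the player is still in its exploration phase
    my_arm     : arm the player currently exploits (ignored while exploring)
    occupied   : arms this player believes are exploited by other players
    probe_prob : small probability of probing an occupied arm to detect departures
    """
    if exploring:
        # Uniform exploration keeps the chance that two exploring players
        # choose the same arm (and collide) small.
        return rng.randrange(K)

    # With small probability, probe an arm believed to be held by another player.
    # If the probe does not collide, that player has likely left the system.
    if occupied and rng.random() < probe_prob:
        return rng.choice(sorted(occupied))

    # Otherwise keep exploiting the arm this player has settled on.
    return my_arm
```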

📝 Abstract
In recent years, multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their wide applications in cognitive radio networks and Internet of Things systems. While most existing research on MP-MAB focuses on synchronized settings, real-world systems are often decentralized and asynchronous, where players may enter or leave the system at arbitrary times, and do not have a global clock. This decentralized asynchronous setting introduces two major challenges. First, without a global time, players cannot implicitly coordinate their actions through time, making it difficult to avoid collisions. Second, it is important to detect how many players are in the system, but doing so may cost a lot. In this paper, we address the challenges posed by such a fully asynchronous setting in a decentralized environment. We develop a novel algorithm in which players adaptively change between exploration and exploitation. During exploration, players uniformly pull their arms, reducing the probability of collisions and effectively mitigating the first challenge. Meanwhile, players continue pulling arms currently exploited by others with a small probability, enabling them to detect when a player has left, thereby addressing the second challenge. We prove that our algorithm achieves a regret of $\mathcal{O}(\sqrt{T \log T} + \log T/\Delta^2)$, where $\Delta$ is the minimum expected reward gap between any two arms. To the best of our knowledge, this is the first efficient MP-MAB algorithm in the asynchronous and decentralized environment. Extensive experiments further validate the effectiveness and robustness of our algorithm, demonstrating its applicability to real-world scenarios.
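Restating the abstract's bound in clean notation, with $\Delta$ as defined there (the minimum expected reward gap between any two arms; $\mu_i$ denotes arm $i$'s expected reward and is introduced here only for this restatement):

```latex
% Gap between the closest pair of arms, as described in the abstract
\Delta = \min_{i \neq j} \lvert \mu_i - \mu_j \rvert,
\qquad
% Regret guarantee claimed by the paper
R(T) = \mathcal{O}\!\left( \sqrt{T \log T} + \frac{\log T}{\Delta^{2}} \right).
```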
Problem

Research questions and friction points this paper is trying to address.

Addresses decentralized asynchronous multi-player bandit coordination challenges
Develops an algorithm to minimize collisions without global synchronization
Enables dynamic player detection and efficient resource allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Players adaptively switch between exploration and exploitation phases
Uniform arm pulling during exploration reduces collision probability (a rough estimate follows this list)
Small probability arm pulling detects player departures
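As a rough illustration of the collision-avoidance point above (an assumption for intuition, not a quantity taken from the paper): if $N$ players each pull one of $K$ arms independently and uniformly at random, a fixed player collides with probability $1 - (1 - 1/K)^{N-1}$, which stays small whenever $K$ is much larger than $N$.

```python
def uniform_collision_prob(num_players: int, num_arms: int) -> float:
    """Chance that a fixed player collides with at least one other player when
    everyone pulls independently and uniformly at random in a single round.
    Back-of-the-envelope illustration only; not a bound from the paper."""
    return 1.0 - (1.0 - 1.0 / num_arms) ** (num_players - 1)

# Example: 5 players sharing 50 arms collide with a given player
# in under 8% of rounds.
print(f"{uniform_collision_prob(5, 50):.3f}")  # 0.078
```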
👥 Authors
Jingqi Fan (Northeastern University, China)
Canzhe Zhao (Shanghai Jiao Tong University)
Shuai Li (Shanghai Jiao Tong University)
Siwei Wang (National University of Defense Technology)