🤖 AI Summary
This work addresses decentralized multi-agent coordination over continuous Lipschitz action spaces in which hard collisions yield zero reward. Without any inter-agent communication, the authors propose a maxima-directed search that steers each agent toward a distinct high-reward region while avoiding conflicts; once agents are seated, the problem decomposes into independent single-agent Lipschitz bandits. The resulting framework decouples coordination cost from the time horizon $T$ and accommodates a general distance-threshold collision model. Relying only on Lipschitz continuity, the modular protocol achieves a near-optimal regret bound of $\tilde{O}(T^{(d+1)/(d+2)})$, matching the best known single-player rate.
📝 Abstract
We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, with coordination costs that are independent of the time horizon $T$. We propose a modular protocol that first solves the multi-agent coordination problem -- identifying and seating players on distinct high-value regions via a novel maxima-directed search -- and then decouples the problem into $N$ independent single-player Lipschitz bandits. We establish a near-optimal regret bound of $\tilde{O}(T^{(d+1)/(d+2)})$ plus a $T$-independent coordination cost, matching the single-player rate. To our knowledge, this is the first framework providing such guarantees, and it extends to general distance-threshold collision models.
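Once the coordination phase seats each player on a distinct region, the remaining problem is a standard single-player Lipschitz bandit. A minimal sketch of that final phase, using the classical fixed-discretization baseline (an illustrative assumption; the paper's own learner may differ, and the function name `lipschitz_ucb` is ours): discretize $[0,1]^d$ into $K = \lceil T^{1/(d+2)} \rceil$ bins per dimension and run UCB1 over the bin centers, which for 1-Lipschitz mean rewards attains the $\tilde{O}(T^{(d+1)/(d+2)})$ rate quoted above.

```python
import math
from itertools import product

def lipschitz_ucb(reward_fn, T, d=1):
    """Single-player Lipschitz bandit via uniform discretization + UCB1.

    Classical baseline shown for illustration only (not the paper's exact
    algorithm): the discretization granularity K = ceil(T^{1/(d+2)})
    balances approximation error against the cost of exploring K^d arms,
    yielding O~(T^{(d+1)/(d+2)}) regret for 1-Lipschitz mean rewards.
    """
    K = max(1, math.ceil(T ** (1.0 / (d + 2))))
    # Bin centers of the uniform grid over [0,1]^d form the finite arm set.
    arms = [tuple((i + 0.5) / K for i in idx)
            for idx in product(range(K), repeat=d)]
    pulls = [0] * len(arms)   # pull counts per arm
    sums = [0.0] * len(arms)  # cumulative reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= len(arms):    # pull every arm once first
            a = t - 1
        else:                 # then play the arm maximizing the UCB1 index
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / pulls[i]
                    + math.sqrt(2.0 * math.log(t) / pulls[i]))
        r = reward_fn(arms[a])  # reward_fn may be stochastic
        pulls[a] += 1
        sums[a] += r
        total += r
    return total / T            # average per-round reward collected
```

In the full protocol, each of the $N$ seated players would run such a learner independently on its own region, so no further communication is needed after coordination.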