🤖 AI Summary
This paper studies collaborative exploration in multi-agent bandits under two coupled challenges: unknown activation thresholds, where an arm yields a reward only when at least $k$ agents select it simultaneously, and hidden decoy arms that require coordination to activate but yield no true reward. It is the first work to jointly model threshold uncertainty and decoy interference. The authors propose T-Coop-UCB, a decentralized algorithm built on a distributed UCB framework that concurrently learns both the activation threshold and the arms' true reward distributions. By combining joint confidence intervals with an adaptive coalition-formation mechanism, it avoids unproductive collaboration. Theoretical analysis establishes a sublinear regret bound, and experiments show that T-Coop-UCB significantly outperforms baselines in cumulative reward, coordination efficiency, and regret, closely approaching oracle performance and validating the efficacy of joint threshold learning and decoy identification.
📝 Abstract
Cooperative multi-agent systems often face tasks that require coordinated actions under uncertainty. While multi-armed bandit (MAB) problems provide a powerful framework for decentralized learning, most prior work assumes individually attainable rewards. We address the challenging setting where rewards are threshold-activated: an arm yields a payoff only when a minimum number of agents pull it simultaneously, with this threshold unknown in advance. Complicating matters further, some arms are decoys, requiring coordination to activate but yielding no reward, which introduces the new challenge of wasted joint exploration. We introduce Threshold-Coop-UCB (T-Coop-UCB), a decentralized algorithm that enables agents to jointly learn activation thresholds and reward distributions, forming effective coalitions without centralized control. Empirical results show that T-Coop-UCB consistently outperforms baseline methods in cumulative reward, regret, and coordination metrics, achieving near-oracle performance. Our findings underscore the importance of joint threshold learning and decoy avoidance for scalable, decentralized cooperation in complex multi-agent environments.
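The threshold-activated reward model described above can be illustrated with a minimal simulation sketch. This is not the paper's T-Coop-UCB: the single shared coalition, the rule for growing the threshold's lower bound, and all parameter values are simplifying assumptions made for illustration. It only shows the core mechanics: an arm pays off solely when the coalition size reaches an unknown threshold, decoy arms activate but never pay, and a UCB index steers exploration.

```python
import math
import random

def simulate(n_agents=6, true_threshold=3, means=(0.9, 0.6, 0.0, 0.0),
             horizon=2000, seed=0):
    """Illustrative sketch (not the paper's algorithm): all agents act as one
    coalition, pick the UCB-best arm, and send the smallest coalition size not
    yet ruled out. A failed activation raises the threshold's lower bound."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms      # activations observed per arm
    sums = [0.0] * n_arms      # total reward observed per arm
    thresh_lb = 1              # learned lower bound on the activation threshold
    total_reward = 0.0
    for t in range(1, horizon + 1):
        def ucb(a):
            # Unexplored arms get priority; otherwise mean + confidence width
            if counts[a] == 0:
                return float("inf")
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        arm = max(range(n_arms), key=ucb)
        coalition = min(thresh_lb, n_agents)
        if coalition >= true_threshold:
            # Arm activates; decoy arms (mean 0) activate but never pay
            r = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += r
            total_reward += r
        else:
            # No activation observed: the threshold must exceed this size
            thresh_lb = coalition + 1
    return thresh_lb, total_reward, counts
```

Under this simplified rule the threshold's lower bound converges to the true value after at most `true_threshold - 1` failed rounds, after which play reduces to standard UCB over the activated arms; the actual algorithm instead learns the threshold and rewards jointly from decentralized statistics.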