🤖 AI Summary
This paper studies collaborative learning in heterogeneous multi-agent multi-armed bandits under "sparse hints": agents may query low-cost hints (only in rounds when they do not pull an arm) to improve decision quality, aiming to jointly minimize both the number of hint queries and the time-independent pseudo-regret. We propose the first hint-driven heterogeneous multi-agent bandit framework and design three algorithms: GP-HCLA (centralized), and HD-ETC and EBHD-ETC (decentralized), integrating adaptive hint scheduling, gap-aware exploration, and collision-based decentralized communication. Theoretically, GP-HCLA achieves $O(M^4K)$ pseudo-regret with $O(MK\log T)$ hint queries; HD-ETC and EBHD-ETC both attain $O(M^3K^2)$ pseudo-regret with $O(M^3K\log T)$ hint queries, and matching information-theoretic lower bounds are established for both settings. Extensive simulations validate the theoretical optimality and practical efficacy of the proposed methods.
📝 Abstract
We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, an agent observes the reward of the arm it pulls only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimum necessary number of hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.
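To make the explore-then-commit idea behind the decentralized algorithms concrete, here is a toy single-agent sketch: during an initial hint phase the agent queries a cheap hint for every arm instead of pulling, then commits to the empirically best arm for the rest of the horizon. This assumes Bernoulli rewards and a fixed hint budget, and is an illustrative simplification only; it omits the multi-agent collision signaling, the gap-adaptive stopping rule, and the elimination mechanism of the actual HD-ETC and EBHD-ETC algorithms.

```python
import random

def hinted_etc(means, horizon, hint_rounds, seed=0):
    """Toy hinted explore-then-commit (single agent, Bernoulli arms).

    Phase 1 (hints): for hint_rounds rounds, query one hint per arm
    (no arm is pulled, so no pulling regret is incurred).
    Phase 2 (commit): pull the empirically best arm until the horizon.

    Returns the committed arm and the commit-phase pseudo-regret.
    Hypothetical illustration, not the paper's algorithm.
    """
    rng = random.Random(seed)
    K = len(means)
    sums = [0.0] * K
    counts = [0] * K

    # Hint phase: uniform hint queries across all arms.
    for _ in range(hint_rounds):
        for k in range(K):
            sums[k] += 1.0 if rng.random() < means[k] else 0.0
            counts[k] += 1

    # Commit to the arm with the highest empirical mean.
    best = max(range(K), key=lambda k: sums[k] / counts[k])

    # Pseudo-regret accrues only in the commit phase,
    # and only if the wrong arm was selected.
    commit_len = horizon - hint_rounds
    regret = commit_len * (max(means) - means[best])
    return best, regret
```

With a large gap and enough hints, e.g. `hinted_etc([0.9, 0.1], horizon=1000, hint_rounds=200)`, the agent identifies arm 0 and incurs zero commit-phase pseudo-regret; shrinking the hint budget or the gap makes misidentification, and hence linear-in-horizon regret, more likely, which is the trade-off the hint-query bounds quantify.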