🤖 AI Summary
This work addresses cooperative multi-agent reinforcement learning under submodular joint rewards, where the marginal gain of adding agents diminishes. It establishes the first formal theoretical framework for this setting and proposes efficient policy optimization and learning algorithms. When the environment dynamics are known, a polynomial-time greedy policy optimization method achieves a $1/2$-approximation ratio. When the dynamics are unknown, an online learning algorithm built on an upper confidence bound (UCB) mechanism yields a $1/2$-regret bound of $O(H^2 K S \sqrt{A T})$. By circumventing the curse of dimensionality in the joint policy space, this approach significantly improves sample efficiency and provides the first theoretically guaranteed solution for multi-agent coordination with submodular rewards.
📝 Abstract
In this paper, we study cooperative multi-agent reinforcement learning (MARL) where the joint reward exhibits submodularity, a natural property capturing diminishing marginal returns when adding agents to a team. Unlike standard MARL with additive rewards, submodular rewards model realistic scenarios where agent contributions overlap (e.g., multi-drone surveillance, collaborative exploration). We provide the first formal framework for this setting and develop algorithms with provable guarantees on sample efficiency and regret. For known dynamics, our greedy policy optimization achieves a $1/2$-approximation with polynomial complexity in the number of agents $K$, overcoming the exponential curse of dimensionality inherent in joint policy optimization. For unknown dynamics, we propose a UCB-based learning algorithm achieving a $1/2$-regret of $O(H^2KS\sqrt{AT})$ over $T$ episodes.
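To make the greedy idea concrete, here is a minimal illustrative sketch (not the paper's actual algorithm) of agent-by-agent greedy selection under a submodular joint reward. The reward is modeled as coverage, as in the multi-drone surveillance example: each candidate "policy" is abstracted to the set of cells it covers, and each agent in turn picks the policy with the largest marginal gain given the choices already made. The names `coverage`, `greedy_policy_selection`, and the toy cell sets are hypothetical.

```python
# Illustrative sketch, not the paper's method: agent-by-agent greedy
# selection under a submodular (coverage) joint reward. Each agent's
# candidate policies are abstracted to the sets of cells they cover.

def coverage(chosen):
    """Submodular joint reward: number of distinct cells covered."""
    return len(set().union(*chosen)) if chosen else 0

def greedy_policy_selection(candidate_policies):
    """candidate_policies[k] lists agent k's candidate policies, each a
    frozenset of covered cells. Agents choose sequentially; each picks
    the policy maximizing the marginal gain over prior choices."""
    chosen = []
    for agent_policies in candidate_policies:
        base = coverage(chosen)
        best = max(agent_policies,
                   key=lambda p: coverage(chosen + [p]) - base)
        chosen.append(best)
    return chosen

# Toy instance: 3 agents surveil grid cells; contributions overlap,
# so marginal gains diminish as more agents are added.
policies = [
    [frozenset({1, 2, 3}), frozenset({3, 4})],
    [frozenset({2, 3}), frozenset({4, 5, 6})],
    [frozenset({1, 6}), frozenset({5, 6})],
]
chosen = greedy_policy_selection(policies)
print(coverage(chosen))  # prints 6
```

Because each agent optimizes only its own choice given its predecessors, the search is polynomial in $K$ rather than exponential in the joint policy space; for monotone submodular objectives, this sequential greedy structure is what underlies the $1/2$-approximation guarantee.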