🤖 AI Summary
In open multi-agent systems, the problem of n-agent ad hoc teamwork (NAHT), where only a subset of the agents is controllable, remains challenging: existing approaches rely predominantly on heuristic designs that lack theoretical rigor and interpretable credit assignment mechanisms.
Method: This paper introduces the first direct integration of cooperative game theory—specifically, the Shapley value’s axiomatic foundation—into a multi-agent reinforcement learning framework. We propose an online Shapley value estimation algorithm grounded in TD(λ), enabling axiomatically compliant state-value decomposition in dynamic environments.
Contribution/Results: The method combines theoretical soundness with interpretable attribution. Empirical evaluation across multiple NAHT benchmark tasks demonstrates significant performance gains over heuristic baselines, jointly validating improved collaborative efficacy and principled credit allocation.
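The TD(λ) grounding mentioned above refers to the classical temporal-difference method with eligibility traces. As background only, here is a minimal tabular TD(λ) sketch (the standard textbook algorithm, not the paper's Shapley estimator; the episode tuple format is an assumption for illustration):

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    episodes: list of trajectories, each a list of
              (state, reward, next_state, done) tuples (illustrative format).
    Returns the estimated state-value table V.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)  # eligibility traces, reset each episode
        for s, r, s_next, done in episode:
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]   # TD error
            e[s] += 1.0             # accumulate trace for the visited state
            V += alpha * delta * e  # credit all recently visited states
            e *= gamma * lam        # decay traces
    return V
```

The trace vector `e` is what lets a single TD error update many past states at once; the paper's estimator analogously propagates per-agent value credit online rather than waiting for full returns.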
📝 Abstract
Open multi-agent systems are increasingly important in modeling real-world applications, such as smart grids and swarm robotics. In this paper, we investigate a recently proposed problem for open multi-agent systems, referred to as n-agent ad hoc teamwork (NAHT), where only a subset of the agents is controlled. Existing methods tend to be based on heuristic design and consequently lack theoretical rigor and suffer from ambiguous credit assignment among agents. To address these limitations, we model and solve NAHT through the lens of cooperative game theory. More specifically, we first model an open multi-agent system, characterized by its value, as an instance situated in a space of cooperative games, generated by a set of basis games. We then extend this space, along with the state space, to accommodate dynamic scenarios, thereby characterizing NAHT. Exploiting the justifiable assumption that basis game values correspond to a sequence of n-step returns with different horizons, we represent the state values for NAHT in a form similar to $\lambda$-returns. Furthermore, we derive Shapley values to allocate state values to the controlled agents, as credits for their contributions to the ad hoc team. Unlike the conventional approach of defining Shapley values in an explicit closed form, we shape Shapley values by fulfilling the three axioms that uniquely characterize them, well defined on the extended game space describing NAHT. To estimate Shapley values in dynamic scenarios, we propose a TD($\lambda$)-like algorithm. The resulting reinforcement learning (RL) algorithm is referred to as Shapley Machine. To the best of our knowledge, this is the first time that concepts from cooperative game theory have been directly related to RL concepts. In experiments, we demonstrate the effectiveness of Shapley Machine and verify the reasonableness of our theory.
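The abstract contrasts its axiomatic construction with the conventional explicit form of the Shapley value. For orientation, that standard closed form, the weighted average of each player's marginal contributions over all coalitions, can be sketched as follows (a generic cooperative-game illustration, not the paper's online estimator):

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values for a characteristic function v: frozenset -> float.

    phi_i = sum over coalitions S not containing i of
            |S|! (n - |S| - 1)! / n!  *  (v(S ∪ {i}) - v(S))
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))  # marginal contribution of i
        phi[i] = total
    return phi
```

This exact computation enumerates all 2^(n-1) coalitions per player, which is exactly why an online, TD-style estimator of the kind the paper proposes is attractive in dynamic multi-agent settings.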