🤖 AI Summary
Value decomposition methods in multi-agent reinforcement learning suffer from representational limitations and optimization difficulties, particularly when the joint value decomposition is insufficiently expressive or when local agent observations mismatch global state features, which leads to misleading policy updates.
Method: We propose the Heterogeneous Policy Fusion (HPF) framework, which avoids designing a novel decomposition architecture and instead horizontally integrates mainstream value decomposition methods (e.g., VDN, QMIX, QPLEX). HPF employs a policy-ensemble-driven dynamic selection mechanism for adaptive scheduling, enforces consistency constraints across heterogeneous policies to mitigate exploration bias, and combines policy distillation with collaborative updates to ensure high-quality joint policies.
Results: Evaluated on standard StarCraft II cooperative tasks, HPF outperforms multiple baseline methods, with faster convergence, more robust collaboration, and plug-and-play compatibility that requires no modifications to the underlying algorithms.
📝 Abstract
Value decomposition (VD) has become one of the most prominent solutions in cooperative multi-agent reinforcement learning. Most existing methods explore how to factorize the joint value and minimize the discrepancies between agent observations and the characteristics of environmental states. However, direct decomposition may result in limited representational capacity or difficulty in optimization. Orthogonal to designing a new factorization scheme, in this paper we propose Heterogeneous Policy Fusion (HPF) to integrate the strengths of various VD methods. We construct a composite policy set from which policies are adaptively selected for interaction. This adaptive mechanism allows agents' trajectories to benefit from diverse policy transitions while incorporating the advantages of each factorization method. Additionally, HPF introduces a constraint between these heterogeneous policies to rectify misleading updates caused by unexpected exploration or suboptimal non-cooperative behavior. Experimental results on cooperative tasks show HPF's superior performance over multiple baselines, demonstrating its effectiveness and ease of implementation.
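The abstract does not state the selection rule or the form of the constraint, so the following is only an illustrative sketch of how a composite policy set with adaptive selection might look. It assumes a softmax over each candidate policy's joint-value estimate as the selection rule and a mean squared gap between per-agent values as the constraint; both are assumptions made here for illustration, not the paper's definitions. The only standard piece is the VDN-style additive factorization, in which the joint value is the sum of per-agent values. All function names are hypothetical.

```python
import math
import random


def vdn_joint_value(agent_qs):
    """VDN-style factorization: joint value is the sum of per-agent values."""
    return sum(agent_qs)


def selection_probs(joint_values, temperature=1.0):
    """Softmax over the candidate policies' joint-value estimates.

    Assumed selection rule for illustration; the source does not specify one.
    """
    peak = max(joint_values)  # subtract the max for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in joint_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_policy(joint_values, temperature=1.0, rng=random):
    """Sample the index of the policy that will interact with the environment."""
    probs = selection_probs(joint_values, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def consistency_penalty(qs_a, qs_b):
    """Mean squared gap between two policies' per-agent values.

    Assumed form of the constraint between heterogeneous policies.
    """
    return sum((a - b) ** 2 for a, b in zip(qs_a, qs_b)) / len(qs_a)
```

In this sketch, a policy whose factorization currently yields a higher joint-value estimate is sampled more often, while the temperature keeps the other policies in play so that trajectories still mix transitions from several factorization methods.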