🤖 AI Summary
In multi-agent reinforcement learning, the max operator commonly induces systematic overestimation of Q-values under the combinatorial explosion of joint action spaces, leading to suboptimal policies and training instability. This work proposes QSIM, a novel framework that, for the first time, incorporates action similarity into the construction of temporal-difference (TD) targets. Instead of relying on greedy action selection, QSIM reconstructs the expected Q-value by weighting near-greedy joint actions according to their similarity to the greedy choice, thereby smoothing the TD target while preserving behavioral diversity and mitigating overestimation. The QSIM framework integrates seamlessly with mainstream value decomposition methods, consistently achieving significant improvements in both performance and training stability across multiple benchmark tasks, while effectively alleviating Q-value overestimation.
📝 Abstract
Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
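To make the core idea concrete, here is a minimal sketch of a similarity-weighted TD target. Note that the abstract does not specify the weighting scheme; this sketch assumes a softmax over precomputed similarity scores (the `temperature` parameter and the `similarity_weighted_target` helper are illustrative, not the paper's actual implementation).

```python
import numpy as np

def similarity_weighted_target(q_values, similarities, temperature=1.0):
    """Blend the Q-values of near-greedy joint actions into a single target,
    giving larger weight to actions more similar to the greedy choice.

    Hypothetical sketch: weights come from a softmax over similarity scores;
    the paper's exact similarity measure and weighting are not shown here.
    """
    weights = np.exp(np.asarray(similarities, dtype=float) / temperature)
    weights /= weights.sum()
    # Weighted expectation replaces the pure max over joint actions.
    return float(np.dot(weights, q_values))

# Toy example: the greedy action (similarity 1.0) gets the largest weight,
# while less similar near-greedy actions pull the target below the raw max,
# damping overestimation.
q = [5.0, 4.5, 4.0]    # Q-values of the greedy + two near-greedy joint actions
sim = [1.0, 0.8, 0.6]  # similarity of each action to the greedy action
target = similarity_weighted_target(q, sim)
```

In this toy case the blended target falls strictly between `min(q)` and `max(q)`, illustrating how averaging over behaviorally related actions smooths the target relative to the max operator.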