QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

📅 2026-02-26
📈 Citations: 0
Influential: 0

🤖 AI Summary
In multi-agent reinforcement learning, the max operator used to build temporal-difference (TD) targets systematically overestimates Q-values, and the combinatorial growth of the joint action space makes this bias worse, leading to suboptimal policies and unstable training. This work proposes QSIM, a framework described as the first to incorporate action similarity into the construction of TD targets. Instead of bootstrapping from the greedy joint action alone, QSIM computes a similarity-weighted expectation of Q-values over near-greedy joint actions, which smooths the TD target while still drawing on diverse, behaviorally related actions. QSIM can be combined with mainstream value decomposition methods, and the reported experiments show consistent gains in performance and training stability across multiple benchmark tasks, along with reduced Q-value overestimation.
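
The link between the max operator and overestimation can be seen in a toy experiment that is not taken from the paper. The sketch below assumes every joint action has a true value of zero and that each estimate carries independent Gaussian noise; the function name `max_bias` and all parameters are hypothetical. Under those assumptions, any positive average of the maximum is pure overestimation, and it grows as the joint action space (`n_actions ** n_agents`) grows.

```python
import random
import statistics


def max_bias(n_agents, n_actions=3, noise_std=1.0, trials=2000, seed=0):
    """Average max over noisy Q estimates when every true Q-value is 0.

    Assumption: estimation errors are independent Gaussians, which is a
    simplification and not a claim about any particular MARL algorithm.
    """
    rng = random.Random(seed)
    n_joint = n_actions ** n_agents  # size of the joint action space
    maxima = [
        max(rng.gauss(0.0, noise_std) for _ in range(n_joint))
        for _ in range(trials)
    ]
    # The true maximum is 0, so this mean is the overestimation bias.
    return statistics.fmean(maxima)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"agents={n} joint_actions={3 ** n} bias={max_bias(n):.3f}")
```

The printed bias should rise with the number of agents, which is the effect that smoothing the TD target is meant to dampen.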

📝 Abstract
Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
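
The idea of a similarity-weighted expectation over near-greedy joint actions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the similarity measure or the near-greedy set, so the sketch assumes the set contains joint actions that differ from the greedy one in at most one agent's action, and that similarity decays as `exp(-distance / temperature)` with Hamming distance. The names `near_greedy_set`, `similarity_weighted_target`, and `temperature` are hypothetical.

```python
import math


def near_greedy_set(greedy, n_actions):
    """Joint actions within Hamming distance 1 of the greedy joint action.

    Assumption: this is one possible "structured near-greedy" set; the
    paper may define it differently.
    """
    candidates = [(tuple(greedy), 0)]
    for agent, greedy_action in enumerate(greedy):
        for action in range(n_actions):
            if action != greedy_action:
                alt = list(greedy)
                alt[agent] = action  # change one agent's action only
                candidates.append((tuple(alt), 1))
    return candidates


def similarity_weighted_target(q_next, reward, n_actions,
                               gamma=0.99, temperature=1.0):
    """TD target from a similarity-weighted average of next-state Q-values.

    q_next maps each joint action (a tuple) to its estimated Q-value at
    the next state. Weights are an assumed exp(-distance / temperature).
    """
    greedy = max(q_next, key=q_next.get)
    candidates = near_greedy_set(greedy, n_actions)
    weights = [math.exp(-dist / temperature) for _, dist in candidates]
    weighted_sum = sum(w * q_next[joint]
                       for w, (joint, _) in zip(weights, candidates))
    expected_q = weighted_sum / sum(weights)  # normalized expectation
    return reward + gamma * expected_q


if __name__ == "__main__":
    q_next = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 2.0}
    smoothed = similarity_weighted_target(q_next, reward=1.0, n_actions=2)
    greedy_only = 1.0 + 0.99 * max(q_next.values())
    print(smoothed, greedy_only)
```

Because the weighted average is a convex combination of Q-values that are each no larger than the greedy one, the smoothed target never exceeds the standard max-based target, and it approaches that target as `temperature` shrinks toward zero.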
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
value overestimation
value decomposition
temporal-difference learning
joint action space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-value overestimation
action similarity
value decomposition
multi-agent reinforcement learning
temporal-difference target