QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

📅 2026-02-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent reinforcement learning, the max operator commonly induces systematic overestimation of Q-values under the combinatorial explosion of joint action spaces, leading to suboptimal policies and training instability. This work proposes QSIM, a novel framework that, for the first time, incorporates action similarity into the construction of temporal-difference (TD) targets. Instead of relying on greedy action selection, QSIM reconstructs the expected Q-value by weighting near-greedy joint actions according to their similarity, thereby smoothing the TD target while preserving behavioral diversity and mitigating overestimation. The QSIM framework integrates seamlessly with mainstream value decomposition methods, consistently achieving significant improvements in both performance and training stability across multiple benchmark tasks, while effectively alleviating Q-value overestimation.

📝 Abstract
Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
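To make the abstract's core idea concrete, below is a minimal sketch of a similarity-weighted TD target in the spirit of QSIM. The similarity measure (per-agent agreement with the greedy joint action), the top-k construction of the near-greedy set, and the temperature parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def qsim_target(q_values, joint_actions, k=4, temperature=1.0):
    """Similarity-weighted expectation over a near-greedy joint action set.

    q_values:      (M,) Q-value of each candidate joint action
    joint_actions: (M, n_agents) each agent's discrete action per candidate
    k:             size of the near-greedy set (top-k by Q-value; assumed)
    """
    q_values = np.asarray(q_values, dtype=float)
    joint_actions = np.asarray(joint_actions)

    # 1) Greedy joint action -- what the plain max operator would pick.
    greedy_action = joint_actions[int(np.argmax(q_values))]

    # 2) Near-greedy set: the k highest-valued joint actions.
    near_idx = np.argsort(q_values)[-k:]

    # 3) Similarity to the greedy choice: fraction of agents whose
    #    individual action matches the greedy joint action.
    sim = (joint_actions[near_idx] == greedy_action).mean(axis=1)

    # 4) Weighted expectation: more similar actions get more weight,
    #    so the target is smoothed rather than taken at the single max.
    w = np.exp(sim / temperature)
    w /= w.sum()
    return float(np.dot(w, q_values[near_idx]))
```

Because the result is a convex combination over the near-greedy set, it never exceeds the plain max target, which is the mechanism by which this kind of smoothing can reduce overestimation.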
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
value overestimation
value decomposition
temporal-difference learning
joint action space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-value overestimation
action similarity
value decomposition
multi-agent reinforcement learning
temporal-difference target
🔎 Similar Papers
No similar papers found.
Yuanjun Li
Shandong University
Bin Zhang
Institute of Automation, Chinese Academy of Sciences
AI Agent · Multi-agent System · Reinforcement Learning
Hao Chen
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
Zhouyang Jiang
Shandong University
Dapeng Li
Institute of Automation, Chinese Academy of Sciences
MARL · LLM
Zhiwei Xu
Shandong University
Reinforcement Learning · Multi-Agent System · LLM-based Agent