TVDO: Tchebycheff Value-Decomposition Optimization for Multi-Agent Reinforcement Learning

📅 2023-06-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multi-agent reinforcement learning (MARL) under the centralized training with decentralized execution (CTDE) paradigm, a fundamental inconsistency arises between jointly trained policies and independently executed actions. Method: the paper proposes a Tchebycheff-scalarized multi-objective value-decomposition approach (TVDO), formulating Q-function learning as a constrained multi-objective optimization problem. Contribution/Results: it provides a theoretical proof, without extra assumptions, that the method satisfies the individual-global-max (IGM) necessary and sufficient condition, and it rigorously upper-bounds the deviation between individual action-value estimates and their true values, thereby guaranteeing consistency between globally and individually optimal action selections. Experiments demonstrate exact value factorization in the climb and penalty games; in StarCraft II micromanagement tasks, TVDO outperforms mainstream methods including QMIX and VDN, empirically supporting both the theoretical consistency result and practical efficacy.
📝 Abstract
In cooperative multi-agent reinforcement learning (MARL), centralized training with decentralized execution (CTDE) has recently become customary due to practical deployment constraints. However, a central dilemma is the inconsistency between jointly trained policies and individually optimized actions. In this work, we propose a novel value-based multi-objective learning approach, named Tchebycheff value-decomposition optimization (TVDO), to overcome this dilemma. In particular, a nonlinear Tchebycheff aggregation method is designed to transform the MARL task into a multi-objective optimization counterpart by tightly constraining the upper bound of the individual action-value bias. We theoretically prove that TVDO satisfies the necessary and sufficient condition of individual-global-max (IGM) with no extra limitations, which exactly guarantees the consistency between the global and individual optimal action-value functions. Empirically, in the climb and penalty games, we verify that TVDO factorizes the global value into individual values with a guarantee of policy consistency. Furthermore, we also evaluate TVDO on challenging StarCraft II micromanagement tasks, where extensive experiments demonstrate that TVDO achieves more competitive performance than several state-of-the-art MARL methods.
Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistency in multi-agent reinforcement learning policies
Proposes Tchebycheff value-decomposition for global optimum
Ensures consistency between global and individual optimal actions
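The consistency problem above is the IGM condition: the greedy joint action of the global Q-function must coincide with the tuple of each agent's greedy action. A minimal tabular sketch (generic, not the paper's construction; the helper `igm_holds` is hypothetical):

```python
import numpy as np

def igm_holds(q_joint, q_individual):
    """Check the Individual-Global-Max (IGM) condition on a tabular example:
    the joint greedy action must equal the tuple of per-agent greedy actions.

    q_joint: array of shape (A1, A2, ...) with joint action-values.
    q_individual: list of 1-D arrays of per-agent action-values.
    """
    joint_greedy = np.unravel_index(np.argmax(q_joint), q_joint.shape)
    individual_greedy = tuple(int(np.argmax(q)) for q in q_individual)
    return joint_greedy == individual_greedy

# Two-agent example where IGM holds: Q_tot is the sum of individual Qs
# (an additive, VDN-style decomposition, used here only for illustration).
q1 = np.array([1.0, 3.0])
q2 = np.array([0.5, 2.0])
q_tot = q1[:, None] + q2[None, :]
assert igm_holds(q_tot, [q1, q2])  # joint greedy (1, 1) matches per-agent argmaxes
```

A joint Q-table with a miscoordination penalty, e.g. one whose maximum sits at (0, 0) while the per-agent argmaxes are (1, 1), violates the check, which is exactly the execution-time inconsistency the paper targets.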
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tchebycheff aggregation for value decomposition
Guarantees Individual-Global-Max consistency
Nonlinear upper bound constraint optimization
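The core scalarization idea can be sketched with the standard weighted Tchebycheff function, which bounds the worst weighted deviation from an ideal point; this is a generic illustration, and the paper's exact aggregation over per-agent action-value biases may differ:

```python
import numpy as np

def tchebycheff(values, weights, ideal):
    """Weighted Tchebycheff scalarization of a multi-objective vector:
    g(x) = max_i w_i * |f_i(x) - z_i*|, where z* is an ideal (utopia) point.
    Minimizing g tightly constrains the largest weighted deviation, and it can
    reach Pareto-optimal points on non-convex fronts, which a linear
    weighted sum cannot.
    """
    return np.max(weights * np.abs(values - ideal))

# Compare two candidate objective vectors against an ideal point.
ideal = np.array([10.0, 10.0])
w = np.array([0.5, 0.5])
a = np.array([9.0, 6.0])   # worst deviation 4.0 -> score 2.0
b = np.array([7.5, 7.5])   # worst deviation 2.5 -> score 1.25
print(tchebycheff(a, w, ideal))  # 2.0
print(tchebycheff(b, w, ideal))  # 1.25
```

The max-over-objectives form is what makes the aggregation nonlinear: it penalizes the single worst individual deviation rather than averaging deviations away, matching the "upper bound constraint" framing above.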
👥 Authors
Xiaoliang Hu
Nanjing University of Science and Technology, Nanjing, China
Pengcheng Guo
Northwestern Polytechnical University · Speech Recognition, Machine Learning, Deep Learning
Chuanwei Zhou
Nanjing University of Science and Technology, Nanjing, China
Tong Zhang
Nanjing University of Science and Technology, Nanjing, China
Zhen Cui
Beijing Normal University · Pattern Recognition and Computer Vision