🤖 AI Summary
This work addresses the computational intractability that generally arises when communication and control policies are learned jointly in partially observable multi-agent systems, which stems from non-classical information structures. It formalizes the problem as a decentralized partially observable Markov decision process (Dec-POMDP) under the common-information-based framework and focuses on the tractable subclass with quasi-classical information structures. In doing so, it establishes a theoretical link between information structure theory in decentralized stochastic control and learnable communication protocols. The authors introduce verifiable conditions under which information sharing preserves quasi-classicality, and they go beyond the conventional reliance on strategy-independent common-information-based beliefs. Leveraging a decomposition of common information and an analysis of strategy-independent beliefs, they develop planning and reinforcement learning algorithms with quasi-polynomial time and sample complexity, expanding the class of Dec-POMDPs that can be solved efficiently.
📝 Abstract
Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on their ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs under which the QC IS is preserved after information sharing, and show that violating these conditions can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC ISs and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs beyond those with SI-CIBs without computationally intractable oracles, which may be of independent interest.