🤖 AI Summary
This letter studies the convergence of decentralized multi-agent actor-critic learning in general-sum Markov games, a setting with no global coordination in which each agent acts independently, without knowledge of the other agents' policies or payoffs. The central construct is the Markov Near-Potential Function (MNPF), which serves as an approximate Lyapunov function for the decentralized policy updates and thereby yields a characterization of the convergent set of strategies even when the game is not an exact potential game. Combining asynchronous stochastic approximation theory with Lyapunov stability analysis, the authors show that the decentralized actor-critic algorithm, run with asynchronous step sizes, converges to the set of policies characterized by the MNPF. Under additional regularity conditions, and when the set of Nash equilibria is finite, the result sharpens to convergence to a neighborhood of Nash equilibria. The analysis thus extends convergence guarantees for decentralized learning in Markov games beyond the zero-sum and potential-game special cases studied previously.
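Since both the summary and the abstract hinge on the MNPF, the following is a sketch of its defining near-potential property; the notation (Φ for the potential, V_i for agent i's value, κ for the closeness parameter) is chosen for illustration and need not match the paper's exact definition.

```latex
% Illustrative near-potential property (notation assumed, not verbatim
% from the paper): \Phi is a Markov near-potential function with
% parameter \kappa \ge 0 if, for every agent i, every pair of that
% agent's policies \pi_i, \pi_i', and every fixed profile \pi_{-i} of
% the other agents' policies,
\[
\Big| \big( V_i(\pi_i', \pi_{-i}) - V_i(\pi_i, \pi_{-i}) \big)
    - \big( \Phi(\pi_i', \pi_{-i}) - \Phi(\pi_i, \pi_{-i}) \big) \Big|
\le \kappa ,
\]
% where V_i denotes agent i's long-run discounted value. Taking
% \kappa = 0 recovers an exact Markov potential game; a small \kappa is
% what lets \Phi act as an approximate Lyapunov function for the
% decentralized policy updates.
```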
📝 Abstract
Markov games provide a powerful framework for modeling strategic multi-agent interactions in dynamic environments. Traditionally, convergence properties of decentralized learning algorithms in these settings have been established only for special cases, such as Markov zero-sum and potential games, which do not fully capture real-world interactions. In this letter, we address this gap by studying the asymptotic properties of learning algorithms in general-sum Markov games. In particular, we focus on a decentralized algorithm where each agent adopts an actor-critic learning dynamic with asynchronous step sizes. This decentralized approach enables agents to operate independently, without requiring knowledge of others’ strategies or payoffs. We introduce the concept of a Markov Near-Potential Function (MNPF) and demonstrate that it serves as an approximate Lyapunov function for the policy updates in the decentralized learning dynamics, which allows us to characterize the convergent set of strategies. We further strengthen our result under specific regularity conditions and when the set of Nash equilibria is finite.
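To make the learning dynamic concrete, below is a minimal Python sketch of an independent actor-critic loop with asynchronous (two-timescale) step sizes, in the spirit of the algorithm described above. The environment (two agents, two states, two actions, random general-sum rewards), the softmax parameterization, and the step-size exponents are all illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Minimal sketch of a decentralized actor-critic loop with asynchronous
# step sizes. The environment (2 agents, 2 states, 2 actions, random
# general-sum rewards) and all hyperparameters are illustrative
# assumptions, not taken from the paper.

rng = np.random.default_rng(0)
n_agents, n_states, n_actions, gamma = 2, 2, 2, 0.95

# R[i][s, a_own, a_other]: agent i's reward table (hypothetical game).
R = [rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_actions))
     for _ in range(n_agents)]
# P[s, a0, a1]: distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]      # critics
theta = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]  # actor logits

def softmax_policy(logits_row):
    z = np.exp(logits_row - logits_row.max())
    return z / z.sum()

s = 0
for t in range(1, 200_001):
    # Asynchronous two-timescale step sizes: the critic moves on a faster
    # schedule than the actor, so each actor effectively sees a nearly
    # converged evaluation of its current policy.
    alpha = t ** -0.6   # critic (fast)
    beta = t ** -0.9    # actor (slow)

    pis = [softmax_policy(theta[i][s]) for i in range(n_agents)]
    a = [rng.choice(n_actions, p=pis[i]) for i in range(n_agents)]
    s_next = rng.choice(n_states, p=P[s, a[0], a[1]])

    for i in range(n_agents):
        # Each agent uses only its own reward and the publicly observed
        # state; the other agent is treated as part of the environment.
        r = R[i][s, a[i], a[1 - i]]
        v_next = softmax_policy(theta[i][s_next]) @ q[i][s_next]
        q[i][s, a[i]] += alpha * (r + gamma * v_next - q[i][s, a[i]])

        # Softmax policy-gradient (score-function) step on the actor,
        # using the agent's own critic for the advantage estimate.
        adv = q[i][s, a[i]] - pis[i] @ q[i][s]
        grad_log = -pis[i].copy()
        grad_log[a[i]] += 1.0
        theta[i][s] += beta * adv * grad_log

    s = s_next

for i in range(n_agents):
    print(f"agent {i} policy per state:",
          np.round([softmax_policy(theta[i][s]) for s in range(n_states)], 3))
```

The structural features this sketch is meant to illustrate are the ones the abstract emphasizes: no agent ever reads another agent's parameters or rewards, and the step-size schedules separate the critic and actor timescales, which is the structure the asynchronous stochastic-approximation analysis exploits.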