🤖 AI Summary
This work addresses the global convergence of independent policy gradient methods in two-player zero-sum convex Markov games (cMGs). Fundamental challenges, including infinite horizons, nonconvex policy parameterizations, and the absence of Bellman consistency, hinder existing analyses. To overcome these, we introduce a novel *hidden-convex–hidden-concave* regularization framework that reformulates the original min-max problem into an objective satisfying the nonconvex-proximal Polyak–Łojasiewicz (NC-pPL) condition. Building on this, we propose stochastic nested and alternating gradient descent–ascent algorithms. We establish the first global convergence guarantee to Nash equilibria in cMGs, with an explicit sublinear convergence rate. Our analysis generalizes beyond cMGs: the methodology extends directly to generic constrained min-max optimization problems, offering broad applicability in nonconvex game-theoretic and adversarial learning settings.
📝 Abstract
We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. We follow a two-step approach. First, leveraging properties of hidden-convex–hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex-proximal Polyak–Łojasiewicz (NC-pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent–ascent methods, which we believe may be of independent interest.
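The stabilizing effect of regularization claimed in the abstract can be illustrated on a toy bilinear saddle point, far simpler than a cMG. This is a minimal sketch, not the paper's algorithm: on f(x, y) = xy, simultaneous gradient descent–ascent spirals away from the equilibrium (0, 0), while adding a strongly convex–strongly concave regularizer (λ/2)x² − (λ/2)y² makes the iterates contract toward it.

```python
def gda(lam, eta=0.1, steps=200, x=1.0, y=1.0):
    """Simultaneous gradient descent-ascent on the regularized objective
    f(x, y) = x*y + (lam/2)*x**2 - (lam/2)*y**2 (toy illustration only)."""
    for _ in range(steps):
        gx = y + lam * x  # gradient of the regularized objective in x
        gy = x - lam * y  # gradient of the regularized objective in y
        x, y = x - eta * gx, y + eta * gy  # descent in x, ascent in y
    return (x ** 2 + y ** 2) ** 0.5  # distance to the equilibrium (0, 0)

print(gda(lam=0.0))  # unregularized: iterates spiral outward, distance grows
print(gda(lam=0.5))  # regularized: iterates contract toward the equilibrium
```

Here the regularized game shares its equilibrium (0, 0) with the original bilinear game, so stabilizing the iterates does not move the solution; in the paper's setting, controlling the bias introduced by regularization is part of the analysis.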