Offline Two-Player Zero-Sum Markov Games with KL Regularization

📅 2026-05-13
🤖 AI Summary
This work studies learning Nash equilibria in offline two-player zero-sum Markov games, where distribution shift is the central difficulty. The authors show that KL regularization alone, without an explicit pessimism mechanism, suffices to stabilize learning. They introduce ROSE (Regularized Offline Sequential Equilibrium), a theoretical framework that attains an Õ(1/n) statistical rate under a unilateral concentrability assumption, and SOS-MD (Sequential Offline Self-play Mirror Descent), a model-free algorithm that combines least-squares value estimation with iterative self-play mirror-descent updates. The last iterate of SOS-MD matches the Õ(1/n) statistical rate up to an optimization error of order Õ(1/√T) in the number of self-play iterations T, improving on the standard Õ(1/√n) rates of unregularized settings.
📝 Abstract
We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under unilateral concentrability, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.
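To make the idea of KL-regularized self-play mirror descent concrete, here is a minimal sketch on a single-state (matrix) game. It is not the authors' SOS-MD: the payoff matrix is assumed known rather than estimated by least squares from offline data, both players update simultaneously, and the names `kl_md_step`, `self_play`, `beta` (regularization strength), and `eta` (step size) are hypothetical choices made for illustration.

```python
import math


def kl_md_step(p, p_ref, payoff, beta, eta):
    """One mirror-ascent step (negative-entropy mirror map) on
    <p, payoff> - beta * KL(p || p_ref).

    Assumes every entry of p and p_ref is strictly positive.
    """
    logits = [
        (1.0 - eta * beta) * math.log(pi) + eta * beta * math.log(ri) + eta * gi
        for pi, ri, gi in zip(p, p_ref, payoff)
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def self_play(A, x_ref, y_ref, beta=1.0, eta=0.1, T=500, x0=None, y0=None):
    """Simultaneous KL-regularized self-play on the matrix game x^T A y.

    The row player maximizes, the column player minimizes. This is an
    illustrative stand-in, not the algorithm from the paper.
    """
    x = list(x0 if x0 is not None else x_ref)
    y = list(y0 if y0 is not None else y_ref)
    rows, cols = len(x), len(y)
    for _ in range(T):
        # Expected payoff of each pure action against the opponent's current policy.
        gain_x = [sum(A[i][j] * y[j] for j in range(cols)) for i in range(rows)]
        gain_y = [-sum(A[i][j] * x[i] for i in range(rows)) for j in range(cols)]
        x, y = (
            kl_md_step(x, x_ref, gain_x, beta, eta),
            kl_md_step(y, y_ref, gain_y, beta, eta),
        )
    return x, y


# Rock-paper-scissors with a uniform reference policy for both players.
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
x_final, y_final = self_play(
    rps, uniform, uniform, x0=[0.6, 0.3, 0.1], y0=[0.2, 0.5, 0.3]
)
```

In this toy game the regularized equilibrium is the uniform policy by symmetry, so the last iterates `x_final` and `y_final` should drift from their skewed starting points toward uniform. The KL term pulls each player toward its reference policy, which is the stabilizing role the abstract attributes to regularization.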
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
two-player zero-sum Markov games
Nash equilibria
distribution shift
KL regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

KL regularization
offline Markov games
Nash equilibrium
fast convergence rate
self-play mirror descent