Offline Two-Player Zero-Sum Markov Games with KL Regularization

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses learning Nash equilibria in offline two-player zero-sum Markov games, where the main obstacle is distribution shift. The authors show that KL regularization alone, without an explicit pessimism mechanism, suffices to stabilize learning and guarantee convergence. They introduce ROSE (Regularized Offline Sequential Equilibrium), a theoretical framework that attains an Õ(1/n) convergence rate under a unilateral concentrability assumption, improving over the standard Õ(1/√n) rates of unregularized settings. They also propose SOS-MD (Sequential Offline Self-play Mirror Descent), a model-free algorithm that combines least-squares value estimation with iterative self-play mirror-descent updates. The last iterate of SOS-MD attains the same Õ(1/n) statistical rate, up to an optimization error of order Õ(1/√T) that vanishes as the number of self-play iterations T grows.

📝 Abstract

We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under \textit{unilateral concentrability}, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.
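
To make the role of KL regularization in self-play concrete, the following is a minimal sketch of KL-regularized mirror descent on a single zero-sum matrix game. It is not SOS-MD: the payoff matrix is assumed to be known exactly, there is no offline dataset and no least-squares value estimation, and the game has a single stage rather than a Markov structure. The function names, the update form (new policy proportional to `p^(1 - eta*beta) * p_ref^(eta*beta) * exp(eta * gain)`), and the parameters `beta`, `eta`, and `T` are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_md_step(p, p_ref, gain, beta, eta):
    """One KL-regularized mirror-descent step on the probability simplex.

    Assumed update (illustrative, not the paper's exact rule):
    new policy is proportional to p^(1 - eta*beta) * p_ref^(eta*beta) * exp(eta * gain).
    """
    logits = [
        (1.0 - eta * beta) * math.log(pi) + eta * beta * math.log(ri) + eta * gi
        for pi, ri, gi in zip(p, p_ref, gain)
    ]
    return softmax(logits)


def self_play(A, x_ref, y_ref, x0, y0, beta=1.0, eta=0.1, T=2000):
    """Simultaneous self-play: the row player maximizes x^T A y, the column player minimizes it."""
    x, y = list(x0), list(y0)
    n, m = len(A), len(A[0])
    for _ in range(T):
        # Each player's payoff vector is computed against the opponent's current policy.
        gain_x = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
        gain_y = [-sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
        x, y = (
            kl_md_step(x, x_ref, gain_x, beta, eta),
            kl_md_step(y, y_ref, gain_y, beta, eta),
        )
    return x, y


# Rock-paper-scissors with uniform reference policies (hypothetical toy setup).
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3] * 3
x, y = self_play(rps, uniform, uniform, x0=[0.7, 0.2, 0.1], y0=[0.1, 0.1, 0.8])
print(x, y)
```

In this toy game the regularized equilibrium is the uniform policy for both players, by symmetry, so the last iterates `x` and `y` should approach uniform from the skewed starting policies. The sketch only illustrates how the KL term pulls each iterate toward the reference policy; how SOS-MD estimates values from offline data is described in the paper itself.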

Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning

two-player zero-sum Markov games

Nash equilibria

distribution shift

KL regularization

Innovation

Methods, ideas, or system contributions that make the work stand out.

KL regularization

offline Markov games

Nash equilibrium