Convergence of Fast Policy Iteration in Markov Games and Robust MDPs

📅 2025-08-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a fundamental flaw in the Filar–Tolwinski (FT) algorithm: it may fail to converge to a saddle-point policy, or even loop indefinitely, in zero-sum Markov games and robust MDPs, due to an implicit, unjustified assumption in its original convergence proof. To address this, the authors propose Residual-Conditioned Policy Iteration (RCPI), a novel framework unifying policy iteration with saddle-point optimization. RCPI introduces a residual-driven update rule that ensures global convergence to the unique saddle point while substantially improving computational efficiency, and it exhibits linear convergence under standard assumptions. Numerical experiments demonstrate speedups of 1 to 3 orders of magnitude over existing convergent algorithms, with markedly improved stability. This is the first systematic exposition of the FT algorithm's convergence failure and the first method to combine rigorous convergence guarantees with high empirical efficiency for saddle-point policy computation in zero-sum Markov games and robust MDPs.

📝 Abstract
Markov games and robust MDPs are closely related models that involve computing a pair of saddle point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT, but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
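The abstract does not spell out RCPI's update rule, but the Bellman residual that gives the method its name can be illustrated on a toy robust MDP. The sketch below runs standard robust value iteration (a well-known baseline, not the paper's algorithm): each state-action pair has a finite uncertainty set of transition models, the adversary picks the worst one, and the residual ||Tv − v||∞ is the quantity driven to zero. All model numbers are made up for illustration.

```python
# Schematic robust value iteration on a toy robust MDP (NOT the paper's RCPI).
# The Bellman residual ||Tv - v||_inf is the natural convergence measure
# that a residual-conditioned method monitors.

GAMMA = 0.9  # discount factor

# Toy robust MDP: 2 states, 2 actions. transitions[s][a] is the uncertainty
# set: a list of candidate (reward, next-state distribution) models.
transitions = {
    0: {
        0: [(1.0, {0: 0.9, 1: 0.1}), (1.0, {0: 0.5, 1: 0.5})],
        1: [(0.0, {1: 1.0})],
    },
    1: {
        0: [(2.0, {0: 1.0}), (2.0, {0: 0.7, 1: 0.3})],
        1: [(0.5, {1: 1.0})],
    },
}

def bellman(v):
    """Robust Bellman operator: max over actions, min over the uncertainty set."""
    return {
        s: max(
            min(r + GAMMA * sum(p * v[s2] for s2, p in dist.items())
                for r, dist in models)
            for models in actions.values()
        )
        for s, actions in transitions.items()
    }

v = {0: 0.0, 1: 0.0}
for it in range(1000):
    tv = bellman(v)
    residual = max(abs(tv[s] - v[s]) for s in v)  # ||Tv - v||_inf
    v = tv
    if residual < 1e-8:  # stop once the residual certifies near-convergence
        break
```

Because the robust Bellman operator is a γ-contraction, the residual shrinks geometrically, which is why residual thresholds make principled stopping (and, in RCPI's case, update-conditioning) criteria.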
Problem

Research questions and friction points this paper is trying to address.

FT may fail to converge to a saddle point, and can loop indefinitely, even in small Markov games
The original proof of FT's convergence rests on an implicit, unjustified assumption
Existing convergent algorithms for saddle-point policy computation are computationally inefficient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Residual-Conditioned Policy Iteration (RCPI), which builds on FT and unifies policy iteration with saddle-point optimization
Guarantees convergence to a saddle point, with linear convergence under standard assumptions
Outperforms existing convergent algorithms by 1–3 orders of magnitude in numerical experiments
Keith Badger
University of New Hampshire
Jefferson Huang
Naval Postgraduate School
Marek Petrik
University of New Hampshire
Machine Learning