🤖 AI Summary
This work identifies a fundamental flaw in the Filar–Tolwinski (FT) algorithm: it may fail to converge to a saddle-point policy, or even enter infinite cycles, in zero-sum Markov games and robust MDPs, owing to an implicit, unjustified assumption in its original convergence proof. To address this, we propose Residual-Conditioned Policy Iteration (RCPI), a framework that unifies policy iteration with saddle-point optimization. RCPI introduces a residual-driven update rule that guarantees convergence to a saddle point while substantially improving computational efficiency, and we establish that RCPI converges linearly under standard assumptions. Numerical experiments show that RCPI achieves speedups of one to three orders of magnitude over existing convergent algorithms, with markedly improved stability. This is the first systematic exposition of the FT algorithm's convergence failure and the first method to pair rigorous convergence guarantees with high empirical efficiency for saddle-point policy computation in zero-sum Markov games and robust MDPs.
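For readers unfamiliar with the objects involved, the following standard definitions (not spelled out in the summary itself) give one plausible reading of the residual being monitored; the paper's exact residual may differ. For a discounted zero-sum Markov game with rewards $r(s,a,b)$, transitions $p(s' \mid s,a,b)$, and discount $\gamma \in (0,1)$, the Shapley operator applies a matrix-game value at each state:

$$
(TV)(s) \;=\; \operatorname{val}_{a \in A,\, b \in B}\!\Big[\, r(s,a,b) + \gamma \sum_{s'} p(s' \mid s,a,b)\, V(s') \,\Big].
$$

Because $T$ is a $\gamma$-contraction in the sup norm, the Bellman residual $\lVert TV - V \rVert_\infty$ certifies proximity to the saddle-point value: $\lVert V - V^\star \rVert_\infty \le \lVert TV - V \rVert_\infty / (1 - \gamma)$.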
📝 Abstract
Markov games and robust MDPs are closely related models that involve computing a pair of saddle-point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar–Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point given in the original paper. As our second contribution, we propose Residual-Conditioned Policy Iteration (RCPI). RCPI builds on FT but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
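To make the setting concrete, below is a minimal sketch of residual-monitored Shapley iteration for a zero-sum Markov game. It is not the authors' RCPI, whose update rule the abstract does not specify; it only illustrates the per-state stage-game solve and the sup-norm Bellman residual that a residual-conditioned method would track. All names (`matrix_game_value`, `shapley_residual_iteration`) are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes).

    Solves max_x min_j sum_i x_i M[i, j] via the standard LP:
    maximize v  s.t.  M^T x >= v,  sum(x) = 1,  x >= 0.
    """
    m, n = M.shape
    # Decision variables: [x_1, ..., x_m, v]; linprog minimizes, so use -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Row j encodes: v - sum_i M[i, j] x_i <= 0.
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must be a probability distribution; v is free.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def shapley_residual_iteration(R, P, gamma, tol=1e-8, max_iter=10_000):
    """Residual-monitored Shapley iteration for a zero-sum Markov game.

    R: rewards, shape (S, A, B).  P: transitions, shape (S, A, B, S).
    Returns the approximate game value V and the final Bellman residual.
    """
    S = R.shape[0]
    V = np.zeros(S)
    residual = np.inf
    for _ in range(max_iter):
        TV = np.empty(S)
        for s in range(S):
            # Stage game at state s: immediate reward plus discounted continuation.
            M = R[s] + gamma * (P[s] @ V)
            TV[s] = matrix_game_value(M)
        residual = np.max(np.abs(TV - V))  # sup-norm Bellman residual
        V = TV
        if residual < tol:  # residual certifies proximity to the saddle-point value
            break
    return V, residual
```

In a full RCPI-style method, one would presumably accept a policy update only when it reduces a residual of this kind, which is what rules out the cycling behavior that FT can exhibit; the precise conditioning rule is given in the paper, not here.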