🤖 AI Summary
This work addresses last-iterate convergence in zero-sum matrix games with bandit feedback, a setting in which existing uncoupled algorithms fall short of the optimal rate. The authors propose an online mirror descent method with log-barrier regularization, analyzed through a dual-space framework. In the uncoupled setting it attains, with high probability, an exploitability gap of Õ(t^{-1/4}), matching the known lower bound Ω(t^{-1/4}) up to logarithmic factors. The approach is further extended to extensive-form games, where a bound with the same rate is proved.
📝 Abstract
We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound of Ω(t^{-1/4}) on the exploitability gap. Several online mirror descent algorithms have been proposed in the literature for this problem, but none has actually attained this rate. We show that log-barrier regularization, combined with a dual-focused analysis, yields Õ(t^{-1/4}) convergence with high probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
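To make the ingredients named in the abstract concrete, the following is a minimal sketch of a textbook online mirror descent step on the simplex with the log-barrier regularizer -Σ log x_i, fed by importance-weighted bandit loss estimates, together with the exploitability gap of a strategy pair. It is not the authors' algorithm: the abstract does not give their update, step sizes, or loss estimator, so the constant step size `eta`, the loss convention (row player minimizes `A[i][j]` in [0, 1], column player minimizes `1 - A[i][j]`), and all function names here are illustrative assumptions.

```python
import random


def log_barrier_omd_step(x, loss_est, eta, iters=100):
    """One OMD step on the simplex with regularizer -sum(log x_i).

    The update satisfies 1/x_new[i] = 1/x[i] + eta * (loss_est[i] - lam),
    where the scalar lam is found by bisection so that x_new sums to 1.
    Assumes every entry of x is strictly positive and len(x) >= 2.
    """
    def candidate(lam):
        return [1.0 / (1.0 / xi + eta * (li - lam)) for xi, li in zip(x, loss_est)]

    # At lo every coordinate shrinks, so the candidate sums to at most 1.
    lo = min(loss_est)
    # At hi one coordinate equals exactly 1, so the candidate sums to at least 1.
    hi = min(li + (1.0 / xi - 1.0) / eta for xi, li in zip(x, loss_est))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(candidate(mid)) > 1.0:
            hi = mid
        else:
            lo = mid
    x_new = candidate(lo)
    total = sum(x_new)  # renormalise to remove residual bisection error
    return [v / total for v in x_new]


def importance_weighted_loss(x, action, observed_loss):
    """Unbiased loss estimate from bandit feedback: only the played action is observed."""
    est = [0.0] * len(x)
    est[action] = observed_loss / x[action]
    return est


def exploitability_gap(A, x, y):
    """Best-response value for the column player minus that for the row player."""
    n, m = len(A), len(A[0])
    col_values = [sum(x[i] * A[i][j] for i in range(n)) for j in range(m)]
    row_values = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    return max(col_values) - min(row_values)


def run_self_play(A, steps, eta, seed=0):
    """Uncoupled self-play: each player sees only the loss of its own sampled action."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    x = [1.0 / n] * n
    y = [1.0 / m] * m
    for _ in range(steps):
        i = rng.choices(range(n), weights=x)[0]
        j = rng.choices(range(m), weights=y)[0]
        row_loss = A[i][j]        # row player minimises A[i][j]
        col_loss = 1.0 - A[i][j]  # column player minimises 1 - A[i][j]
        x = log_barrier_omd_step(x, importance_weighted_loss(x, i, row_loss), eta)
        y = log_barrier_omd_step(y, importance_weighted_loss(y, j, col_loss), eta)
    return x, y


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]  # matching pennies with losses rescaled to [0, 1]
    x, y = run_self_play(A, steps=2000, eta=0.05, seed=0)
    print("last iterate:", x, y)
    print("exploitability gap:", exploitability_gap(A, x, y))
```

Calling `exploitability_gap(A, x, y)` on the last iterate `(x, y)` measures the quantity that the stated rates bound. This sketch with a fixed `eta` is only meant to show the mechanics and carries no convergence guarantee of its own.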