The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback

📅 2026-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks whether uncoupled learning algorithms, which require no communication between players, can guarantee that the last-iterate strategies converge to a Nash equilibrium in repeated two-player zero-sum matrix games with bandit feedback. For uncoupled algorithms in this setting, the authors establish a lower bound of Ω(T^(-1/4)) on the convergence rate of last-iterate exploitability, in contrast to the usual T^(-1/2) rate for average iterates. They then propose two algorithms that match this bound up to constant and logarithmic factors: the first relies on a straightforward exploration-exploitation trade-off, and the second uses a regularization technique based on a two-step mirror descent approach.

📝 Abstract

We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $\Omega(T^{-1/4})$ in contrast to the usual $\Omega(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
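
To make the quantities in the abstract concrete, the sketch below computes the exploitability gap of a strategy profile and simulates one round of bandit feedback. It is an illustration only, not the paper's algorithm. The function names `exploitability` and `bandit_round` are hypothetical, and it assumes the standard duality-gap definition, a row player who maximizes x^T A y, and noiseless payoffs, none of which the abstract states explicitly.

```python
import numpy as np


def exploitability(A, x, y):
    """Duality gap of the profile (x, y) for payoff matrix A.

    Assumption: the row player maximizes x^T A y and the column player
    minimizes it. The gap is zero exactly at a Nash equilibrium.
    """
    best_row_value = np.max(A @ y)  # best the row player can get against y
    best_col_value = np.min(x @ A)  # best the column player can force against x
    return float(best_row_value - best_col_value)


def bandit_round(A, x, y, rng):
    """One round of play with bandit feedback.

    Each player samples an action from their own mixed strategy and only
    the payoff of the realized action pair is revealed (assumed noiseless).
    """
    i = rng.choice(len(x), p=x)
    j = rng.choice(len(y), p=y)
    return i, j, A[i, j]


if __name__ == "__main__":
    # Matching pennies: the unique equilibrium is uniform play by both players.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    uniform = np.array([0.5, 0.5])
    pure = np.array([1.0, 0.0])

    print(exploitability(A, uniform, uniform))  # gap at the equilibrium
    print(exploitability(A, pure, uniform))     # gap when the row player is predictable

    rng = np.random.default_rng(0)
    print(bandit_round(A, uniform, uniform, rng))
```

In an uncoupled setting, each player would see only their own sampled action and the returned payoff, never the matrix `A` or the opponent's strategy; `exploitability` is the evaluation metric an outside observer would apply to the last iterate.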

Problem

Research questions and friction points this paper is trying to address.

zero-sum games

bandit feedback

uncoupled learning

last-iterate convergence

Nash equilibrium

Innovation

Methods, ideas, or system contributions that make the work stand out.

last-iterate convergence

bandit feedback

uncoupled learning