Logarithmic Regret for Matrix Games against an Adversary with Noisy Bandit Feedback

📅 2023-06-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
This paper studies online learning for the row player in zero-sum matrix games under noisy feedback, aiming to minimize Nash regret. Existing algorithms, such as EXP3 and UCB, achieve only $O(1/\sqrt{T})$ accuracy in estimating the Nash value, and their average Nash regret is fundamentally lower-bounded by $\Omega(1/\sqrt{T})$. We propose a novel algorithm that, for the first time, attains $O(\mathrm{polylog}(T)/T)$ Nash regret in $2 \times 2$ games. Our method integrates game-structure-aware policy design, an enhanced upper-confidence-bound (UCB) mechanism, and a synergistic estimation–control framework. We provide a rigorous theoretical analysis proving logarithmic-rate convergence of the Nash regret. Empirical results demonstrate substantial improvements over EXP3 and UCB baselines. Furthermore, we formally prove that classical bandit-based algorithms cannot surpass the $\Omega(1/\sqrt{T})$ Nash regret lower bound, highlighting the necessity of our structural approach.
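For readers unfamiliar with the objective, the Nash regret referred to above is conventionally the gap between $T$ times the game's value and the row player's accumulated reward. The following is the standard formulation, not quoted from the paper:

```latex
% Value of the matrix game A (row player maximizes, column player minimizes):
V^{*} = \max_{x \in \Delta_n} \min_{y \in \Delta_m} x^{\top} A y

% Nash regret of the row player after T rounds, where (i_t, j_t) are the
% rows and columns played at round t:
\mathcal{R}_T = T \cdot V^{*} - \mathbb{E}\!\left[ \sum_{t=1}^{T} A_{i_t, j_t} \right]
```

In these terms, the summary's $O(\mathrm{polylog}(T)/T)$ and $\Omega(1/\sqrt{T})$ rates refer to the average regret $\mathcal{R}_T / T$.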
📝 Abstract
This paper considers a variant of zero-sum matrix games where at each timestep the row player chooses row $i$, the column player chooses column $j$, and the row player receives a noisy reward with mean $A_{i,j}$. The objective of the row player is to accumulate as much reward as possible, even against an adversarial column player. If the row player uses the EXP3 strategy, an algorithm known for obtaining $\sqrt{T}$ regret against an arbitrary sequence of rewards, it is immediate that the row player also achieves $\sqrt{T}$ regret relative to the Nash equilibrium in this game setting. However, partly motivated by the fact that the EXP3 strategy is myopic to the structure of the game, O'Donoghue et al. (2021) proposed a UCB-style algorithm that leverages the game structure and demonstrated that this algorithm greatly outperforms EXP3 empirically. While they showed that this UCB-style algorithm achieved $\sqrt{T}$ regret, in this paper we ask if there exists an algorithm that provably achieves $\mathrm{polylog}(T)$ regret against any adversary, analogous to results from stochastic bandits. We propose a novel algorithm that answers this question in the affirmative for the simple $2 \times 2$ setting, providing the first instance-dependent guarantees for games in the regret setting. Our algorithm overcomes two major hurdles: 1) obtaining logarithmic regret even though the Nash equilibrium is estimable only at a $1/\sqrt{T}$ rate, and 2) designing row-player strategies that guarantee that either the adversary provides information about the Nash equilibrium, or the row player incurs negative regret. Moreover, in the full information case we address the general $n \times m$ case where the first hurdle is still relevant. Finally, we show that EXP3 and the UCB-based algorithm necessarily cannot perform better than $\sqrt{T}$.
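The abstract's setting is easy to simulate. The sketch below runs classical EXP3 for the row player of a $2 \times 2$ game against a column player that best-responds to the row player's mixed strategy, with Gaussian noise on the bandit feedback. It is an illustration of the baseline the paper improves on, not the paper's algorithm; the payoff matrix, noise level, and horizon are assumptions chosen for the example.

```python
import math
import numpy as np

def exp3_2x2(A, T, rng):
    """Classical EXP3 for the row player of a matrix game with rewards
    in [0, 1]. The column player here best-responds to the row player's
    current mixed strategy, a simple stand-in for an adversary."""
    K = A.shape[0]
    # Standard exploration rate; the update exponent is gamma / K.
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))
    log_w = np.zeros(K)  # log-weights, for numerical stability
    total_reward = 0.0
    for _ in range(T):
        w = np.exp(log_w - log_w.max())
        probs = (1 - gamma) * w / w.sum() + gamma / K
        i = rng.choice(K, p=probs)
        # Adversary: column minimizing the row player's expected payoff.
        j = int(np.argmin(probs @ A))
        # Noisy bandit feedback with mean A[i, j], clipped back to [0, 1].
        r = float(np.clip(A[i, j] + 0.1 * rng.standard_normal(), 0.0, 1.0))
        total_reward += r
        # Importance-weighted reward estimate for the chosen row.
        log_w[i] += (gamma / K) * r / probs[i]
    return total_reward

rng = np.random.default_rng(0)
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])  # hypothetical payoffs; this game's Nash value is 0.5
T = 5000
nash_regret = T * 0.5 - exp3_2x2(A, T, rng)
```

Against a best-responding column player the row player's per-round expected reward cannot exceed the value $0.5$, so `nash_regret` is nonnegative in expectation and, per EXP3's guarantee, grows on the order of $\sqrt{T}$ rather than $\mathrm{polylog}(T)$.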
Problem

Research questions and friction points this paper is trying to address.

Study Nash regret minimization in noisy zero-sum matrix games
Analyze limitations of existing algorithms for Nash regret
Propose new algorithms achieving polylog(T) Nash regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studies Nash regret in noisy zero-sum games
Proves Ω(√T) regret for existing algorithms
Achieves polylog(T) regret with new algorithm
Arnab Maiti
PhD student, CSE, University of Washington
Learning Theory · Game Theory
Kevin G. Jamieson
University of Washington, WA
L. Ratliff
University of Washington, WA