Two-Player Zero-Sum Games with Bandit Feedback

📅 2025-06-17

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

This paper studies learning pure-strategy Nash equilibria in two-player zero-sum games under bandit feedback—where the row player observes only the payoff of the chosen action pair—and with an unknown payoff matrix and adversarial column player. We propose ETC-TPZSG, a novel algorithm based on the explore-then-commit (ETC) framework, and its adaptive elimination variant ETC-TPZSG-AE, which incorporates an action-pair adaptive elimination (AE) mechanism leveraging ε-Nash equilibrium properties to accelerate convergence. We establish, for the first time, instance-dependent expected regret upper bounds for ETC-type algorithms in zero-sum games: O(Δ + √T) for ETC-TPZSG and O(log(TΔ²)/Δ) for ETC-TPZSG-AE—thereby filling a key theoretical gap. The AE mechanism significantly improves convergence efficiency, achieving performance comparable to state-of-the-art methods while providing finer-grained, instance-specific characterization.

Technology Category

Application Category

📝 Abstract

We study a two-player zero-sum game (TPZSG) in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose and analyze two algorithms: ETC-TPZSG, which directly applies ETC to the TPZSG setting and ETC-TPZSG-AE, which improves upon it by incorporating an action pair elimination (AE) strategy that leverages the $varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair. Our objective is to demonstrate the applicability of ETC in a TPZSG setting by focusing on learning pure strategy Nash Equilibrium. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret for both algorithms, has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve an instance-dependent regret upper bounds of $O(Delta + sqrt{T})$ for ETC-TPZSG and $O(frac{log (T Delta^2)}{Delta})$ for ETC-TPZSG-AE, where $Delta$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insights through instance-dependent analysis.

Problem

Research questions and friction points this paper is trying to address.

Maximizing payoff in two-player zero-sum games with unknown matrix

Learning pure strategy Nash Equilibrium via bandit feedback

Deriving instance-dependent regret bounds for ETC-based algorithms

Innovation

Methods, ideas, or system contributions that make the work stand out.

ETC-TPZSG applies ETC to zero-sum games

ETC-TPZSG-AE uses action elimination strategy

Derives instance-dependent regret upper bounds

🔎 Similar Papers

A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence