🤖 AI Summary
This paper studies learning pure-strategy Nash equilibria in two-player zero-sum games under bandit feedback—where the row player observes only the payoff of the chosen action pair—and with an unknown payoff matrix and adversarial column player. We propose ETC-TPZSG, a novel algorithm based on the explore-then-commit (ETC) framework, and its adaptive elimination variant ETC-TPZSG-AE, which incorporates an action-pair adaptive elimination (AE) mechanism leveraging ε-Nash equilibrium properties to accelerate convergence. We establish, for the first time, instance-dependent expected regret upper bounds for ETC-type algorithms in zero-sum games: O(Δ + √T) for ETC-TPZSG and O(log(TΔ²)/Δ) for ETC-TPZSG-AE—thereby filling a key theoretical gap. The AE mechanism significantly improves convergence efficiency, achieving performance comparable to state-of-the-art methods while providing finer-grained, instance-specific characterization.
📝 Abstract
We study a two-player zero-sum game (TPZSG) in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose and analyze two algorithms: ETC-TPZSG, which directly applies the explore-then-commit (ETC) framework to the TPZSG setting, and ETC-TPZSG-AE, which improves upon it by incorporating an action-pair elimination (AE) strategy that leverages the $\varepsilon$-Nash equilibrium property to efficiently select the optimal action pair. Our objective is to demonstrate the applicability of ETC in a TPZSG setting by focusing on learning pure-strategy Nash equilibria. A key contribution of our work is the derivation of instance-dependent upper bounds on the expected regret for both algorithms, a type of analysis that has received limited attention in the literature on zero-sum games. In particular, after $T$ rounds, we achieve instance-dependent regret upper bounds of $O(\Delta + \sqrt{T})$ for ETC-TPZSG and $O\left(\frac{\log (T \Delta^2)}{\Delta}\right)$ for ETC-TPZSG-AE, where $\Delta$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in adversarial game settings, achieving regret bounds comparable to existing methods while providing insights through instance-dependent analysis.
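To make the explore-then-commit idea concrete, the following is a minimal sketch of an ETC-style scheme for a zero-sum matrix game under bandit feedback. It is an illustration of the general ETC template, not the paper's exact ETC-TPZSG algorithm: the function name, the Gaussian noise model, the exploration length `m`, and the commit rule (maximin row of the estimated matrix plus the adversary's best response) are all illustrative assumptions.

```python
import numpy as np


def etc_tpzsg_sketch(payoff, T, m, seed=None):
    """Illustrative ETC sketch for a zero-sum matrix game (not the
    paper's exact algorithm).

    payoff : true n x k payoff matrix (row player maximizes)
    T      : horizon (total number of rounds)
    m      : samples per action pair in the explore phase (assumed parameter)
    """
    rng = np.random.default_rng(seed)
    n, k = payoff.shape
    est = np.zeros((n, k))
    t = 0
    # Explore phase: play every action pair m times; bandit feedback means
    # only the (noisy) payoff of the chosen pair is observed each round.
    for i in range(n):
        for j in range(k):
            for _ in range(m):
                if t >= T:
                    break
                est[i, j] += payoff[i, j] + rng.normal(0.0, 1.0)
                t += 1
    est /= max(m, 1)
    # Commit phase: the row player plays the maximin row of the estimated
    # matrix; an adversarial column player best-responds to it. If the
    # estimated matrix has a pure-strategy saddle point, this pair is it.
    i_star = int(np.argmax(est.min(axis=1)))  # row player's maximin action
    j_star = int(np.argmin(est[i_star]))      # column player's best response
    return i_star, j_star, est
```

With a payoff matrix whose pure-strategy saddle point has a large suboptimality gap, even a modest exploration budget identifies the equilibrium pair; the paper's AE variant improves on this uniform-exploration template by eliminating clearly suboptimal action pairs early instead of sampling all pairs equally.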