A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

📅 2024-08-01
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the convergence of policy gradient methods in two-player zero-sum imperfect-information extensive-form games (EFGs). To overcome the limitations of conventional approaches—namely, their reliance on counterfactual value estimation and importance sampling—we propose the first entropy-regularized policy gradient framework designed specifically for EFGs, requiring neither explicit counterfactual value estimation nor importance-weight corrections. Theoretically, we establish the first convergence guarantee for such a method under self-play, proving best-iterate convergence to a regularized Nash equilibrium. Empirically, we validate its stable convergence and sample efficiency on standard EFG benchmarks. Our core contribution bridges a long-standing theoretical gap in policy gradient analysis for imperfect-information games, delivering a provably convergent, model-free, end-to-end learning paradigm for EFGs.

📝 Abstract
Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically-sound avoidance of importance sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it is still unknown whether the same desiderata can be guaranteed while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating counterfactual values (as opposed to Q values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to provable best-iterate convergence to a regularized Nash equilibrium in self-play.
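To make the idea concrete, here is a minimal toy sketch of entropy-regularized policy gradient self-play, using the natural-gradient (multiplicative-weights) form on a zero-sum normal-form game rather than the paper's EFG algorithm. The game, step size, and entropy weight are illustrative assumptions; the point is that the iterates themselves converge to the regularized equilibrium, with no importance sampling or counterfactual values.

```python
import numpy as np

# Toy analogue of the paper's setting (NOT its EFG method): entropy-regularized
# natural policy gradient in self-play on rock-paper-scissors.
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])  # row player's payoff matrix

tau, eta = 0.1, 0.1             # entropy weight, step size (illustrative)
x = np.array([0.8, 0.1, 0.1])   # row player's policy
y = np.array([0.1, 0.1, 0.8])   # column player's policy

for _ in range(2000):
    # Per-action value feedback; -tau*log(pi) is the entropy-regularization gradient.
    gx = A @ y - tau * np.log(x)
    gy = -A.T @ x - tau * np.log(y)
    # Multiplicative-weights update = natural policy gradient under softmax.
    x = x * np.exp(eta * gx); x /= x.sum()
    y = y * np.exp(eta * gy); y /= y.sum()

# For any tau > 0, the regularized Nash equilibrium of RPS is uniform play,
# and the last iterate converges to it.
print(np.round(x, 3), np.round(y, 3))
```

In this sketch the entropy term makes the update a contraction toward the regularized equilibrium, which is why the iterates (not just their average) converge; the paper establishes the analogous best-iterate guarantee in the much harder imperfect-information extensive-form setting.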
Problem

Research questions and friction points this paper is trying to address.

Policy gradient methods in imperfect-information games
Convergence to Nash equilibrium in EFGs
Avoiding counterfactual value approximations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy gradient for imperfect-information games
Best-iterate convergence guarantee
Regularized Nash equilibrium achievement
Mingyang Liu
LIDS, EECS, Massachusetts Institute of Technology
Gabriele Farina
Assistant Professor of Computer Science, MIT
Computational Game Theory · Optimization · Economics and Computation
A. Ozdaglar
LIDS, EECS, Massachusetts Institute of Technology