Reevaluating Policy Gradient Methods for Imperfect-Information Games

📅 2025-02-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
The prevailing paradigm holds that solving imperfect-information games (IIGs) requires game-theoretic frameworks such as fictitious play (FP), double oracle (DO), or counterfactual regret minimization (CFR), while policy gradient methods like PPO are widely presumed ineffective. Method: The authors systematically reevaluate policy gradient algorithms in IIGs, releasing the first broadly accessible toolkit for exact exploitability computation in four large games, and conduct the largest-scale deep reinforcement learning (DRL) exploitability benchmark to date, comprising over 5,600 training runs. Contribution/Results: Contrary to established assumptions, FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods across all benchmarks. Two fully reproducible benchmark suites are open-sourced, establishing a fair, verifiable, and large-scale evaluation standard for exploitability-aware assessment of learning-based IIG solvers.

📝 Abstract
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP, DO, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, FP, DO, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel.
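Exploitability is the metric everything above hinges on, so a concrete definition may help: it sums, over both players, the gain each could obtain by deviating to a best response against the current strategy profile, and it is zero exactly at a Nash equilibrium. Below is a minimal sketch for a two-player zero-sum matrix game, assuming numpy; the example game and all names are illustrative and not taken from the paper's toolkit, which computes the same quantity exactly over large extensive-form game trees via a best-response traversal of information sets rather than a single matrix product.

```python
# Minimal sketch of exact exploitability (NashConv) in a two-player
# zero-sum matrix game. Illustrative only; not the paper's toolkit.
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def exploitability(x, y):
    """Sum of both players' best-response gains; zero iff (x, y) is a Nash equilibrium."""
    value = x @ A @ y             # row player's expected payoff under (x, y)
    br_row = np.max(A @ y)        # row player's best-response value against y
    br_col = np.max(-(A.T @ x))   # column player's payoffs are -A^T
    return (br_row - value) + (br_col + value)

uniform = np.ones(3) / 3
biased = np.array([0.5, 0.25, 0.25])     # over-plays rock
print(exploitability(uniform, uniform))  # 0.0: uniform play is the equilibrium
print(exploitability(biased, uniform))   # 0.25: the bias can be exploited
```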
Problem

Research questions and friction points this paper is trying to address.

Evaluate policy gradient methods in imperfect-information games.
Compare DRL algorithms using exact exploitability computations.
Assess the performance of generic policy gradient methods versus FP-, DO-, and CFR-based approaches.
Innovation

Methods, ideas, or system contributions that make the work stand out.

PPO competitive with or superior to FP-, DO-, and CFR-based methods
First broadly accessible exact exploitability computations released for four large games
Largest-ever exploitability comparison of DRL algorithms (5,600+ training runs)
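For context on the "generic policy gradient methods" above: the central baseline is PPO, whose core update is the clipped surrogate objective. The sketch below, assuming PyTorch, shows that objective in isolation; tensor names and the clip coefficient are illustrative defaults, not the paper's configuration.

```python
# Minimal sketch of PPO's clipped surrogate loss, assuming PyTorch.
# All names and the default clip coefficient are illustrative.
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """L(theta) = -E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
    ratio = torch.exp(new_logp - old_logp)  # importance ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated: optimizers minimize

# Toy usage: two sampled actions with their log-probs and advantage estimates.
new_logp = torch.tensor([-0.9, -0.4], requires_grad=True)
old_logp = torch.tensor([-1.2, -0.3])
adv = torch.tensor([1.0, -2.0])
ppo_clip_loss(new_logp, old_logp, adv).backward()  # gradients flow to new_logp
```

The clamp bounds how far each update can move from the data-collecting policy, which is plausibly why such a simple self-play learner remains stable in these games.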
👥 Authors
Max Rudolph
University of Texas at Austin
Nathan Lichtlé
PhD Student, UC Berkeley
Reinforcement Learning, Deep Learning, Multi-Agent Systems, Control, Traffic Optimization
Sobhan Mohammadpour
Massachusetts Institute of Technology
Alexandre M. Bayen
University of California, Berkeley
J. Z. Kolter
Carnegie Mellon University
Amy Zhang
University of Texas at Austin
Gabriele Farina
Assistant Professor of Computer Science, MIT
Computational Game Theory, Optimization, Economics and Computation
Eugene Vinitsky
Assistant Professor, NYU
Reinforcement Learning, Autonomous Vehicles, Multi-agent Systems, Control
Samuel Sokota
Carnegie Mellon University