🤖 AI Summary
This work addresses the challenge of constructing a small yet effective strategy set that accurately approximates the Nash equilibrium of a two-player zero-sum game under a limited computational budget. Within the Policy-Space Response Oracles (PSRO) paradigm, the authors propose a two-phase exploration-selection framework that explicitly optimizes Population Exploitability (PE) as the objective guiding expansion of the strategy set. Their algorithm, Global PSRO, uses parameter-sharing conditional neural networks to generate candidate strategies efficiently and to estimate how much each candidate would reduce the population's PE. Experiments across multiple two-player zero-sum games show that the approach reaches lower exploitability in significantly fewer policy iterations than prior PSRO methods, and thus approximates the Nash equilibrium more closely.
📝 Abstract
The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under a limited computational budget, a small strategy population whose induced game approximates the full game well. Existing PSRO variants typically expand the population with best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that yield little global improvement. We propose to guide population expansion by directly evaluating the quality of the population after expansion. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration-selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
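To make the selection criterion concrete, the following is a minimal tabular sketch, not the paper's Global PSRO implementation. The abstract does not spell out the PE formula, so the sketch assumes a common definition: half the gap between the best worst-case value the column population can hold the row player to and the best worst-case value the row population can secure, both measured against every pure strategy of the full game. The function names (`simplex_grid`, `row_guarantee`, `population_exploitability`, `select_expansion`), the Rock-Paper-Scissors payoff matrix, the shared population for both players, and the grid search (used here in place of an exact solver or the neural estimators described above) are all illustrative assumptions.

```python
from itertools import product

# Row player's payoffs in Rock-Paper-Scissors (the column player receives the negative).
RPS = [
    [0, -1, 1],   # Rock
    [1, 0, -1],   # Paper
    [-1, 1, 0],   # Scissors
]


def simplex_grid(k, steps):
    """Yield mixtures over k strategies whose weights are multiples of 1/steps."""
    for counts in product(range(steps + 1), repeat=k - 1):
        rest = steps - sum(counts)
        if rest >= 0:
            yield [c / steps for c in counts] + [rest / steps]


def row_guarantee(payoff, rows, steps=60):
    """Approximate the best worst-case payoff the row player can secure by
    mixing over `rows` only, against every pure column of the full game."""
    n_cols = len(payoff[0])
    best = float("-inf")
    for mix in simplex_grid(len(rows), steps):
        worst = min(
            sum(w * payoff[r][c] for w, r in zip(mix, rows))
            for c in range(n_cols)
        )
        best = max(best, worst)
    return best


def population_exploitability(payoff, rows, cols, steps=60):
    """Assumed PE definition: 0 means the populations already contain an
    unexploitable mixture; larger values mean a worse restricted game."""
    # The column player's payoff matrix is the negated transpose.
    col_payoff = [
        [-payoff[r][c] for r in range(len(payoff))]
        for c in range(len(payoff[0]))
    ]
    row_value = row_guarantee(payoff, rows, steps)
    col_value = row_guarantee(col_payoff, cols, steps)
    return -0.5 * (row_value + col_value)


def select_expansion(payoff, population, candidates, steps=60):
    """Selection phase (sketch): score each candidate by the PE of the
    population after adding it, and return the lowest-scoring candidate."""
    scored = {
        c: population_exploitability(payoff, population + [c], population + [c], steps)
        for c in candidates
    }
    return min(scored, key=scored.get), scored
```

In this toy game a population containing only Rock has a PE of 1, adding either Paper or Scissors lowers it to about 1/3, and the full population reaches 0. A call such as `select_expansion(RPS, [0], [1, 2])` returns the chosen candidate together with the per-candidate scores. The grid search scales exponentially with population size, so it only suits tiny examples like this one.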