🤖 AI Summary
In nontransitive zero-sum games, the Policy Space Response Oracles (PSRO) algorithm suffers from slow Nash equilibrium (NE) convergence and high exploitability due to conventional best-response (BR) initialization, which either starts each new policy from scratch or inherits a single historical policy.
Method: This paper proposes a Nash strategy fusion mechanism that leverages cumulative enhancement of historical strategies. Instead of standard BR initialization, it constructs an implicit meta-Nash-guided policy via meta-game analysis, fusing the historical policy population with a weighted moving average whose weights are adjusted by the Meta-NE at each iteration. The approach unifies PSRO, meta-game analysis, and RL-inspired strategy-initialization principles in a single framework.
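As a rough illustration of the fusion step described above (a hypothetical tabular sketch, not the paper's implementation, which operates on neural-network policies), fusing a policy population by meta-Nash weights might look like:

```python
import numpy as np

def meta_nash_fusion(policies, meta_ne_weights):
    """Fuse a population of tabular policies into one BR-initialization policy.

    Illustrative assumption: each policy is an (n_states, n_actions) array of
    action probabilities, and the weights are the current meta-Nash
    distribution over the population.
    """
    policies = np.asarray(policies, dtype=float)      # shape (k, S, A)
    w = np.asarray(meta_ne_weights, dtype=float)
    w = w / w.sum()                                   # normalize the meta-NE weights
    fused = np.tensordot(w, policies, axes=1)         # weighted mixture over the k policies
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalize each state's row

# Example: two 1-state, 2-action policies mixed with meta-NE weights (0.75, 0.25)
p1 = [[0.8, 0.2]]
p2 = [[0.2, 0.8]]
fused = meta_nash_fusion([p1, p2], [0.75, 0.25])
# fused == [[0.65, 0.35]]
```

Because the weights track the Meta-NE at each iteration, policies that matter more in the current meta-game contribute more to the next BR's starting point.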
Results: Evaluated on canonical benchmarks, the method significantly reduces exploitability and achieves faster convergence and higher NE approximation accuracy than existing BR initialization strategies. It enhances both the efficiency of policy-space search and the stability of computed equilibria.
📝 Abstract
For solving zero-sum games involving non-transitivity, a useful approach is to maintain a policy population to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracles (PSRO) algorithm is an effective framework for solving such games. However, current methods initialize a new policy for Best Response (BR) training either from scratch or by inheriting a single historical policy, missing the opportunity to leverage past policies to generate a better BR. In this paper, we propose Fusion-PSRO, which employs Nash Policy Fusion to initialize a new policy for BR training. Nash Policy Fusion serves as an implicit guiding policy that starts exploration from the current Meta-NE, thus providing a closer approximation to the BR. Moreover, it computes a weighted moving average of past policies, dynamically adjusting the weights according to the Meta-NE at each iteration. This cumulative process further enhances the policy population. Empirical results on classic benchmarks show that Fusion-PSRO achieves lower exploitability, thereby addressing the limitations of prior policy-initialization schemes for BR.