🤖 AI Summary
In nontransitive zero-sum games, the Policy Space Response Oracles (PSRO) algorithm suffers from slow Nash equilibrium (NE) convergence and high exploitability due to conventional best-response (BR) initialization, which either starts each new policy from scratch or inherits a single historical policy.
Method: This paper proposes a Nash strategy fusion mechanism that leverages cumulative enhancement of historical strategies. Instead of standard BR initialization, it constructs an implicit meta-Nash-guided policy via meta-game analysis, fusing the historical policy population with a weighted moving average whose weights are adjusted by the Meta-NE at each iteration. The approach unifies PSRO, meta-game analysis, and RL-inspired strategy-initialization principles in a single framework.
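As a rough illustration of the fusion step described above (a hypothetical tabular sketch, not the paper's implementation, which operates on neural-network policies), fusing a policy population by meta-Nash weights might look like:

```python
import numpy as np

def meta_nash_fusion(policies, meta_ne_weights):
    """Fuse a population of tabular policies into one BR-initialization policy.

    Illustrative assumption: each policy is an (n_states, n_actions) array of
    action probabilities, and the weights are the current meta-Nash
    distribution over the population.
    """
    policies = np.asarray(policies, dtype=float)      # shape (k, S, A)
    w = np.asarray(meta_ne_weights, dtype=float)
    w = w / w.sum()                                   # normalize the meta-NE weights
    fused = np.tensordot(w, policies, axes=1)         # weighted mixture over the k policies
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalize each state's row

# Example: two 1-state, 2-action policies mixed with meta-NE weights (0.75, 0.25)
p1 = [[0.8, 0.2]]
p2 = [[0.2, 0.8]]
fused = meta_nash_fusion([p1, p2], [0.75, 0.25])
# fused == [[0.65, 0.35]]
```

Because the weights track the Meta-NE at each iteration, policies that matter more in the current meta-game contribute more to the next BR's starting point.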
Results: Evaluated on canonical benchmarks, the method significantly reduces exploitability and achieves faster convergence and higher NE approximation accuracy than existing BR initialization strategies. It enhances both the efficiency of policy-space search and the stability of computed equilibria.
📝 Abstract
For solving zero-sum games involving non-transitivity, a useful approach is to maintain a policy population to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracles (PSRO) algorithm is an effective framework for solving such games. However, current methods initialize a new policy for Best Response (BR) training either from scratch or by inheriting a single historical policy, missing the opportunity to leverage past policies to generate a better BR. In this paper, we propose Fusion-PSRO, which employs Nash Policy Fusion to initialize a new policy for BR training. Nash Policy Fusion serves as an implicit guiding policy that starts exploration from the current Meta-NE, thus providing a closer approximation to the BR. Moreover, it computes a weighted moving average of past policies, dynamically adjusting the weights according to the Meta-NE at each iteration. This cumulative process further enhances the policy population. Empirical results on classic benchmarks show that Fusion-PSRO achieves lower exploitability, thereby addressing the limitations of prior policy-initialization schemes for BR.