🤖 AI Summary
In Pursuit-Evasion Games (PEGs), dynamic graph structures hinder policy generalization and necessitate frequent fine-tuning. Method: This paper proposes the first reinforcement learning framework enabling cross-graph zero-shot transfer. It introduces Nash equilibrium policies as supervisory signals for training, integrates sequence modeling for joint policy decomposition, and designs distance-based graph-invariant features alongside an equilibrium-inspired heuristic to enhance multi-agent scalability. Contributions/Results: (1) Achieves zero-shot generalization to unseen graph topologies and exit configurations; (2) On real-world graph datasets, the pursuit policy attains zero-shot performance comparable to that of state-of-the-art (SOTA) methods *after* fine-tuning, eliminating the need for task-specific adaptation and significantly reducing deployment overhead.
📝 Abstract
Equilibrium learning in adversarial games is an important topic widely studied in game theory and reinforcement learning (RL). The pursuit-evasion game (PEG), an important class of real-world games arising in robotics and security, requires exponential time to solve exactly. When the underlying graph structure varies, even state-of-the-art RL methods require recomputation, or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework that learns a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework applies to both the pursuer and evader sides, in both no-exit and multi-exit scenarios; to our knowledge, it is the first in this domain to offer both forms of generality. The core idea of EPG is to train an RL policy across different graph structures against the equilibrium policy of each individual graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates a pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability in the number of pursuers, we further extend the DP and RL components with a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance together with a distance feature designed for cross-graph PEG training, the EPG framework achieves strong zero-shot performance on various unseen real-world graphs. Moreover, when trained with an equilibrium heuristic designed for graphs with exits, our generalized pursuer policy can even match the performance of fine-tuned policies from state-of-the-art PEG methods.
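The abstract does not spell out the distance feature, but one natural graph-invariant choice is shortest-path hop distances between agents, computed by breadth-first search: such distances depend only on the graph's connectivity, not on node labels or graph size. The sketch below illustrates this idea; the function names and the feature layout (evader-to-pursuer and evader-to-exit distances) are our own illustrative assumptions, not the paper's definition.

```python
from collections import deque

def shortest_path_distances(adj, source):
    """BFS hop distances from `source` on an unweighted graph.

    `adj` maps each node to a list of neighbors. Unreachable nodes
    keep distance float('inf').
    """
    dist = {v: float('inf') for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float('inf'):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_features(adj, evader, pursuers, exits=()):
    """Illustrative graph-invariant feature vector: the evader's hop
    distance to each pursuer and to each exit. The same feature
    definition applies unchanged on any graph topology."""
    dist = shortest_path_distances(adj, evader)
    return [dist[p] for p in pursuers] + [dist[e] for e in exits]
```

Because the feature vector is expressed purely in terms of distances, a policy consuming it can, in principle, be evaluated zero-shot on an unseen graph: only the adjacency structure changes, not the feature semantics.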