AI Summary
In zero-shot coordination (ZSC), two core bottlenecks impede progress: insufficient partner-agent diversity and inefficient cross-play minimization (XPM) training, which relies on costly multi-trajectory environment sampling and requires independent, from-scratch training for each partner. To address these, we propose XPM-WM, the first framework to integrate a world model (comprising a VAE and an RSSM) into XPM. It replaces expensive real-environment trajectory sampling with model-generated synthetic trajectories, eliminating the need for multi-trajectory collection. Moreover, a shared, reusable dynamics model drives the training of diverse partner policies, obviating redundant per-partner training. Evaluated on the SP benchmark, XPM-WM matches state-of-the-art performance in ZSC success rate and population training reward while improving sample efficiency by 3.2× and enabling efficient generation of partner agents at scale (up to hundreds).
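The core efficiency gain comes from rolling trajectories out inside the learned dynamics model rather than the real environment. The following minimal Python sketch illustrates that idea under stated assumptions: `policy`, `transition`, and `reward_model` are stand-in stubs, not the paper's actual RSSM components.

```python
# Illustrative sketch: generating a synthetic trajectory purely from a
# learned latent dynamics model, with no environment interaction.
# All function names here are hypothetical placeholders.

def imagine_trajectory(init_state, policy, transition, reward_model, horizon):
    """Roll out `horizon` steps in latent space using the learned model."""
    state, trajectory = init_state, []
    for _ in range(horizon):
        action = policy(state)                  # partner policy acts on latent state
        next_state = transition(state, action)  # learned dynamics, no env.step()
        reward = reward_model(next_state)       # predicted (not real) reward
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Because the dynamics model is shared across the population, each new partner can be trained on such imagined rollouts without re-collecting its own environment data.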
Abstract
A major bottleneck in training Zero-Shot Coordination (ZSC) agents is generating partner agents with diverse collaborative conventions. Current Cross-play Minimization (XPM) methods for population generation can be computationally expensive and sample-inefficient, as the training objective requires sampling multiple types of trajectories. Each partner agent in the population is also trained from scratch, even though all partners learn policies for the same coordination task. In this work, we propose that simulated trajectories from a learned dynamics model of the environment can drastically speed up training for XPM methods. We introduce XPM-WM, a framework for generating simulated trajectories for XPM via a learned World Model (WM). We show that XPM with simulated trajectories removes the need to sample multiple types of trajectories. In addition, we show that our method effectively generates partners with diverse conventions, matching previous methods in SP population training reward as well as in training partners for ZSC agents. Our method is thus significantly more sample efficient and scales to a larger number of partners.
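For readers unfamiliar with cross-play minimization, the objective can be sketched as follows: each new partner maximizes its self-play return while minimizing its cross-play return against previously trained partners, which pushes it toward a distinct convention. This is a minimal illustration under assumed names; the penalty weight `alpha` and the simple mean-penalty form are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of an XPM-style objective for partner i.
# sp_return: expected return of partner i paired with itself (self-play).
# xp_returns: expected returns of partner i paired with each earlier partner.
# alpha: assumed weight on the cross-play penalty.

def xpm_objective(sp_return, xp_returns, alpha=1.0):
    """Self-play return minus a weighted mean cross-play return."""
    xp_penalty = sum(xp_returns) / len(xp_returns) if xp_returns else 0.0
    return sp_return - alpha * xp_penalty
```

Estimating both the self-play and cross-play terms is what forces standard XPM to sample multiple types of trajectories per update; replacing those samples with world-model rollouts is the efficiency lever the paper targets.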