AI Summary
In zero-shot coordination (ZSC), two core bottlenecks impede progress: insufficient partner-agent diversity and inefficient cross-play minimization (XPM) training, which relies on costly multi-trajectory environment sampling and requires independent, from-scratch training for each partner. To address these, we propose XPM-WM, the first framework to integrate a world model (comprising a VAE and an RSSM) into XPM. It replaces expensive real-environment trajectory sampling with model-generated synthetic trajectories, eliminating the need for multi-trajectory collection. Moreover, a shared, reusable dynamics model drives the training of diverse partner policies, obviating redundant per-partner training. Evaluated on the SP benchmark, XPM-WM matches state-of-the-art performance in ZSC success rate and population training reward while improving sample efficiency by 3.2× and enabling efficient generation of partner agents at scale (up to hundreds).
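The core efficiency gain comes from rolling trajectories out inside the learned dynamics model rather than the real environment. The following minimal Python sketch illustrates that idea under stated assumptions: `policy`, `transition`, and `reward_model` are stand-in stubs, not the paper's actual RSSM components.

```python
# Illustrative sketch: generating a synthetic trajectory purely from a
# learned latent dynamics model, with no environment interaction.
# All function names here are hypothetical placeholders.

def imagine_trajectory(init_state, policy, transition, reward_model, horizon):
    """Roll out `horizon` steps in latent space using the learned model."""
    state, trajectory = init_state, []
    for _ in range(horizon):
        action = policy(state)                  # partner policy acts on latent state
        next_state = transition(state, action)  # learned dynamics, no env.step()
        reward = reward_model(next_state)       # predicted (not real) reward
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Because the dynamics model is shared across the population, each new partner can be trained on such imagined rollouts without re-collecting its own environment data.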
Abstract
A major bottleneck in training Zero-Shot Coordination (ZSC) agents is generating partner agents with diverse collaborative conventions. Current Cross-play Minimization (XPM) methods for population generation can be computationally expensive and sample-inefficient, as the training objective requires sampling multiple types of trajectories. Each partner agent in the population is also trained from scratch, even though all partners learn policies for the same coordination task. In this work, we propose that simulated trajectories from a learned dynamics model of the environment can drastically speed up training for XPM methods. We introduce XPM-WM, a framework for generating simulated trajectories for XPM via a learned World Model (WM). We show that XPM with simulated trajectories removes the need to sample multiple types of trajectories. In addition, we show that our method effectively generates partners with diverse conventions, matching previous methods in SP population training reward as well as in training partners for ZSC agents. Our method is thus significantly more sample efficient and scales to a larger number of partners.
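For readers unfamiliar with cross-play minimization, the objective can be sketched as follows: each new partner maximizes its self-play return while minimizing its cross-play return against previously trained partners, which pushes it toward a distinct convention. This is a minimal illustration under assumed names; the penalty weight `alpha` and the simple mean-penalty form are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of an XPM-style objective for partner i.
# sp_return: expected return of partner i paired with itself (self-play).
# xp_returns: expected returns of partner i paired with each earlier partner.
# alpha: assumed weight on the cross-play penalty.

def xpm_objective(sp_return, xp_returns, alpha=1.0):
    """Self-play return minus a weighted mean cross-play return."""
    xp_penalty = sum(xp_returns) / len(xp_returns) if xp_returns else 0.0
    return sp_return - alpha * xp_penalty
```

Estimating both the self-play and cross-play terms is what forces standard XPM to sample multiple types of trajectories per update; replacing those samples with world-model rollouts is the efficiency lever the paper targets.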