🤖 AI Summary
Multi-agent self-play in continuous decision spaces often suffers from slow or failed convergence to Nash equilibria and limited policy generalization. To address this, we propose DiffFP, a framework that introduces diffusion models into fictitious play (FP), using generative modeling to learn multimodal best-response policies. The diffusion process captures the uncertainty in strategy distributions and supports end-to-end training. In continuous zero-sum settings, including racing and multi-particle games, DiffFP converges toward ε-Nash equilibria. Empirically, it achieves up to 3× faster convergence and 30× higher success rates on average than reinforcement learning baselines, while also improving policy robustness, diversity, and adaptability to unseen opponents, thereby addressing key limitations of conventional FP and RL approaches in continuous multi-agent learning.
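For reference, the ε-Nash condition mentioned above can be stated as follows. This is the standard definition for a two-player zero-sum game, not notation taken from the paper; here $J(\pi_1, \pi_2)$ denotes player 1's expected payoff, which player 1 maximizes and player 2 minimizes:

```latex
% A profile (π₁*, π₂*) is an ε-Nash equilibrium if no unilateral
% deviation improves either player's payoff by more than ε:
\max_{\pi_1} J(\pi_1, \pi_2^{*}) - \varepsilon
  \;\le\; J(\pi_1^{*}, \pi_2^{*})
  \;\le\; \min_{\pi_2} J(\pi_1^{*}, \pi_2) + \varepsilon
```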
📝 Abstract
Self-play reinforcement learning has demonstrated significant success in learning complex strategic and interactive behaviors in competitive multi-agent games. However, achieving such behaviors in continuous decision spaces remains challenging: adaptability and generalization are critical for competitive performance in dynamic multi-agent environments, yet existing methods often converge slowly, or fail to converge at all, to a Nash equilibrium, leaving agents vulnerable to strategic exploitation by unseen opponents. To address these challenges, we propose DiffFP, a fictitious play (FP) framework that estimates the best response to unseen opponents while learning a robust and multimodal behavioral policy. Specifically, we approximate the best response using a diffusion policy that leverages generative modeling to learn adaptive and diverse strategies. Through empirical evaluation, we demonstrate that the proposed FP framework converges towards $\varepsilon$-Nash equilibria in continuous-space zero-sum games. We validate our method on complex multi-agent environments, including racing and multi-particle zero-sum games. Simulation results show that the learned policies are robust against diverse opponents and outperform baseline reinforcement learning policies. Our approach achieves up to $3\times$ faster convergence and $30\times$ higher success rates on average against RL-based baselines, demonstrating its robustness to opponent strategies and stability across training iterations.
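The paper itself includes no code; as a minimal sketch of the fictitious-play backbone that DiffFP builds on, the snippet below runs classic discrete fictitious play on rock-paper-scissors, where each player best-responds to the opponent's empirical action frequencies. DiffFP replaces this exact best-response computation with a learned diffusion policy over continuous actions; the game, function names, and parameters here are illustrative, not from the paper.

```python
import numpy as np

# Payoff matrix for the row player in rock-paper-scissors (zero-sum):
# rows/cols are (rock, paper, scissors); the column player receives -A.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def fictitious_play(A: np.ndarray, iterations: int = 10_000):
    """Classic discrete fictitious play: each player best-responds to the
    opponent's empirical action frequencies. In zero-sum games, the
    empirical mixed strategies converge to a Nash equilibrium."""
    n, m = A.shape
    row_counts = np.zeros(n)
    col_counts = np.zeros(m)
    # Arbitrary initial actions to seed the empirical distributions.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(iterations):
        # Row player maximizes expected payoff against the column mix;
        # column player minimizes it against the row mix.
        row_br = np.argmax(A @ (col_counts / col_counts.sum()))
        col_br = np.argmin((row_counts / row_counts.sum()) @ A)
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

row_strategy, col_strategy = fictitious_play(A)
print(row_strategy, col_strategy)  # both approach the uniform (1/3, 1/3, 1/3) equilibrium
```

In matrix games the best response can be found by enumerating actions, which is exactly what breaks down in continuous decision spaces; DiffFP's contribution, per the abstract, is to approximate that best-response step with a generative diffusion policy while keeping the fictitious-play averaging structure.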