🤖 AI Summary
This work addresses the challenge of decentralized collaborative planning for embodied multi-agent systems under partial observability, where agents rely solely on egocentric visual observations. We propose a composable generative world model framework that explicitly decomposes and composes joint multi-agent actions, integrating implicit world-state estimation with vision-language model (VLM)-driven inference of other agents' behaviors, which enables online, scalable coordination for an arbitrary number of agents. Technically, the framework unifies compositional video prediction, partially observable state reconstruction, and tree-search-based planning. Evaluated on three challenging embodied collaboration benchmarks with 2-4 agents, the method substantially improves cooperative efficiency and generalizes to unseen agent compositions and diverse tasks. These results demonstrate the framework's scalability, adaptability, and practical utility for real-world embodied multi-agent coordination.
📝 Abstract
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To plan effectively in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions, given only partial egocentric visual observations of the world. To address this partial observability, we first train generative models to estimate the overall world state from partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating video conditioned on the world state. Leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we use a tree search procedure to integrate these modules and facilitate online cooperative planning. We evaluate our method on three challenging benchmarks with 2-4 agents. The results show that the compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different partners, across various tasks, and with an arbitrary number of agents, demonstrating the promise of our approach. More videos can be found at https://embodied-agi.cs.umass.edu/combo/.
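To make the planning loop concrete, here is a minimal toy sketch of the two ideas the abstract combines: per-agent action effects predicted independently and then composed into one joint transition (standing in for the compositional video world model), and a depth-limited tree search over joint actions on top of that simulator. Everything here is a simplification for illustration only; the states, agent names (`alice`, `bob`), action set, and cost function are hypothetical stand-ins, and the real framework operates on generated video and VLM-inferred actions rather than symbolic states.

```python
from itertools import product

# Toy world state: each agent's position on a 1-D line.

def predict_effect(state, agent, action):
    """Per-agent dynamics: predict this agent's state delta in isolation
    (a stand-in for one factor of the compositional world model)."""
    delta = {"left": -1, "stay": 0, "right": +1}[action]
    return {agent: delta}

def compose(state, effects):
    """Compose independently predicted per-agent effects into a joint next state."""
    next_state = dict(state)
    for eff in effects:
        for agent, delta in eff.items():
            next_state[agent] += delta
    return next_state

def step(state, joint_action):
    """Simulate one joint action by factorizing it into per-agent effects."""
    effects = [predict_effect(state, a, act) for a, act in joint_action.items()]
    return compose(state, effects)

def tree_search(state, goal, agents, depth=3):
    """Depth-limited search over joint actions, minimizing distance to goal;
    returns the best first joint action."""
    actions = ["left", "stay", "right"]

    def cost(s):
        return sum(abs(s[a] - goal[a]) for a in agents)

    def search(s, d):
        if d == 0:
            return cost(s), None
        best_cost, best_joint = cost(s), None
        for combo in product(actions, repeat=len(agents)):
            joint = dict(zip(agents, combo))
            c, _ = search(step(s, joint), d - 1)
            if c < best_cost:
                best_cost, best_joint = c, joint
        return best_cost, best_joint

    return search(state, depth)[1]

state = {"alice": 0, "bob": 5}
goal = {"alice": 3, "bob": 3}
print(tree_search(state, goal, ["alice", "bob"]))  # first joint action toward the goal
```

The key design point mirrored here is that `step` never models the joint action directly: it composes independently predicted per-agent effects, which is what lets the same simulator handle an arbitrary number of agents without retraining on every agent combination.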