AI Summary
Existing video diffusion models struggle to accurately bind specific actions to their corresponding agents in multi-agent scenarios and lack the capability for synchronized control of multiple characters. This work proposes ActionParty, a generative world model that enables multi-agent action control by introducing state tokens representing each agent's state and integrating a spatial bias mechanism to jointly model these states with video latent representations. This design decouples global rendering from individual action updates, allowing precise simultaneous control of up to seven agents for the first time. Evaluated on 46 environments from the Melting Pot benchmark, ActionParty significantly improves action-following accuracy and identity consistency, effectively addressing the challenges of multi-agent action binding and autoregressive tracking.
Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
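The abstract does not spell out how the spatial biasing mechanism couples state tokens with video latents. As a rough, purely illustrative sketch, joint single-head attention over the concatenated video latent tokens and per-subject state tokens, with an additive bias on the attention logits that steers each state token toward its subject's spatial region, might look as follows. All names, shapes, the random weights, and the bias pattern are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_joint_attention(video_latents, state_tokens, spatial_bias, rng):
    """Single-head self-attention over the concatenation of video latent
    tokens and per-subject state tokens, with an additive spatial bias
    on the attention logits (a hypothetical reading of the paper's
    'spatial biasing mechanism')."""
    x = np.concatenate([video_latents, state_tokens], axis=0)  # (Nv+Ns, d)
    d = x.shape[-1]
    # Random projections stand in for learned Q/K/V weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(d) + spatial_bias  # bias couples tokens spatially
    return softmax(logits) @ v  # jointly updated video latents and state tokens

# Toy shapes: 16 video latent tokens, 7 subject state tokens, feature dim 8.
Nv, Ns, d = 16, 7, 8
rng = np.random.default_rng(42)
video = rng.standard_normal((Nv, d))
states = rng.standard_normal((Ns, d))

# Bias: boost each state token's attention to 'its' spatial patch
# (a fixed stride-2 assignment here, purely illustrative).
bias = np.zeros((Nv + Ns, Nv + Ns))
for s in range(Ns):
    bias[Nv + s, s * 2:(s + 1) * 2] += 4.0  # favor subject s's region

out = biased_joint_attention(video, states, bias, rng)
print(out.shape)  # (23, 8): Nv updated video latents + Ns updated state tokens
```

In this reading, the bias lets each subject's persistent state token exchange information primarily with the video latents covering that subject, which is one plausible way to decouple global frame rendering from per-subject, action-conditioned updates.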