ActionParty: Multi-Subject Action Binding in Generative Video Games

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video diffusion models struggle to accurately bind specific actions to their corresponding agents in multi-agent scenarios and lack the capability for synchronized control of multiple characters. This work proposes ActionParty, a generative world model that enables multi-agent action control by introducing state tokens representing each agent’s state and integrating a spatial bias mechanism to jointly model these states with video latent representations. This design decouples global rendering from individual action updates, allowing precise simultaneous control of up to seven agents for the first time. Evaluated on 46 environments from the Melting Pot benchmark, ActionParty significantly improves action-following accuracy and identity consistency, effectively addressing the challenges of multi-agent action binding and autoregressive tracking.
📝 Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
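The abstract does not spell out how the spatial biasing mechanism is implemented, so the following is only an illustrative sketch of one way such a bias could work, not the paper's method. It assumes each subject state token carries a grid position and that a Gaussian-shaped distance penalty is added to the attention logits between that token and the video latent cells. The names `spatial_bias`, `biased_attention`, `sigma`, and the toy 4x4 grid are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_bias(subject_xy, grid_w, grid_h, sigma=1.0):
    """Hypothetical bias: penalize latent cells far from the subject's position.

    Returns one additive logit bias per cell, in row-major order.
    """
    sx, sy = subject_xy
    return [-((x - sx) ** 2 + (y - sy) ** 2) / (2 * sigma ** 2)
            for y in range(grid_h) for x in range(grid_w)]

def biased_attention(query, keys, bias):
    """Scaled dot-product attention weights with an additive bias on the logits."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) + b
              for key, b in zip(keys, bias)]
    return softmax(logits)

# Toy setup (assumed): a 4x4 grid of video latents with identical keys,
# so any difference in attention comes from the spatial bias alone.
GRID_W, GRID_H = 4, 4
latent_keys = [[0.5, 0.5] for _ in range(GRID_W * GRID_H)]

# Two subjects, each with its own state token (used as the query) and position.
subjects = {
    "player_0": {"query": [1.0, 0.0], "xy": (0, 0)},
    "player_1": {"query": [0.0, 1.0], "xy": (3, 3)},
}

attention = {
    name: biased_attention(s["query"], latent_keys,
                           spatial_bias(s["xy"], GRID_W, GRID_H))
    for name, s in subjects.items()
}

def argmax_cell(weights):
    """Return the (x, y) grid cell that receives the largest attention weight."""
    i = max(range(len(weights)), key=weights.__getitem__)
    return (i % GRID_W, i // GRID_W)

for name in subjects:
    print(name, "attends most to cell", argmax_cell(attention[name]))
```

In this toy setting, each state token's attention concentrates on the latent cells around its own subject, which is one plausible route to keeping a per-subject action update local while the rest of the frame is rendered globally.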
Problem

Research questions and friction points this paper is trying to address.

action binding
multi-subject
video diffusion
world models
generative video games
Innovation

Methods, ideas, or system contributions that make the work stand out.

action binding
multi-subject control
subject state tokens
video diffusion
generative world models