🤖 AI Summary
This work addresses the limitations of existing video world models in interactivity and multi-user collaboration, which hinder environment editing and shared reasoning. The authors propose a diffusion-based game engine architecture that decouples the generative process into three modules (memory storage, observation modeling, and dynamics prediction) by introducing an explicit, editable external memory mechanism. This design enables fine-grained user control over environmental structure and naturally supports real-time multiplayer interaction with consistent state synchronization and shared viewpoints. Experimental results demonstrate that the proposed approach significantly improves the editability, reproducibility, and multi-agent consistency of dynamic world generation.
📝 Abstract
Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment for reproducible, editable experiences, and shared inference in which multiple players influence a common world. To address these limitations, we introduce an explicit external memory: a persistent state that operates independently of the model's context window, is continually updated by user actions, and is queried throughout the generation rollout. Unlike conventional diffusion game engines that operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct control over environment structure via an editable memory representation, and it naturally extends to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.
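The decomposition described above can be sketched in miniature. The following is a hedged illustration, not the paper's implementation: all class and method names (`ExternalMemory`, `edit`, `query`, `rollout`) are assumptions made for exposition, and the "frames" are placeholder strings standing in for the diffusion model's output. It only shows the control flow the abstract describes: user edits mutate a persistent state outside the model's context window, and each generation step reads that state.

```python
# Hypothetical sketch of the Memory / Observation / Dynamics decomposition.
# Names and structure are illustrative assumptions, not the paper's API.

class ExternalMemory:
    """Persistent, editable world state kept outside the model's context window."""
    def __init__(self):
        self.state = {}  # e.g. a map from scene element to its current content

    def edit(self, key, value):
        # Direct user control: edits persist across the whole rollout.
        self.state[key] = value

    def query(self, key):
        # Read by the generator at every step of the rollout.
        return self.state.get(key)


def rollout(memory, actions_by_step, num_frames=3):
    """Toy rollout loop: apply per-step user actions to memory, then
    'render' a frame from the observed state (a string stands in for
    the Dynamics module's frame prediction)."""
    frames = []
    for t in range(num_frames):
        # Memory module: fold this step's user actions into persistent state.
        for key, value in actions_by_step.get(t, {}).items():
            memory.edit(key, value)
        # Observation module: snapshot the shared state all players see.
        observation = dict(memory.state)
        # Dynamics module stand-in: emit a frame conditioned on the observation.
        frames.append(f"frame {t}: {observation}")
    return frames


mem = ExternalMemory()
mem.edit("door", "open")                         # edit the environment before generation
frames = rollout(mem, {1: {"door": "closed"}})   # a player acts at step 1
```

Because the state lives in `ExternalMemory` rather than in the generated frames themselves, the same edited memory can be replayed for reproducibility or shared between players for consistent multiplayer rollouts.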