🤖 AI Summary
Existing pre-trained video generation models rely on static prompts (e.g., text or images), which limits their ability to model interactive, dynamic scenes. To address this, we propose Dynamic World Simulation (DWS), a framework that transforms pre-trained video generative models into interactive world simulators capable of executing specified action trajectories. DWS introduces a lightweight, universal action-conditioned module that aligns generated visual changes with the conditioning actions, and a motion-reinforced loss that prioritizes consistent dynamic transitions over fine visual detail. The framework applies to both diffusion models and autoregressive Transformers. Experiments demonstrate that DWS significantly improves action controllability and dynamic consistency in game and robotics domains. Moreover, for downstream tasks such as model-based reinforcement learning, a prioritized imagination strategy improves sample efficiency and achieves performance competitive with state-of-the-art methods.
📝 Abstract
Video generative models pre-trained on large-scale internet datasets have achieved remarkable success, excelling at producing realistic synthetic videos. However, they typically generate clips from static prompts (e.g., text or images), which limits their ability to model interactive and dynamic scenarios. In this paper, we propose Dynamic World Simulation (DWS), a novel approach that transforms pre-trained video generative models into controllable world simulators capable of executing specified action trajectories. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module that integrates seamlessly into any existing model. Rather than focusing on complex visual details, we demonstrate that consistent modeling of dynamic transitions is the key to building powerful world simulators. Building on this insight, we further introduce a motion-reinforced loss that enhances action controllability by compelling the model to capture dynamic changes more effectively. Experiments demonstrate that DWS applies to both diffusion and autoregressive transformer models, achieving significant improvements in generating action-controllable, dynamically consistent videos across game and robotics domains. Moreover, to facilitate the use of the learned world simulator in downstream tasks such as model-based reinforcement learning, we propose prioritized imagination to improve sample efficiency, demonstrating competitive performance compared with state-of-the-art methods.
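The abstract does not give the form of the motion-reinforced loss. As a rough illustration of the general idea only, the sketch below weights a per-pixel squared error by how much the ground-truth frames change between consecutive time steps, so that errors on moving regions cost more than errors on static background. The function name `motion_weighted_mse`, the weighting formula, and the `alpha` parameter are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def motion_weighted_mse(pred, target, alpha=1.0, eps=1e-8):
    """Hypothetical motion-weighted reconstruction loss (not the paper's formula).

    pred, target: arrays of shape (T, H, W, C) with T >= 2.
    alpha: assumed knob controlling how strongly motion is emphasized.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Ground-truth motion magnitude: absolute frame difference, averaged over channels.
    motion = np.abs(target[1:] - target[:-1]).mean(axis=-1, keepdims=True)

    # Weight is 1 on static pixels and grows with relative motion.
    # For a fully static clip every weight is 1, i.e. plain MSE.
    weights = 1.0 + alpha * motion / (motion.mean() + eps)

    # Squared error on frames 1..T-1, the frames that have a predecessor.
    sq_err = (pred[1:] - target[1:]) ** 2
    return float((weights * sq_err).mean())


# Usage: a two-frame clip in which a single pixel changes.
target = np.zeros((2, 4, 4, 1))
target[1, 0, 0, 0] = 1.0

pred_wrong_on_motion = target.copy()
pred_wrong_on_motion[1, 0, 0, 0] -= 0.5   # error on the moving pixel

pred_wrong_on_static = target.copy()
pred_wrong_on_static[1, 3, 3, 0] += 0.5   # same-size error on a static pixel

loss_motion = motion_weighted_mse(pred_wrong_on_motion, target)
loss_static = motion_weighted_mse(pred_wrong_on_static, target)
print(loss_motion > loss_static)
```

In this toy setup the two predictions have identical plain MSE, yet the weighted loss penalizes the one that gets the moving pixel wrong more heavily, which is the behavior a loss aimed at dynamic consistency would be expected to encourage.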