🤖 AI Summary
Existing interactive video world models struggle with fine-grained multi-entity control and with cross-entity, cross-world generalization because their action interfaces are tightly coupled to specific entities or simulation engines. This work proposes Incantation, a video world model that uses natural language as the action interface. It accepts language conditioning at every latent frame (0.25 seconds) and is presented as the first such model to support simultaneous multi-entity control and concept-level cross-entity transfer. The approach combines a pretrained bidirectional video backbone, frame-local text cross-attention, ODE-initialized Self-Forcing distillation, and a RoPE-decoupled sliding KV cache. Experiments report 89% cross-entity transfer accuracy (versus 43% for the Action-Index baseline) and a 90% success rate on out-of-vocabulary prompts (baseline: 0%). The 2-step student model runs at 19.7 FPS at 480p with stable FVD over 2-hour rollouts.
📝 Abstract
Modern interactive video world models have achieved impressive visual fidelity, yet they lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g., animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface, unlocking expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and we enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV cache. Incantation surpasses the Action-Index baseline on cross-entity transfer (89% vs. 43%) and on out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters (KOF), changing only the per-entity action vocabulary slots. A preview subset of the Incantation dataset is available at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes and contains manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.
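The abstract does not define frame-local text cross-attention. One plausible reading is that the video tokens of each latent frame attend only to the text tokens of the prompt issued for that same frame. The sketch below illustrates that reading with a boolean attention mask and a masked softmax in plain Python. The function names, the token counts, and the fixed number of text tokens per frame are assumptions made for illustration and are not taken from the paper.

```python
import math

def frame_local_mask(num_frames, video_tokens_per_frame, text_tokens_per_frame):
    """Return mask[q][k], True when video token q may attend to text token k.

    Assumption: tokens are laid out frame by frame, and a video token only
    sees the text tokens that belong to its own latent frame.
    """
    n_q = num_frames * video_tokens_per_frame
    n_k = num_frames * text_tokens_per_frame
    mask = [[False] * n_k for _ in range(n_q)]
    for q in range(n_q):
        frame = q // video_tokens_per_frame      # latent frame of this query
        start = frame * text_tokens_per_frame    # first text token of that frame
        for k in range(start, start + text_tokens_per_frame):
            mask[q][k] = True
    return mask

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed keys get exactly zero weight."""
    peak = max(s for s, a in zip(scores, allowed) if a)  # for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: 3 latent frames, 2 video tokens and 2 text tokens per frame.
mask = frame_local_mask(num_frames=3, video_tokens_per_frame=2, text_tokens_per_frame=2)

# Video token 2 belongs to latent frame 1, so only text tokens 2 and 3 receive weight.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0, 0.1, 0.2], mask[2])
```

Under this reading, changing the prompt for one 0.25 s latent frame changes the conditioning of that frame's video tokens only, which is consistent with the per-latent-frame control the abstract describes. The model's actual masking scheme may differ.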