INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation methods struggle to achieve spatial consistency and visual realism at the same time in complex environments, which limits their use for real-time interactive navigation. This work proposes INSPATIO-WORLD, a real-time 4D world simulation framework built on a Spatiotemporal Autoregressive (STAR) architecture that combines an implicit spatiotemporal cache with explicit geometric constraints to recover and generate high-fidelity, dynamic interactive scenes from a single reference video. It also introduces Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizing guide to counter the fidelity degradation caused by over-reliance on synthetic data. On the WorldScore-Dynamic benchmark, the method ranks first among real-time interactive methods, with better spatial consistency and interaction precision during long-horizon navigation. The result is a practical pipeline for navigating 4D environments reconstructed from monocular videos.

📝 Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
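
The abstract describes the control flow only at a high level: user interactions become camera poses, and each generation step is conditioned on the reference plus past observations. The toy sketch below illustrates that loop under loud assumptions. It is not the authors' implementation: `Pose`, `apply_action`, `RolloutCache`, `generate_chunk`, the step sizes, and the planar camera model are all hypothetical names and simplifications introduced here, and the generative model is replaced by a placeholder that returns a small record instead of frames.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Planar camera pose (assumed simplification): position and heading."""
    x: float = 0.0    # rightward axis
    z: float = 0.0    # forward axis at yaw = 0
    yaw: float = 0.0  # radians; positive yaw turns the heading toward +x


# Hypothetical step sizes, chosen for illustration only.
STEP = 0.1
TURN = math.radians(5.0)


def apply_action(pose: Pose, action: str) -> Pose:
    """Map one discrete user action to the next camera pose."""
    dx, dz = math.sin(pose.yaw), math.cos(pose.yaw)  # heading direction
    if action == "forward":
        return Pose(pose.x + STEP * dx, pose.z + STEP * dz, pose.yaw)
    if action == "back":
        return Pose(pose.x - STEP * dx, pose.z - STEP * dz, pose.yaw)
    if action == "turn_right":
        return Pose(pose.x, pose.z, pose.yaw + TURN)
    if action == "turn_left":
        return Pose(pose.x, pose.z, pose.yaw - TURN)
    raise ValueError(f"unknown action: {action!r}")


class RolloutCache:
    """Keeps the reference entry permanently plus a bounded history window."""

    def __init__(self, reference, max_history: int):
        self.reference = reference
        self.history = deque(maxlen=max_history)  # oldest entries drop out

    def context(self):
        return [self.reference, *self.history]

    def append(self, item):
        self.history.append(item)


def generate_chunk(context, pose: Pose):
    """Placeholder for the generative model: returns a record, not frames."""
    return {"pose": pose, "context_size": len(context)}


def rollout(reference, actions, max_history: int = 4):
    """Autoregressive loop: each step sees the reference and recent outputs."""
    cache = RolloutCache(reference, max_history)
    pose = Pose()
    outputs = []
    for action in actions:
        pose = apply_action(pose, action)
        chunk = generate_chunk(cache.context(), pose)
        cache.append(chunk)
        outputs.append(chunk)
    return outputs
```

Calling `rollout("reference-video", ["forward", "turn_right", "forward"])` returns one record per action. The reference entry never leaves the context while older history is evicted, which mirrors, in a very reduced form, the idea of conditioning on both reference and historical observations.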

Problem

Research questions and friction points this paper is trying to address.

world models

spatial consistency

real-time interactivity

4D scene generation

video-based navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatiotemporal Autoregressive Modeling

Implicit Spatiotemporal Cache

Explicit Spatial Constraint