INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation methods struggle to achieve spatial consistency and visual realism at the same time in complex environments, which limits their use for real-time interactive navigation. This work proposes INSPATIO-WORLD, a real-time 4D world simulation framework that recovers and generates high-fidelity, dynamic interactive scenes from a single reference video. It is built on Spatiotemporal Autoregressive (STAR) modeling, which combines an implicit spatiotemporal cache with explicit geometric constraints. It also introduces Joint Distribution Matching Distillation (JDMD), which uses real-world data distributions as a regularizer to counter the fidelity loss caused by over-reliance on synthetic data. On the WorldScore-Dynamic benchmark, the method ranks first among real-time interactive methods, with better spatial consistency and interaction precision over extended navigation sequences. The result is a practical pipeline for navigating 4D environments reconstructed from monocular video.
📝 Abstract
Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: an Implicit Spatiotemporal Cache, which aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; and an Explicit Spatial Constraint Module, which enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.
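The autoregressive loop described in the abstract can be pictured with a small toy sketch. It is not the paper's implementation: every name here (`Pose`, `apply_action`, `SpatiotemporalCache`, `generate_frame`, `rollout`) is hypothetical, the "generator" is a placeholder that only summarizes its inputs, and the bounded history window is an assumption made for illustration. It shows only the control flow: a user action becomes a camera pose, the next frame is conditioned on the reference plus recent history, and the result is fed back into the cache.

```python
import math
from collections import deque
from dataclasses import dataclass


@dataclass
class Pose:
    """Planar camera pose (hypothetical simplification of a full 6-DoF pose)."""
    x: float = 0.0
    z: float = 0.0
    yaw: float = 0.0  # radians


def apply_action(pose, action, step=0.1, turn=math.radians(5)):
    """Assumed mapping from a user input to the next camera pose."""
    if action == "forward":
        return Pose(pose.x + step * math.sin(pose.yaw),
                    pose.z + step * math.cos(pose.yaw),
                    pose.yaw)
    if action == "left":
        return Pose(pose.x, pose.z, pose.yaw + turn)
    if action == "right":
        return Pose(pose.x, pose.z, pose.yaw - turn)
    raise ValueError(f"unknown action: {action}")


class SpatiotemporalCache:
    """Keeps the reference observations permanently and a bounded recent history."""

    def __init__(self, reference, max_history=4):
        self.reference = list(reference)
        self.history = deque(maxlen=max_history)  # oldest entries drop out

    def add(self, latent):
        self.history.append(latent)

    def context(self):
        return self.reference + list(self.history)


def generate_frame(context, pose):
    """Placeholder for a learned generator: returns a summary tuple, not an image."""
    return (len(context), round(pose.x, 3), round(pose.z, 3), round(pose.yaw, 3))


def rollout(reference, actions, max_history=4):
    """Autoregressive loop: each new frame is conditioned on cache and pose."""
    cache = SpatiotemporalCache(reference, max_history)
    pose = Pose()
    frames = []
    for action in actions:
        pose = apply_action(pose, action)
        frame = generate_frame(cache.context(), pose)
        cache.add(frame)  # generated output becomes future context
        frames.append(frame)
    return frames


frames = rollout(["ref0", "ref1"], ["forward"] * 6)
```

In this sketch the reference entries are never evicted, so each step still sees the original observations however long the rollout runs, while the history window keeps the per-step context size bounded.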
Problem

Research questions and friction points this paper is trying to address.

world models
spatial consistency
real-time interactivity
4D scene generation
video-based navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatiotemporal Autoregressive Modeling
Implicit Spatiotemporal Cache
Explicit Spatial Constraint
Joint Distribution Matching Distillation
Real-Time 4D World Simulation
Authors

InSpatio Team
Donghui Shen
Guofeng Zhang
Haomin Liu (SenseTime; SLAM, Structure from Motion)
Haoyu Ji
Hujun Bao
Hongjia Zhai (PhD student in Computer Science, State Key Lab of CAD&CG, Zhejiang University)
Jialin Liu
Jing Guo
Nan Wang
Siji Pan
Weihong Pan
Weijian Xie (Zhejiang University)
Xianbin Liu
Xiaojun Xiang
Xiaoyu Zhang
Xinyu Chen
Yifu Wang (Tencent XR Vision Labs; Computer Vision, Robotics, Event-based Vision, SLAM, Visual Odometry)
Yipeng Chen
Zhenzhou Fan
Zhewen Le
Zhichao Ye
Ziqiang Zhao