InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional video world models: because they generate frames sequentially and process them in windows, they struggle to support real-time spatial reasoning. The authors instead propose a real-time world model that generates each frame independently. By combining explicit 3D anchors with an implicit spatial memory mechanism, the model maintains consistent geometric structure and visual fidelity across viewpoints. A three-stage progressive training pipeline adapts a pretrained image diffusion model into a controllable frame generator, which few-step distillation then accelerates to enable real-time interactive world simulation on consumer-grade GPUs. The method is reported to achieve lower latency and higher efficiency than existing video world models while preserving rendering quality.

📝 Abstract
We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
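
The latency argument in the abstract can be illustrated with a toy calculation. This is a minimal sketch, not the authors' method or measurements: it assumes a window-based model must finish an entire window before any frame in it can be displayed, that every frame costs the same time to generate, and that a frame-based model corresponds to a window of size 1. The function name and all numbers are hypothetical.

```python
def time_to_first_frame_ms(per_frame_ms: float, window_size: int) -> float:
    """Time until the first frame is displayable (toy model).

    Assumption (not from the paper): frames are produced in windows of
    `window_size`, and no frame is available until its window is complete.
    A frame-based model is modeled as window_size == 1.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return per_frame_ms * window_size


# Hypothetical numbers, chosen only to show the scaling.
video_model_latency = time_to_first_frame_ms(per_frame_ms=30.0, window_size=16)
frame_model_latency = time_to_first_frame_ms(per_frame_ms=30.0, window_size=1)
```

Under these assumptions the wait before the first frame grows linearly with the window size, which is the cost that a frame-independent generator avoids.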

Problem

Research questions and friction points this paper is trying to address.

real-time spatial inference
multi-view consistency
world models
low-latency generation
spatial intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

frame-based generation
spatial consistency
real-time inference
3D anchors
few-step distillation

Authors

InSpatio Team
Xiaoyu Zhang
Weihong Pan
Zhichao Ye
Jialin Liu
Yipeng Chen
Nan Wang
Xiaojun Xiang
Weijian Xie, Zhejiang University
Yifu Wang, Tencent XR Vision Labs (Computer Vision, Robotics, Event-based Vision, SLAM, Visual Odometry)
Haoyu Ji
Siji Pan
Zhewen Le
Jing Guo
Xianbin Liu
Donghui Shen
Ziqiang Zhao
Haomin Liu, SenseTime (SLAM, Structure from Motion)
Guofeng Zhang