🤖 AI Summary
This work addresses the limitations of conventional video world models, which struggle to support real-time spatial reasoning because they generate frames sequentially and process them in windows. To overcome this, the authors propose a real-time world model built on a frame-independent generation paradigm, in which each frame is produced on its own rather than as part of a sequential video stream. By integrating explicit 3D anchors with an implicit spatial memory mechanism, the model maintains consistent geometric structure and visual fidelity across viewpoints. A three-stage progressive training pipeline adapts a pretrained image diffusion model into a low-latency, controllable frame generator, which is further accelerated via few-step distillation to enable real-time interactive world simulation on consumer-grade GPUs. The method achieves significantly lower latency and higher efficiency than existing video world models while preserving high-quality rendering.
📝 Abstract
We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
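The core idea of the abstract, frame-independent generation conditioned on a shared spatial memory rather than on previous frames, can be illustrated with a toy sketch. Everything here (`SpatialMemory`, `generate_frame`, `few_step_denoise`, the pose format) is a hypothetical stand-in, not the paper's actual API or architecture; the point is only that each frame depends on the requested pose plus the shared 3D anchors, never on a prior frame.

```python
# Hypothetical sketch of frame-independent generation with explicit 3D anchors.
# All names and the toy "denoiser" are illustrative assumptions, not the paper's code.
import math
import random


class SpatialMemory:
    """Explicit 3D anchors shared across all frames (a stand-in for the
    explicit-anchor + implicit-memory mechanism described in the abstract)."""

    def __init__(self, anchors):
        self.anchors = anchors  # list of (x, y, z) world-space points

    def project(self, pose):
        """Project anchors into the view given by `pose` (toy pinhole, yaw only)."""
        yaw = pose["yaw"]
        pts = []
        for x, y, z in self.anchors:
            # Rotate around the vertical axis, keep points in front of the camera.
            xr = x * math.cos(yaw) - z * math.sin(yaw)
            zr = x * math.sin(yaw) + z * math.cos(yaw)
            if zr > 0:
                pts.append((xr / zr, y / zr))
        return pts


def few_step_denoise(cond, steps=4, seed=0):
    """Stand-in for a distilled few-step sampler: deterministic given (cond, seed),
    pulling a noise vector toward the conditioning in a handful of steps."""
    rng = random.Random(seed * 1000 + len(cond))
    frame = [rng.random() for _ in range(8)]
    target = sum(u for u, _ in cond) / max(len(cond), 1)
    for _ in range(steps):
        frame = [0.5 * (v + target) for v in frame]
    return frame


def generate_frame(memory, pose, seed=0):
    """Frame-independent generation: output depends only on (pose, memory, seed),
    never on previously generated frames."""
    cond = memory.project(pose)
    return few_step_denoise(cond, steps=4, seed=seed)
```

Because a frame is a pure function of pose and memory, revisiting a viewpoint after exploring elsewhere reproduces the same frame, which is the multi-view consistency property the abstract emphasizes, without the latency of regenerating an intervening video sequence.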