🤖 AI Summary
Existing video generation models face limitations in real-time interactivity, long-horizon temporal consistency, and persistent memory of dynamic scenes, hindering their use as practical world models. This work proposes TeleWorld, a real-time multimodal 4D world modeling framework built around a novel "generate-reconstruct-guide" closed loop: generated video is continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation. Built upon an autoregressive diffusion video model, TeleWorld combines hierarchical Macro-from-Micro Planning (MMPL) with efficient Distribution Matching Distillation (DMD) to jointly model static environments and dynamic objects within a unified architecture. The method significantly outperforms existing approaches in static and dynamic world understanding, temporal coherence, and real-time generation efficiency, advancing the practical deployment of interactive, embodied world models.
📝 Abstract
World models aim to endow AI systems with the ability to represent, generate, and interact with dynamic environments in a coherent and temporally consistent manner. While recent video generation models have demonstrated impressive visual quality, they remain limited in real-time interaction, long-horizon consistency, and persistent memory of dynamic scenes, hindering their evolution into practical world models. In this report, we present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. TeleWorld introduces a novel generation-reconstruction-guidance paradigm, where generated video streams are continuously reconstructed into a dynamic 4D spatio-temporal representation, which in turn guides subsequent generation to maintain spatial, temporal, and physical consistency. To support long-horizon generation with low latency, we employ an autoregressive diffusion-based video model enhanced with Macro-from-Micro Planning (MMPL), a hierarchical planning method that reduces error accumulation from the frame level to the segment level, alongside efficient Distribution Matching Distillation (DMD), enabling real-time synthesis under practical computational budgets. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible systems. Extensive experiments demonstrate that TeleWorld achieves strong performance in static and dynamic world understanding, long-term consistency, and real-time generation efficiency, positioning it as a practical step toward interactive, memory-enabled world models for multimodal generation and embodied intelligence.
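The generation-reconstruction-guidance loop described above can be pictured as a simple control flow: generate a segment, fold it into persistent memory, and condition the next segment on that memory. The toy sketch below illustrates only this data flow; every class and function name is a hypothetical placeholder (the paper's actual components are neural generators and 4D reconstructors, not the scalar stand-ins used here).

```python
# Toy illustration of a generate-reconstruct-guide closed loop.
# All names are illustrative placeholders, not TeleWorld's actual API;
# real components would be a diffusion generator and a 4D reconstructor.

class WorldMemory:
    """Stand-in for the persistent dynamic 4D spatio-temporal memory."""
    def __init__(self):
        self.observations = []  # accumulated "scene" state

    def reconstruct(self, segment):
        # Fold each generated segment back into persistent memory.
        self.observations.extend(segment)

    def guidance(self):
        # Condition the next segment on what was reconstructed so far;
        # here just the last value, a stand-in for spatial/physical guidance.
        return self.observations[-1] if self.observations else 0.0


def generate_segment(guide, length=4):
    """Stand-in for the autoregressive generator: each segment continues
    smoothly from the guidance value, so memory enforces consistency."""
    return [guide + 0.1 * (i + 1) for i in range(length)]


def closed_loop(num_segments=3):
    memory = WorldMemory()
    video = []
    for _ in range(num_segments):
        segment = generate_segment(memory.guidance())  # generate
        memory.reconstruct(segment)                    # reconstruct
        video.extend(segment)                          # memory guides next step
    return video
```

Because each segment starts from the memory's guidance rather than from scratch, the concatenated output stays continuous across segment boundaries, which is the property the closed loop is meant to preserve over long horizons.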