An Efficient and Multi-Modal Navigation System with One-Step World Model

📅 2026-01-18
🤖 AI Summary
This work addresses the limitations of existing learning-based navigation methods, which exhibit constrained capabilities in 3D spatial reasoning and dynamic modeling, and suffer from high computational latency due to multi-step generative world models that hinder real-time deployment. To overcome these challenges, we propose a lightweight, single-step generative world model that integrates a 3D U-Net architecture with an efficient spatiotemporal attention mechanism, coupled with an anchor-initialized optimization-based planning framework for efficient multimodal goal-oriented navigation. By abandoning conventional multi-step diffusion or autoregressive generation paradigms, our approach substantially reduces inference latency while maintaining high prediction accuracy to support high-frequency closed-loop control. Experimental results demonstrate that the proposed system outperforms state-of-the-art methods in both simulation and real-world environments, achieving significant improvements in efficiency and robustness.

📝 Abstract
Navigation is a fundamental capability for mobile robots. While the current trend is to replace traditional geometry-based methods with learning-based approaches, existing end-to-end learning-based policies often struggle with 3D spatial reasoning and lack a comprehensive understanding of physical world dynamics. Integrating world models, which predict future observations conditioned on given actions, with iterative optimization-based planning offers a promising solution because world models can imagine the outcomes of candidate actions and the planner remains flexible across goals. However, current navigation world models, typically built on pure transformer architectures, often rely on multi-step diffusion processes and autoregressive frame-by-frame generation. These mechanisms incur prohibitive computational latency, which makes real-time deployment impractical. To address this bottleneck, we propose a lightweight navigation world model that adopts a one-step generation paradigm and a 3D U-Net backbone equipped with efficient spatio-temporal attention. This design drastically reduces inference latency, enabling high-frequency control while achieving superior predictive performance. We also integrate this model into an optimization-based planning framework that uses anchor-based initialization to handle multi-modal goal navigation tasks. Extensive closed-loop experiments in both simulation and real-world environments demonstrate our system's superior efficiency and robustness compared to state-of-the-art baselines.
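
As a rough illustration of the planning idea in the abstract, the Python sketch below refines a handful of anchor action sequences against a model that returns the whole predicted future in a single call. It is not the authors' implementation. A hand-written unicycle model (`predict_rollout`) stands in for the learned world model, constant-turn primitives stand in for the anchors, a goal-distance cost stands in for the multi-modal goal objective, and plain random-search refinement stands in for the paper's optimizer. Every name and parameter here is hypothetical.

```python
import math
import random


def predict_rollout(pose, actions, dt=0.5):
    """Stand-in for the learned world model (assumption): one call maps a full
    action sequence to every future pose. A unicycle model replaces the network."""
    x, y, theta = pose
    poses = []
    for v, w in actions:
        theta += w * dt
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        poses.append((x, y, theta))
    return poses


def cost(pose, actions, goal):
    """Distance from the final predicted position to the goal (toy objective)."""
    x, y, _ = predict_rollout(pose, actions)[-1]
    return math.hypot(goal[0] - x, goal[1] - y)


def make_anchors(horizon, v=1.0, turn_rates=(-0.6, -0.3, 0.0, 0.3, 0.6)):
    """Hypothetical anchors: constant-speed, constant-turn-rate action sequences."""
    return [[(v, w)] * horizon for w in turn_rates]


def plan(pose, goal, horizon=8, iters=30, samples=16, sigma=0.2, seed=0):
    """Refine each anchor by random search and return the best sequence found."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for anchor in make_anchors(horizon):
        current, current_cost = anchor, cost(pose, anchor, goal)
        for _ in range(iters):
            for _ in range(samples):
                # Perturb every action; keep the candidate only if it improves.
                candidate = [
                    (max(0.0, v + rng.gauss(0.0, sigma)), w + rng.gauss(0.0, sigma))
                    for v, w in current
                ]
                candidate_cost = cost(pose, candidate, goal)
                if candidate_cost < current_cost:
                    current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best_actions, best_cost = current, current_cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, final_cost = plan((0.0, 0.0, 0.0), (3.0, 2.0))
    print(f"planned {len(actions)} actions, goal distance {final_cost:.3f}")
```

Starting from several anchors rather than a single initial guess lets the search explore distinct motion modes (for example, passing left or right) before committing. The per-iteration cost is dominated by calls to the model, which is why a low-latency predictor matters for closed-loop control.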
Problem

Research questions and friction points this paper is trying to address.

navigation
world model
real-time deployment
computational latency
3D spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-step world model
3D U-Net
spatio-temporal attention
multi-modal navigation
real-time planning
Wangtian Shen
Tsinghua University
Ziyang Meng
Tsinghua University
Jinming Ma
University of Science and Technology of China
reinforcement learning
Mingliang Zhou
Xiaomi (China)
Diyun Xiang
Xiaomi (China)