AstraNav-World: World Model for Foresight Control and Consistency

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of long-horizon foresight and physical consistency in embodied navigation within open, dynamic environments, this paper proposes an end-to-end world model that jointly models future visual states and action sequences, enabling simultaneous visual prediction and action planning within a probabilistic framework. Our method introduces a novel bidirectional constraint training mechanism, ensuring that visual predictions are executable and that actions are grounded in task-relevant futures—thereby departing from the conventional decoupled “imagine-then-plan” paradigm. Integrating diffusion-based video generation with vision-language policy networks, we jointly optimize action-conditioned multi-step visual forecasting and vision-conditioned trajectory generation. Extensive evaluation across multiple embodied navigation benchmarks demonstrates significant improvements in trajectory accuracy and success rates. Moreover, zero-shot transfer to real-world environments yields strong performance, validating the model’s robust generalization in spatial understanding and navigation dynamics.

📝 Abstract
Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated strong zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distributions. Overall, by unifying visual foresight and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
Problem

Research questions and friction points this paper is trying to address.

Develops world model for foresight in dynamic navigation environments
Unifies visual prediction and action planning to reduce errors
Enables zero-shot adaptation to unseen real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified probabilistic framework for joint visual-action reasoning
Diffusion-based video generator integrated with vision-language policy
Bidirectional training optimizing visual predictions and action trajectories
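The bidirectional training idea above can be sketched as a single joint loss: the video branch forecasts futures conditioned on actions, and the policy branch derives actions conditioned on those predicted futures, so gradients couple the two. This is an illustrative sketch only; `video_model`, `policy_model`, and the squared-error terms are hypothetical placeholders standing in for the paper's diffusion-based generator, vision-language policy, and their actual training losses.

```python
import numpy as np

def joint_bidirectional_loss(obs, actions, future_frames,
                             video_model, policy_model,
                             w_vis=1.0, w_act=1.0):
    """Sketch of a bidirectional world-model objective (placeholder models).

    obs           -- current observation features
    actions       -- ground-truth action sequence
    future_frames -- ground-truth future visual states
    """
    # Branch 1: action-conditioned multi-step visual forecasting.
    pred_frames = video_model(obs, actions)
    vis_loss = np.mean((pred_frames - future_frames) ** 2)

    # Branch 2: vision-conditioned trajectory generation, fed the
    # *predicted* futures so the two branches constrain each other.
    pred_actions = policy_model(obs, pred_frames)
    act_loss = np.mean((pred_actions - actions) ** 2)

    # A unified objective instead of a decoupled "imagine-then-plan" pipeline.
    return w_vis * vis_loss + w_act * act_loss
```

Because the policy consumes predicted rather than ground-truth frames, visual errors are penalized through the action term as well, which is one way to realize the "predictions must be executable" constraint.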
Authors

Junjun Hu (Amap Alibaba)
Jintao Chen (Amap Alibaba; PKU)
Haochen Bai (Amap Alibaba)
Minghua Luo (Amap Alibaba)
Shichao Xie (AutoNavi, Alibaba Group; computer vision, SLAM, VIO)
Ziyi Chen (Amap Alibaba)
Fei Liu (Amap Alibaba)
Zedong Chu (Amap Alibaba)
Xinda Xue (Amap Alibaba; PKU)
Botao Ren (Tsinghua University; computer vision, object detection, remote sensing)
Xiaolong Wu (Georgia Institute of Technology; SLAM, localization, robotics)
Mu Xu (Amap Alibaba)
Shanghang Zhang (Peking University; embodied AI, foundation models)