WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language navigation methods suffer from unstable trajectory prediction from a single first-person observation and lack structured supervisory signals for navigation. This work proposes WorldMAP, a novel framework that, for the first time, leverages a generative world model to transform future video predictions into semantic-spatial memory, and combines this with explicit planning to generate structured pseudo-labels. A lightweight multi-hypothesis trajectory predictor is trained via a teacher-student mechanism, enabling stable end-to-end navigation. Evaluated on Target-Bench, the proposed method significantly outperforms baseline approaches, reducing ADE and FDE by 18.0% and 42.1%, respectively, and enables a small open-source vision-language model to achieve DTW performance comparable to that of closed-source counterparts.
📝 Abstract
Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate unstable trajectories, and world models, though able to synthesize plausible futures, do not directly provide the grounded signals needed for navigation learning. This raises a central question: how can generated futures be turned into supervision for grounded trajectory prediction? We present WorldMAP, a teacher-student framework that converts world-model-generated futures into persistent semantic-spatial structure and planning-derived supervision. Its world-model-driven teacher builds semantic-spatial memory from generated videos, grounds task-relevant targets and obstacles, and produces trajectory pseudo-labels through explicit planning. A lightweight student with a multi-hypothesis trajectory head is then trained to predict navigation trajectories directly from vision-language inputs. On Target-Bench, WorldMAP achieves the best ADE and FDE among compared methods, reducing ADE by 18.0% and FDE by 42.1% relative to the best competing baseline, while lifting a small open-source VLM to DTW performance competitive with proprietary models. More broadly, the results suggest that, in embodied navigation, the value of world models may lie less in supplying action-ready imagined evidence than in synthesizing structured supervision for navigation learning.
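For readers unfamiliar with the reported metrics, ADE and FDE have standard definitions in trajectory prediction: ADE is the mean L2 distance between predicted and ground-truth waypoints over all timesteps, and FDE is the L2 distance at the final waypoint. The sketch below illustrates both on toy 2D trajectories; the variable names and toy data are illustrative, not taken from the paper.

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all timesteps."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: L2 distance at the last timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

# Toy trajectories with T=4 waypoints in 2D (illustrative values)
gt = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
pred = np.array([[0, 0], [1, 0.5], [2, 0.5], [3, 1]], dtype=float)

print(ade(pred, gt))  # 0.5  (mean of distances 0, 0.5, 0.5, 1)
print(fde(pred, gt))  # 1.0  (distance at the final waypoint)
```

With a multi-hypothesis head like WorldMAP's, these are typically reported as the minimum over the K predicted trajectories (min-ADE / min-FDE), i.e. the error of the best hypothesis.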
Problem

Research questions and friction points this paper is trying to address.

vision-language navigation
trajectory prediction
generative world models
supervision signal
embodied navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

world models
vision-language navigation
trajectory prediction
structured supervision
teacher-student framework