MapDream: Task-Driven Map Learning for Vision-Language Navigation

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes MapDream, a framework that addresses two limitations of existing vision-and-language navigation (VLN) methods: reliance on hand-crafted maps decoupled from the agent's policy, and difficulty aggregating task-relevant spatial context. MapDream formulates map construction as a navigation-driven, autoregressive bird's-eye-view (BEV) generation process, enabling end-to-end joint learning of map generation and action prediction so that the generated map retains only navigation-critical information. By combining supervised pre-training with reinforcement fine-tuning, the approach achieves state-of-the-art monocular VLN performance on the R2R-CE and RxR-CE benchmarks, tightening the alignment between map representation and control policy.

📝 Abstract
Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.
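The abstract describes a map-in-the-loop design: at each step the agent autoregressively refines a compact three-channel BEV map from the current observation, and the policy predicts an action from that map. The paper's actual architecture is not reproduced here; the following is a minimal toy sketch of the loop structure only, where a running average stands in for the learned BEV generator and a random linear head stands in for the policy (all class names, channel semantics, and the four-action space are illustrative assumptions, not the authors' code):

```python
import numpy as np

class MapInTheLoopSketch:
    """Toy illustration of autoregressive BEV map refinement plus
    action prediction. NOT the MapDream implementation."""

    def __init__(self, map_size=32, seed=0):
        self.rng = np.random.default_rng(seed)
        # Compact 3-channel BEV map; channel meanings (e.g. obstacles,
        # frontier, goal likelihood) are assumed for illustration.
        self.bev = np.zeros((3, map_size, map_size), dtype=np.float32)
        # Stand-in for learned policy weights: 4 discrete actions
        # (forward, left, right, stop) from pooled map features.
        self.w_policy = self.rng.standard_normal((4, 3))

    def update_map(self, obs_feat):
        """Autoregressive step: new map = f(previous map, observation).
        An exponential moving average stands in for the generator."""
        self.bev = 0.9 * self.bev + 0.1 * obs_feat
        return self.bev

    def predict_action(self):
        """Action index from globally pooled map features."""
        pooled = self.bev.mean(axis=(1, 2))   # shape (3,)
        logits = self.w_policy @ pooled       # shape (4,)
        return int(np.argmax(logits))

agent = MapInTheLoopSketch()
for step in range(5):
    obs = agent.rng.random((3, 32, 32)).astype(np.float32)
    agent.update_map(obs)   # map refined in the loop, not rebuilt
action = agent.predict_action()
print(action)  # an index in 0..3
```

The point of the sketch is only the control flow the abstract argues for: the map is a state updated jointly with the policy at every step, rather than a reconstruction built by a separate mapping module.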
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
map representation
task-driven learning
3D environment
navigation policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

task-driven map learning
vision-language navigation
autoregressive BEV synthesis
map-in-the-loop
navigation affordances
Guoxin Lian
Renmin University of China
Shuo Wang
Renmin University of China
Navigation, SLAM, 3D vision, VLM
Yucheng Wang
ETH Zürich
Multimodal LLM, Speech Understanding, Human-Computer Interaction
Yongcai Wang
Renmin University of China
Maiyue Chen
Horizon Robotics
Kaihui Wang
Horizon Robotics
Bo Zhang
Horizon Robotics
Zhizhong Su
Horizon Robotics
Deep Learning, Computer Vision, Autonomous Driving, Robotics Learning
Deying Li
Renmin University of China
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing