MapDream: Task-Driven Map Learning for Vision-Language Navigation

📅 2026-01-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes MapDream, a framework for Vision-Language Navigation (VLN). Existing VLN methods rely on hand-crafted maps that are built independently of the agent's policy, so they struggle to aggregate task-relevant spatial context. MapDream instead formulates map construction as navigation-driven, autoregressive bird's-eye-view (BEV) image generation and learns map generation and action prediction jointly, so that the compact three-channel map is shaped to keep only navigation-critical information. Supervised pre-training is followed by reinforcement fine-tuning, and the approach reports state-of-the-art monocular VLN performance on the R2R-CE and RxR-CE benchmarks.

📝 Abstract

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.
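The map-in-the-loop idea described in the abstract can be illustrated with a toy rollout loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the grid size, token vocabulary, action set, and the functions `generate_bev_map`, `predict_action`, and `rollout` are all hypothetical, and the "models" are arithmetic stand-ins for learned networks. What it shows is the control flow: at each step the map is sampled token by token, each token conditioned on the tokens before it, and the action is then predicted from that generated map.

```python
import random

# All constants below are hypothetical, chosen only for illustration.
ACTIONS = ("STOP", "FORWARD", "TURN_LEFT", "TURN_RIGHT")
MAP_SIZE = 4       # toy 4x4 BEV grid
NUM_CHANNELS = 3   # three-channel BEV map, as stated in the abstract
VOCAB = 8          # toy number of discrete values per map token


def generate_bev_map(context, rng):
    """Sample map tokens one at a time; each depends on all earlier tokens."""
    tokens = []
    for _ in range(MAP_SIZE * MAP_SIZE * NUM_CHANNELS):
        # Stand-in for a learned next-token distribution: slightly favour
        # one value determined by the context and the tokens so far.
        favoured = (sum(context) + sum(tokens)) % VOCAB
        weights = [2.0 if v == favoured else 1.0 for v in range(VOCAB)]
        tokens.append(rng.choices(range(VOCAB), weights=weights)[0])
    return tokens


def predict_action(map_tokens, instruction_ids):
    """Stand-in policy head: picks an action from the generated map."""
    index = (sum(map_tokens) + sum(instruction_ids)) % len(ACTIONS)
    return ACTIONS[index]


def rollout(instruction_ids, observations, seed=0):
    """Alternate map generation and action prediction until STOP."""
    rng = random.Random(seed)
    prev_map = []
    trajectory = []
    for obs in observations:
        # The new map is conditioned on instruction, observation, and old map.
        context = list(instruction_ids) + list(obs) + prev_map
        prev_map = generate_bev_map(context, rng)
        action = predict_action(prev_map, instruction_ids)
        trajectory.append((prev_map, action))
        if action == "STOP":
            break
    return trajectory


if __name__ == "__main__":
    steps = rollout([1, 2, 3], [[0, 1], [2, 3], [4, 5]])
    for bev_map, action in steps:
        print(len(bev_map), action)
```

Because the map is produced by sampling rather than by a fixed geometric projection, a navigation reward can in principle be propagated back to the map generator during reinforcement fine-tuning, which is the property the abstract attributes to the autoregressive design.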

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation

map representation

task-driven learning

3D environment

navigation policy

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-driven map learning

vision-language navigation

autoregressive BEV synthesis