🤖 AI Summary
This work proposes MapDream, a framework for vision-and-language navigation (VLN) that targets a limitation of existing methods: they rely on hand-crafted maps decoupled from the agent's policy and struggle to aggregate task-relevant spatial context. MapDream formulates map construction as a navigation-driven, autoregressive bird's-eye-view (BEV) generation process, which allows map generation and action prediction to be learned jointly end to end. The generated map is a compact three-channel BEV image intended to retain only navigation-critical information. By combining supervised pre-training with reinforcement fine-tuning, the approach reports state-of-the-art monocular VLN performance on the R2R-CE and RxR-CE benchmarks and couples the map representation more tightly to the control policy.
📝 Abstract
Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. In experiments on R2R-CE and RxR-CE, MapDream achieves state-of-the-art monocular performance, supporting task-driven generative map learning.
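The map-in-the-loop control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned BEV generator and the learned policy are replaced by hand-written toy functions, and every name here (`toy_generate_map`, `toy_policy`, `rollout`, the action strings, the meaning assigned to channel 0, and the observation format) is a hypothetical placeholder. The only things taken from the abstract are the loop structure (each new map is generated conditioned on the previous one, and the action is predicted from the map) and the three-channel map shape.

```python
from typing import Callable, List, Sequence, Tuple

# A BEV map indexed as [channel][row][col]; three channels as in the abstract.
BevMap = List[List[List[float]]]
# Hypothetical observation format: one grid cell seen as navigable at this step.
Observation = Tuple[int, int]


def empty_map(size: int = 8) -> BevMap:
    """Create an all-zero three-channel map of shape 3 x size x size."""
    return [[[0.0] * size for _ in range(size)] for _ in range(3)]


def toy_generate_map(prev_map: BevMap, obs: Observation, instruction: str) -> BevMap:
    """Toy stand-in for the learned autoregressive map generator.

    It copies the previous map and marks the observed cell in channel 0.
    The instruction is accepted but ignored; a learned model would condition on it.
    """
    new_map = [[row[:] for row in channel] for channel in prev_map]
    row, col = obs
    new_map[0][row][col] = 1.0
    return new_map


def toy_policy(bev_map: BevMap, instruction: str) -> str:
    """Toy stand-in for the learned policy: stop once three cells are marked."""
    seen = sum(sum(row) for row in bev_map[0])
    return "stop" if seen >= 3 else "forward"


def rollout(
    instruction: str,
    observations: Sequence[Observation],
    generate_map: Callable[[BevMap, Observation, str], BevMap] = toy_generate_map,
    policy: Callable[[BevMap, str], str] = toy_policy,
    size: int = 8,
) -> Tuple[List[str], BevMap]:
    """Run the loop: update the map from the previous map, then act from the map."""
    bev_map = empty_map(size)
    actions: List[str] = []
    for obs in observations:
        bev_map = generate_map(bev_map, obs, instruction)  # autoregressive update
        action = policy(bev_map, instruction)              # action reads only the map
        actions.append(action)
        if action == "stop":
            break
    return actions, bev_map
```

Calling `rollout("walk to the kitchen", [(0, 0), (1, 1), (2, 2), (3, 3)])` runs the loop until the toy policy emits `"stop"`. Because the policy sees only the generated map, any training signal on the action would have to pass through the map, which is the coupling the abstract argues for; in the actual framework both components are learned models rather than fixed rules.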