🤖 AI Summary
This work addresses vision-and-language navigation (VLN), in which an agent must follow natural-language instructions in unseen 3D environments. The authors propose SpaAct, a training framework built on two spatial activation tasks: Action Retrospection, which infers the executed action sequence from visual transitions (backward reasoning, “why”), and Future Frame Selection, which predicts the visual transition given history and action (forward prediction, “how”). Together these tasks provide lightweight supervision for dynamic spatial awareness. The authors also introduce TriPA, a tri-factor progressive adaptive curriculum learning method that orders training samples from easy to hard. On standard VLN-CE benchmarks, the approach consistently improves VLM-based navigation and is reported to achieve state-of-the-art performance.
📝 Abstract
Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting vision-language models (VLMs) to VLN requires endowing them with two complementary capabilities for acquiring dynamic spatial awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which requires the model to predict the visual transition conditioned on history and action. These two objectives provide lightweight supervision for both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further stabilize adaptation, we design TriPA, a Tri-factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills, from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.
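To make the two spatial activation tasks concrete, the following is a minimal sketch of how training samples for them might be built from a recorded trajectory. It is not the authors' implementation: the data layout (frame identifiers as strings, one action per transition), the function names, and the distractor-based multiple-choice format for Future Frame Selection are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RetrospectionSample:
    """Hypothetical Action Retrospection sample: infer actions from a visual transition."""
    frames: List[str]          # observation before and after the transition
    target_actions: List[str]  # executed actions the model should recover


@dataclass
class FrameSelectionSample:
    """Hypothetical Future Frame Selection sample: pick the frame that follows an action."""
    history: List[str]
    action: str
    candidates: List[str]
    answer_index: int


def make_retrospection_sample(frames: Sequence[str], actions: Sequence[str],
                              start: int, span: int) -> RetrospectionSample:
    # Assumed layout: executing actions[i] at frames[i] yields frames[i + 1].
    end = start + span
    if span < 1 or end > len(actions):
        raise ValueError("span must cover at least one recorded action")
    return RetrospectionSample(frames=[frames[start], frames[end]],
                               target_actions=list(actions[start:end]))


def make_frame_selection_sample(frames: Sequence[str], actions: Sequence[str],
                                step: int, num_distractors: int,
                                rng: random.Random) -> FrameSelectionSample:
    # The true next frame is mixed with distractors drawn from the same trajectory.
    # Frame identifiers are assumed to be unique.
    true_next = frames[step + 1]
    pool = [f for i, f in enumerate(frames) if i != step + 1]
    distractors = rng.sample(pool, min(num_distractors, len(pool)))
    candidates = distractors + [true_next]
    rng.shuffle(candidates)
    return FrameSelectionSample(history=list(frames[:step + 1]),
                                action=actions[step],
                                candidates=candidates,
                                answer_index=candidates.index(true_next))


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3", "f4"]
    actions = ["forward", "left", "forward", "stop"]
    print(make_retrospection_sample(frames, actions, start=1, span=2))
    print(make_frame_selection_sample(frames, actions, step=2,
                                      num_distractors=3, rng=random.Random(0)))
```

In this sketch the retrospection sample looks backward (which actions explain the change between two observations), while the selection sample looks forward (which observation follows from the history and a given action). How the paper actually samples spans, distractors, and prompts is not stated in the abstract.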