SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work aims to improve vision-and-language navigation (VLN) in unseen 3D environments, where an agent must follow natural-language instructions to reach a target location. The authors propose SpaAct, a training framework built on two complementary spatial activation tasks: Action Retrospection, in which the model infers the executed action sequence from visual transitions (backward reasoning, the “why”), and Future Frame Selection, in which it predicts the visual transition given the history and an action (forward prediction, the “how”). They also introduce TriPA, a Tri-factor Progressive Adaptive curriculum learning method that orders training samples from easy to hard to stabilize adaptation. On standard VLN-CE benchmarks, the authors report that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance.
📝 Abstract
Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring dynamic spatial awareness, namely backward action reasoning (why) and forward transition prediction (how). Based on this insight, we propose SpaAct, a simple yet effective training framework that activates dynamic spatial awareness in VLMs. Specifically, SpaAct introduces two spatial activation tasks: Action Retrospection, which asks the model to infer the executed action sequence from visual transitions, and Future Frame Selection, which asks the model to predict the visual transition conditioned on the history and an action. These two objectives provide lightweight supervision on both backward action reasoning and forward transition prediction, encouraging the model to build dynamic spatial awareness in a VLM-friendly way. To further stabilize adaptation, we design TriPA, a Tri-factor Progressive Adaptive curriculum learning method that organizes training samples from easy to hard, allowing the model to gradually acquire navigation skills from basic locomotion to long-horizon reasoning. Experiments on standard VLN-CE benchmarks show that SpaAct consistently improves VLM-based navigation and achieves state-of-the-art performance. We will release the code and models to support future research.
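To make the two spatial activation tasks and the easy-to-hard ordering more concrete, the following is a minimal Python sketch of how such training samples might be assembled from a recorded trajectory. It is not the authors' implementation: the `Step` record, the function names, the use of same-trajectory frames as distractors, and the use of action-sequence length as a difficulty proxy are all assumptions made for illustration. In particular, the abstract does not name TriPA's three factors, so the curriculum function below stands in with a single hypothetical difficulty score.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Step:
    """One trajectory step (hypothetical record, not from the paper)."""
    frame_id: str  # placeholder for an observation, e.g. an image path
    action: str    # action executed after observing this frame


def action_retrospection_sample(traj: Sequence[Step], start: int, length: int) -> dict:
    """Backward task: show length + 1 consecutive frames, ask for the
    `length` actions that were executed between them."""
    if length < 1 or start < 0 or start + length >= len(traj):
        raise ValueError("window does not fit inside the trajectory")
    frames = [s.frame_id for s in traj[start:start + length + 1]]
    actions = [s.action for s in traj[start:start + length]]
    return {"frames": frames, "target_actions": actions}


def future_frame_selection_sample(
    traj: Sequence[Step], t: int, num_distractors: int, rng: random.Random
) -> dict:
    """Forward task: given frames up to step t and the action taken at t,
    pick the true next frame out of a shuffled candidate set."""
    if t < 0 or t + 1 >= len(traj):
        raise ValueError("step t has no successor frame")
    correct = traj[t + 1].frame_id
    # Assumption: distractors are other frames from the same trajectory.
    pool = [s.frame_id for i, s in enumerate(traj) if i != t + 1]
    candidates = rng.sample(pool, num_distractors) + [correct]
    rng.shuffle(candidates)
    return {
        "history": [s.frame_id for s in traj[:t + 1]],
        "action": traj[t].action,
        "candidates": candidates,
        "answer": candidates.index(correct),
    }


def curriculum_stages(samples: List[dict], num_stages: int) -> List[List[dict]]:
    """Easy-to-hard schedule: stage k trains on the easiest k/num_stages
    of the samples. Difficulty here is just the number of target actions,
    a stand-in for TriPA's (unspecified) three factors."""
    ordered = sorted(samples, key=lambda s: len(s["target_actions"]))
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = max(1, len(ordered) * k // num_stages)
        stages.append(ordered[:cutoff])
    return stages


if __name__ == "__main__":
    actions = ["forward", "left", "forward", "right", "stop"]
    traj = [Step(f"f{i}", a) for i, a in enumerate(actions)]
    print(action_retrospection_sample(traj, start=1, length=2))
    print(future_frame_selection_sample(traj, t=2, num_distractors=2, rng=random.Random(0)))
```

In this sketch, early curriculum stages contain only short action windows (basic locomotion), and later stages progressively add longer windows, which loosely mirrors the progression from basic locomotion to long-horizon reasoning described in the abstract.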
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Spatial Awareness
Embodied AI
Visual Transitions
Natural Language Instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially-Activated Learning
Backward Action Reasoning
Forward Transition Prediction
Curriculum Adaptation
Vision-Language Navigation
Authors
Pengna Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Kangyi Wu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Shaoqing Xu
University of Macau, BUAA, Xiaomi EV
3D Computer Vision, 3D Generation, Vision and Language Model, End2End, World Model
Fang Li
The State Key Laboratory of Internet of Things for Smart City, Centre for Artificial Intelligence and Robotics, University of Macau
Hanbing Li
Xiaomi EV
Lin Zhao
Beijing Institute of Technology; JD Explore Academy
Embodied AI, Robot Learning
Kailin Lyu
Institute of Automation, Chinese Academy of Sciences, China
Long Chen
Xiaomi EV
Zhi-Xin Yang
University of Macau
Intelligent Fault Diagnosis & Maintenance, Robotics Vision and Control for Safety Monitoring
Nanning Zheng
Xi'an Jiaotong University