🤖 AI Summary
This work addresses key challenges in vision-language navigation—limited spatial awareness, misalignment between 2D and 3D representations, and monocular scale ambiguity—by proposing the VLM-as-Brain architecture. The approach formulates navigation as a partially observable semi-Markov decision process (POSMDP) and decouples perception from planning via a plug-and-play skill library. It introduces a cross-space representation mapping that aligns 3D waypoints with image pixels, and further designs a query-driven perceptual chain-of-thought (QD-PCoT), context-aware self-correction, and active exploration mechanisms. A large-scale instruction-tuning dataset with dynamic stage routing, AgentVLN-Instruct, is constructed to endow agents with metacognitive capabilities, including the ability to actively acquire geometric depth information. The method significantly outperforms existing state-of-the-art approaches on multiple long-horizon navigation benchmarks while supporting efficient edge deployment, demonstrating both strong performance and practical utility.
📝 Abstract
Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints onto the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, equipping the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
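The cross-space representation mapping described above can be sketched with a standard pinhole camera projection: 3D topological waypoints expressed in the camera frame are projected onto the image plane to produce pixel coordinates that can anchor visual prompts for the VLM. This is a minimal illustrative sketch, not the paper's actual implementation; the intrinsics matrix `K`, image size, and waypoint coordinates below are assumed values for demonstration only.

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy = 500 px; principal point at image center).
# These values are assumptions, not taken from AgentVLN.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project_waypoints(points_cam, K, img_w=640, img_h=480):
    """Project Nx3 camera-frame waypoints (meters) to pixel coordinates,
    dropping points behind the camera or outside the image bounds."""
    pts = np.asarray(points_cam, dtype=float)
    pts = pts[pts[:, 2] > 0.0]             # keep only points with positive depth
    uv_h = (K @ pts.T).T                   # homogeneous pixel coordinates
    uv = uv_h[:, :2] / uv_h[:, 2:3]        # perspective divide
    in_image = ((uv[:, 0] >= 0) & (uv[:, 0] < img_w) &
                (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
    return uv[in_image]

waypoints = [(0.0, 0.0, 2.0),    # 2 m straight ahead -> image center
             (1.0, 0.0, 4.0),    # ahead and to the right
             (0.0, 0.0, -1.0)]   # behind the camera -> filtered out
pixels = project_waypoints(waypoints, K)
```

Waypoints that survive the visibility check can then be overlaid on the RGB frame as pixel-aligned markers, giving the VLM a 2D-grounded view of candidate 3D navigation targets.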