AgentVLN: Towards Agentic Vision-and-Language Navigation

📅 2026-03-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses key challenges in vision-and-language navigation, namely limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity, by proposing a VLM-as-Brain architecture. The approach formulates navigation as a Partially Observable Semi-Markov Decision Process (POSMDP) and decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. It introduces a cross-space representation mapping that aligns 3D waypoints with image pixels, and designs a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) together with context-aware self-correction and active exploration mechanisms. A large-scale instruction-tuning dataset with dynamic stage routing, AgentVLN-Instruct, is constructed to endow the agent with metacognitive capabilities such as actively acquiring geometric depth information. The method outperforms prior state-of-the-art approaches on multiple long-horizon navigation benchmarks while remaining efficient enough for edge deployment, demonstrating both strong performance and practical utility.
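For readers unfamiliar with the formalism, a sketch of the decision-process model is given below. This is the standard textbook definition of a POSMDP (a POMDP whose actions are temporally extended skills), not notation taken from the paper, whose exact formulation may differ.

\[
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R, \gamma \rangle
\]
% S: latent environment states (the full 3D scene, never observed directly)
% A: temporally extended skills, e.g. drawn from a plug-and-play skill library
% O: observations, e.g. egocentric RGB frames
% T(s', k | s, a): joint distribution over the next state s' and the random
%                  duration k (in steps) for which skill a runs
% Omega(o | s'): observation model
% R(s, a): expected discounted reward accumulated while skill a executes
% gamma: per-step discount factor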

📝 Abstract
Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, equipping the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
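Since the cross-space representation mapping is the bridge the rest of the pipeline builds on, a minimal sketch may help make it concrete: projecting 3D topological waypoints into the current image via a standard pinhole camera model. This is an illustrative assumption, not the authors' released code; the function name project_waypoints, the intrinsics K, and the extrinsics T_world_to_cam are all hypothetical.

import numpy as np

def project_waypoints(waypoints_world, K, T_world_to_cam, image_size):
    """Project 3D waypoints into the image plane (standard pinhole model).

    Illustrative sketch only; AgentVLN's actual mapping may differ.
    waypoints_world: (N, 3) waypoint positions in the world frame.
    K:               (3, 3) camera intrinsics matrix.
    T_world_to_cam:  (4, 4) world-to-camera rigid transform.
    image_size:      (width, height) used to discard off-screen points.
    """
    n = waypoints_world.shape[0]
    # Homogeneous coordinates, then transform into the camera frame.
    pts_h = np.hstack([waypoints_world, np.ones((n, 1))])   # (N, 4)
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]           # (N, 3)

    # Keep only waypoints in front of the camera (positive depth).
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Perspective projection: divide by depth, then apply intrinsics.
    uv = (K @ (pts_cam / pts_cam[:, 2:3]).T).T[:, :2]       # (N', 2)

    # Discard projections that fall outside the image bounds.
    w, h = image_size
    visible = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[visible]  # pixel coordinates usable as visual prompts

The surviving pixel coordinates would then be rendered onto the observation as visual prompts from which the VLM can pick the next waypoint; the actual prompt format is a detail of the paper and is not reproduced here.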
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
spatial perception
2D-3D representation mismatch
monocular scale ambiguity
long-horizon navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-and-Language Navigation
VLM-as-Brain
Cross-space Representation Mapping
Query-Driven Perceptual Chain-of-Thought
POSMDP
Zihao Xin
Nanjing University of Aeronautics and Astronautics
Wentong Li
Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Vision-Language Model · Robotics
Yixuan Jiang
Nanjing University of Aeronautics and Astronautics
Ziyuan Huang
Nanjing University of Aeronautics and Astronautics
Bin Wang
Shandong University of Technology; Shandong University
Video Action Recognition · Computer Vision · Anomaly Detection · Action Prediction, etc.
Piji Li
Nanjing University of Aeronautics and Astronautics
Jianke Zhu
Professor of Computer Science, Zhejiang University
Computer Vision · Robotics
Jie Qin
Professor, Nanjing University of Aeronautics and Astronautics
Computer Vision · Machine Learning · Pattern Recognition
Shengjun Huang
Nanjing University of Aeronautics and Astronautics