P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

📅 2026-05-19
📈 Citations: 0
Influential: 0
📄 PDF

career value

212K/year
🤖 AI Summary
This work addresses the instability in zero-shot vision-and-language navigation caused by the entanglement of directional reasoning and local goal prediction. To resolve this, the authors propose a hierarchical navigation framework that explicitly decouples global direction selection from local target localization: it first selects a navigation direction based on panoramic observations and then performs fine-grained localization in a top-down map. The approach incorporates a sliding-window multi-turn dialogue memory module to preserve long-range semantic consistency and introduces a reliability-aware reflective reorientation mechanism that dynamically evaluates localization confidence and triggers redirection when needed. Evaluated on the R2R-CE benchmark, the method achieves a 146.6% relative improvement (an absolute gain of 58.9%) in success rate over existing zero-shot approaches, demonstrating its effectiveness and robustness.
📝 Abstract
Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Zero-shot Learning
Waypoint Prediction
Navigation Decision-making
Embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot vision-and-language navigation
Panorama-to-Downview reasoning
hierarchical navigation framework
reflective reorientation
sliding-window dialogue memory
🔎 Similar Papers
No similar papers found.
K
Kai Sheng
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China
Liuyi Wang
Liuyi Wang
Tongji University
computer visionnatural language processingartificial intelligence
H
Haojie Dai
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China
Jinlong Li
Jinlong Li
University of Science and Technology of China
Data miningmachine learningdeep learningbig data
Y
Yongrui Qin
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China
Z
Zongtao He
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China
C
Chengju Liu
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China
Q
Qijun Chen
Department of Control Science and Engineering, Tongji University, Shanghai 210804, China