DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation

📅 2025-09-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods rely on passive perception and point-wise action selection, leading to high deployment costs, misaligned action semantics, and myopic planning. This paper introduces the first trajectory-imaginative zero-shot navigation framework for continuous environments, unifying self-correcting perception, trajectory-level planning, and active state reasoning. Specifically, an EgoView Corrector improves egocentric viewpoint alignment; a Trajectory Predictor generates globally coherent paths; and an Imagination Predictor leverages large language model priors to reason about future states. For the first time, the framework achieves end-to-end trajectory-level decision-making solely from egocentric inputs, significantly enhancing language-semantic consistency and long-horizon foresight. It establishes new zero-shot state-of-the-art performance on both the VLN-CE benchmark and real-world scenarios, improving Success Rate (SR) and Success-weighted by Path Length (SPL) by up to 7.49% and 18.15%, respectively, over the strongest baseline.

📝 Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on the following three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor favors global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory, long-horizon planning, we propose an Imagination Predictor that endows the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state-of-the-art (SOTA), outperforming the strongest egocentric baseline that uses extra information by up to 7.49% and 18.15% in terms of SR and SPL metrics. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.
Problem

Research questions and friction points this paper is trying to address.

Reducing sensory costs for zero-shot vision-language navigation deployment
Aligning action semantics with instruction through trajectory-level planning
Enabling anticipatory long-horizon planning with proactive thinking capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoView Corrector stabilizes egocentric perception
Trajectory Predictor enables global trajectory-level planning
Imagination Predictor provides proactive thinking capability
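Based only on the high-level description above, the three modules can be read as one decision loop: correct the egocentric view, propose whole candidate trajectories, then score imagined outcomes against the instruction. The sketch below illustrates that control flow; every class name, interface, and scoring rule is a toy placeholder assumed for illustration, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    heading_deg: float  # current egocentric heading
    image_id: int       # stand-in for an RGB frame


class EgoViewCorrector:
    """Aligns and stabilizes the egocentric viewpoint before planning (assumed interface)."""

    def correct(self, obs: Observation) -> Observation:
        # Toy stabilizer: snap the heading to the nearest 90-degree axis.
        snapped = (round(obs.heading_deg / 90.0) * 90.0) % 360.0
        return Observation(heading_deg=snapped, image_id=obs.image_id)


class TrajectoryPredictor:
    """Proposes whole candidate trajectories rather than single point-wise actions."""

    def propose(self, obs: Observation, n: int = 3) -> list[list[tuple[float, float]]]:
        # Toy candidates: straight 3-step paths fanned around the corrected heading.
        candidates = []
        for k in range(n):
            ang = math.radians(obs.heading_deg + (k - n // 2) * 30)
            candidates.append([(i * math.cos(ang), i * math.sin(ang)) for i in range(1, 4)])
        return candidates


class ImaginationPredictor:
    """Scores imagined future states of each trajectory against the instruction."""

    def score(self, trajectory: list[tuple[float, float]], instruction: str) -> float:
        # Toy language prior: "forward" instructions reward x-axis progress;
        # otherwise penalize lateral drift.
        end_x, end_y = trajectory[-1]
        return end_x if "forward" in instruction else -abs(end_y)


def dreamnav_step(obs: Observation, instruction: str) -> list[tuple[float, float]]:
    """One correct -> propose -> imagine -> select cycle."""
    obs = EgoViewCorrector().correct(obs)
    candidates = TrajectoryPredictor().propose(obs)
    imaginer = ImaginationPredictor()
    return max(candidates, key=lambda t: imaginer.score(t, instruction))
```

The key structural point this illustrates is that selection happens at the trajectory level (a whole path is ranked and chosen), which is what the paper contrasts with point-level action choices.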
Yunheng Wang
The Hong Kong University of Science and Technology (Guangzhou)
Yuetong Fang
Ph.D. Student, HKUST(GZ)
Brain-inspired Computing, Neuromorphic Computing, Embodied AI
Taowen Wang
The Hong Kong University of Science and Technology (Guangzhou)
Yixiao Feng
The Hong Kong University of Science and Technology (Guangzhou)
Yawen Tan
Zhejiang Normal University
Shuning Zhang
Tsinghua University
HCI, Usable Privacy and Security, AI
Peiran Liu
The Hong Kong University of Science and Technology (Guangzhou)
Yiding Ji
The Hong Kong University of Science and Technology (Guangzhou)
Renjing Xu
HKUST(GZ)
Brain-inspired Computing, Humanoid Computing