🤖 AI Summary
Existing zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods rely on passive perception and point-wise action selection, leading to high deployment costs, misaligned action semantics, and myopic planning. This paper introduces the first trajectory-imaginative zero-shot navigation framework for continuous environments, unifying self-correcting perception, trajectory-level planning, and active state reasoning. Specifically, an EgoView Corrector aligns and stabilizes the egocentric viewpoint; a Trajectory Predictor generates globally coherent paths; and an Imagination Predictor leverages large language model priors to reason about future states. To the authors' knowledge, this is the first framework to achieve end-to-end trajectory-level decision-making solely from egocentric inputs, significantly improving language-semantic consistency and long-horizon foresight. It sets a new zero-shot state of the art on both the VLN-CE benchmark and real-world scenarios, improving Success Rate (SR) and Success weighted by Path Length (SPL) by up to 7.49% and 18.15%, respectively, over the strongest baseline.
📝 Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor performs global trajectory-level planning that better aligns with instruction semantics; and (3) to enable anticipatory, long-horizon planning, our Imagination Predictor endows the agent with proactive thinking capability. On VLN-CE and in real-world tests, DreamNav sets a new zero-shot state of the art (SOTA), outperforming the strongest egocentric baseline that uses extra information by up to 7.49% and 18.15% in SR and SPL, respectively. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.