Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters

📅 2026-02-06
📈 Citations: 1
✨ Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the challenge of achieving seamless, instruction-driven navigation from outdoor to indoor environments without reliance on precise prior information, a capability lacking in existing embodied navigation methods. We formally define, for the first time, an outdoor-to-indoor vision-and-language navigation task that operates without external priors and propose a vision-centric, end-to-end framework that relies solely on egocentric visual inputs and natural language instructions to perform continuous navigation. Key contributions include the first open-source dataset for this task, trajectory-conditioned video synthesis for scalable data generation, and an image-prompt-based decision-making mechanism. Experimental results demonstrate that our approach significantly outperforms state-of-the-art baselines in both success rate and path efficiency.
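To make the summary's "image-prompt-based decision-making" more concrete, here is a minimal sketch of such a loop, assuming a small discrete action space and a vision-language policy that maps recent egocentric frames plus the instruction to one action per step. All names here (ACTIONS, Observation, VisionLanguagePolicy, navigate, env) are illustrative assumptions, not the paper's actual interfaces.

```python
# Hedged sketch of an image-prompt-based decision loop for
# instruction-guided navigation. The action space, context window,
# and environment interface are assumptions for illustration only.
from dataclasses import dataclass
from typing import Protocol, Sequence

ACTIONS = ("FORWARD", "TURN_LEFT", "TURN_RIGHT", "ENTER", "STOP")

@dataclass
class Observation:
    rgb: bytes          # egocentric camera frame (encoded image)
    instruction: str    # natural-language goal, e.g. "enter the glass door"

class VisionLanguagePolicy(Protocol):
    def decide(self, history: Sequence[Observation]) -> str:
        """Map recent egocentric frames + instruction to one action in ACTIONS."""

def navigate(policy: VisionLanguagePolicy, env, max_steps: int = 200) -> bool:
    """Roll out the policy until it issues STOP or the step budget runs out."""
    history: list[Observation] = []
    obs = env.reset()                      # env is an assumed interface
    for _ in range(max_steps):
        history.append(obs)
        action = policy.decide(history[-8:])  # short visual context window
        if action == "STOP":
            return env.at_goal()
        obs = env.step(action)
    return False
```

The loop is deliberately prior-free in the sense the summary describes: it consumes only egocentric frames and the instruction, with no map, GPS, or coordinate input.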

📝 Abstract
Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
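The abstract describes the trajectory-conditioned video synthesis pipeline only at a high level. A hedged sketch of how such a pipeline could turn trajectories into supervised navigation data appears below; the pose-to-action labeling rule and the video_model.generate conditioning call are assumptions for illustration, and the paper's actual generation process may differ.

```python
# Hedged sketch of trajectory-conditioned data generation: derive per-step
# actions from pose deltas along a trajectory, then pair them with frames
# from a (stubbed) trajectory-conditioned video generator. All names and
# thresholds are illustrative assumptions, not the paper's pipeline.
import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians

def actions_from_trajectory(poses: list[Pose], turn_thresh: float = 0.3) -> list[str]:
    """Label each step FORWARD / TURN_LEFT / TURN_RIGHT from consecutive poses."""
    labels = []
    for prev, cur in zip(poses, poses[1:]):
        # Wrap the heading change into [-pi, pi] before thresholding.
        dh = math.atan2(math.sin(cur.heading - prev.heading),
                        math.cos(cur.heading - prev.heading))
        if dh > turn_thresh:
            labels.append("TURN_LEFT")
        elif dh < -turn_thresh:
            labels.append("TURN_RIGHT")
        else:
            labels.append("FORWARD")
    return labels

def synthesize_episode(poses: list[Pose], instruction: str, video_model):
    """Pair generated frames with derived actions -> (frame, instruction, action).

    video_model.generate is an assumed conditioning API; zip truncates to the
    shorter of the frame and action sequences.
    """
    frames = video_model.generate(poses)
    actions = actions_from_trajectory(poses)
    return [(f, instruction, a) for f, a in zip(frames, actions)]
```

The appeal of this style of pipeline is scalability: trajectories are cheap to sample, and the generator supplies the paired egocentric video, so (frame, instruction, action) tuples can be produced without real-world data collection.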
Problem

Research questions and friction points this paper is trying to address.

embodied navigation
indoor-outdoor transition
instruction-guided
last-meter delivery
vision-centric
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-centric navigation
instruction-guided embodied navigation
out-to-in navigation
prior-free navigation
trajectory-conditioned video synthesis
Yuxiang Zhao
Shanghai Jiao Tong University
text-to-speech, artificial intelligence, deepfake detection

Yirong Yang
AMAP CV Lab, Alibaba Group

Yanqing Zhu
AMAP CV Lab, Alibaba Group

Yanfen Shen
AMAP CV Lab, Alibaba Group

Chiyu Wang
AMAP CV Lab, Alibaba Group

Zhining Gu
Arizona State University
GIS, Deep Learning, Machine Learning

Pei Shi
AMAP CV Lab, Alibaba Group

Wei Guo
AMAP CV Lab, Alibaba Group

Mu Xu
Alibaba
CV, LLM, VLM, VLA, RL