FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation

📅 2026-03-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-and-language navigation (VLN) approaches, which rely on verbose instructions and lack explicit modeling of global spatial structure. To overcome these challenges, we propose FloorPlan-VLN, a novel paradigm that leverages semantically annotated architectural floor plans as global priors in conjunction with concise natural language instructions. We introduce a new dataset comprising 72 scenes and over 10,000 navigation trajectories, along with FP-Nav, a method that enables end-to-end joint learning across observations, floor plans, and instructions through dual-view spatio-temporal alignment of video sequences and auxiliary reasoning tasks. Experimental results demonstrate that our approach achieves a relative improvement of more than 60% in navigation success rate over adapted state-of-the-art baselines on the proposed benchmark, while exhibiting strong robustness to execution errors and map distortions.
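To make the dataset description concrete, the sketch below shows one plausible way an episode could be represented in code. Every field name here is hypothetical; it reflects only what the summary states (scenes, annotated floor plans, concise instructions, trajectories), not the released schema.

```python
# A minimal, hypothetical sketch of one FloorPlan-VLN episode record.
# Field names are assumptions based on the summary above, not the
# actual dataset schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FloorPlanVLNEpisode:
    scene_id: str                           # one of the 72 scenes
    floor_plan_path: str                    # semantically annotated floor plan image
    instruction: str                        # concise goal, no step-by-step guidance
    start_pose: Tuple[float, float, float]  # (x, y, heading) at episode start
    goal_position: Tuple[float, float]      # (x, y) the agent must reach
    reference_path: List[Tuple[float, float]] = field(default_factory=list)

# Illustrative example; "17DRP5sb8fy" is a real Matterport3D scene id,
# but its inclusion in this dataset is an assumption.
episode = FloorPlanVLNEpisode(
    scene_id="17DRP5sb8fy",
    floor_plan_path="plans/17DRP5sb8fy_floor0.png",
    instruction="Go to the sofa in the living room.",
    start_pose=(1.2, -3.4, 90.0),
    goal_position=(4.8, 0.6),
    reference_path=[(1.2, -3.4), (2.5, -1.9), (4.8, 0.6)],
)
```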

📝 Abstract
Existing Vision-Language Navigation (VLN) tasks require agents to follow verbose instructions while ignoring potentially useful global spatial priors, which limits their ability to reason about spatial structure. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise instructions. We first construct the FloorPlan-VLN dataset, which comprises over 10k episodes across 72 scenes. It pairs more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions that omit step-by-step guidance. We then propose a simple yet effective method, FP-Nav, which uses a dual-view, spatio-temporally aligned video sequence and auxiliary reasoning tasks to align observations, floor plans, and instructions. When evaluated on this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate the feasibility and robustness of FP-Nav under actuation drift and floor plan distortions. These results validate the effectiveness of floor-plan-guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.
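The abstract does not detail how the dual-view sequence is constructed; the sketch below illustrates one plausible reading, in which each ego-centric frame is paired with a floor plan crop centred on the agent's pose at the same timestep, so both views stay spatio-temporally aligned. The function name, crop size, and metre-to-pixel scale are all assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a dual-view,
# spatio-temporally aligned sequence: each ego-centric RGB frame is
# paired with a floor plan crop centred on the agent's pose at the
# same timestep. Scale, offsets, and crop size are placeholder values.
from typing import List, Tuple
import numpy as np

def build_dual_view_sequence(
    ego_frames: List[np.ndarray],             # one RGB frame per timestep
    floor_plan: np.ndarray,                   # annotated floor plan image (H, W, C)
    poses: List[Tuple[float, float, float]],  # (x, y, heading) per timestep
    pixels_per_meter: float = 20.0,           # hypothetical map calibration
    crop_size: int = 128,                     # hypothetical local window
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Pair each observation with a pose-centred floor plan view."""
    assert len(ego_frames) == len(poses), "views must share the time axis"
    h, w = floor_plan.shape[:2]
    half = crop_size // 2
    sequence = []
    for frame, (x, y, _heading) in zip(ego_frames, poses):
        # World -> pixel coordinates, assuming a centred map origin.
        px = int(round(x * pixels_per_meter)) + w // 2
        py = int(round(y * pixels_per_meter)) + h // 2
        crop = floor_plan[
            max(py - half, 0) : min(py + half, h),
            max(px - half, 0) : min(px + half, w),
        ]
        sequence.append((frame, crop))
    return sequence
```

Downstream, such paired views could be tokenized and consumed jointly with the instruction by the navigation policy; this is presumably where the paper's auxiliary reasoning tasks attach, though the abstract does not specify the interface.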
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Floor Plan
Spatial Reasoning
Global Spatial Priors
Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Floor Plan Guided Navigation
Vision-Language Navigation
Spatial Reasoning
Semantic Floor Plans
FP-Nav
👥 Authors
Kehan Chen
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences and School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Yan Huang
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Deep Learning · Multimodal Learning
Dong An
AMap, Alibaba Group
Jiawei He
Borealis AI
Generative Models · Time-Series Analysis
Yifei Su
Institute of Automation, Chinese Academy of Sciences
Embodied AI · Multimodal Learning
Jing Liu
FiveAges
Nianfeng Liu
FiveAges
Liang Wang
National Lab of Pattern Recognition
Computer Vision · Pattern Recognition · Machine Learning