FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation

📅 2026-03-18
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision-and-language navigation (VLN) approaches, which rely on verbose instructions and lack explicit modeling of global spatial structure. To overcome these challenges, we propose FloorPlan-VLN, a novel paradigm that leverages semantically annotated architectural floor plans as global priors in conjunction with concise natural language instructions. We introduce a new dataset comprising 72 scenes and over 10,000 navigation trajectories, along with FP-Nav, a method that enables end-to-end joint learning across observations, floor plans, and instructions through dual-view spatio-temporal alignment of video sequences and auxiliary reasoning tasks. Experimental results demonstrate that our approach achieves a relative improvement of more than 60% in navigation success rate over adapted state-of-the-art baselines on the proposed benchmark, while exhibiting strong robustness to execution errors and map distortions.
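To make the dataset description concrete, the sketch below shows one plausible way an episode could be represented in code. Every field name here is hypothetical; it reflects only what the summary states (scenes, annotated floor plans, concise instructions, trajectories), not the released schema.

```python
# A minimal, hypothetical sketch of one FloorPlan-VLN episode record.
# Field names are assumptions based on the summary above, not the
# actual dataset schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FloorPlanVLNEpisode:
    scene_id: str                           # one of the 72 scenes
    floor_plan_path: str                    # semantically annotated floor plan image
    instruction: str                        # concise goal, no step-by-step guidance
    start_pose: Tuple[float, float, float]  # (x, y, heading) at episode start
    goal_position: Tuple[float, float]      # (x, y) the agent must reach
    reference_path: List[Tuple[float, float]] = field(default_factory=list)

# Illustrative example; "17DRP5sb8fy" is a real Matterport3D scene id,
# but its inclusion in this dataset is an assumption.
episode = FloorPlanVLNEpisode(
    scene_id="17DRP5sb8fy",
    floor_plan_path="plans/17DRP5sb8fy_floor0.png",
    instruction="Go to the sofa in the living room.",
    start_pose=(1.2, -3.4, 90.0),
    goal_position=(4.8, 0.6),
    reference_path=[(1.2, -3.4), (2.5, -1.9), (4.8, 0.6)],
)
```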

📝 Abstract
Existing Vision-Language Navigation (VLN) tasks require agents to follow verbose instructions while ignoring potentially useful global spatial priors, which limits their ability to reason about spatial structure. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise instructions. We first construct the FloorPlan-VLN dataset, which comprises over 10k episodes across 72 scenes. It pairs more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions that omit step-by-step guidance. We then propose a simple yet effective method, FP-Nav, which uses a dual-view, spatio-temporally aligned video sequence and auxiliary reasoning tasks to align observations, floor plans, and instructions. When evaluated on this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate the feasibility and robustness of FP-Nav under actuation drift and floor plan distortions. These results validate the effectiveness of floor-plan-guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.
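The abstract does not detail how the dual-view sequence is constructed; the sketch below illustrates one plausible reading, in which each ego-centric frame is paired with a floor plan crop centred on the agent's pose at the same timestep, so both views stay spatio-temporally aligned. The function name, crop size, and metre-to-pixel scale are all assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a dual-view,
# spatio-temporally aligned sequence: each ego-centric RGB frame is
# paired with a floor plan crop centred on the agent's pose at the
# same timestep. Scale, offsets, and crop size are placeholder values.
from typing import List, Tuple
import numpy as np

def build_dual_view_sequence(
    ego_frames: List[np.ndarray],             # one RGB frame per timestep
    floor_plan: np.ndarray,                   # annotated floor plan image (H, W, C)
    poses: List[Tuple[float, float, float]],  # (x, y, heading) per timestep
    pixels_per_meter: float = 20.0,           # hypothetical map calibration
    crop_size: int = 128,                     # hypothetical local window
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Pair each observation with a pose-centred floor plan view."""
    assert len(ego_frames) == len(poses), "views must share the time axis"
    h, w = floor_plan.shape[:2]
    half = crop_size // 2
    sequence = []
    for frame, (x, y, _heading) in zip(ego_frames, poses):
        # World -> pixel coordinates, assuming a centred map origin.
        px = int(round(x * pixels_per_meter)) + w // 2
        py = int(round(y * pixels_per_meter)) + h // 2
        crop = floor_plan[
            max(py - half, 0) : min(py + half, h),
            max(px - half, 0) : min(px + half, w),
        ]
        sequence.append((frame, crop))
    return sequence
```

Downstream, such paired views could be tokenized and consumed jointly with the instruction by the navigation policy; this is presumably where the paper's auxiliary reasoning tasks attach, though the abstract does not specify the interface.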
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
Floor Plan
Spatial Reasoning
Global Spatial Priors
Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Floor Plan Guided Navigation
Vision-Language Navigation
Spatial Reasoning
Semantic Floor Plans
FP-Nav
👥 Authors
Kehan Chen
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences and School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Yan Huang
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Deep Learning · Multimodal Learning
Dong An
AMap, Alibaba Group
Jiawei He
Borealis AI
Generative Models · Time-Series Analysis
Yifei Su
Institute of Automation, Chinese Academy of Sciences
Embodied AI · Multimodal Learning
Jing Liu
FiveAges
Nianfeng Liu
FiveAges
Liang Wang
National Lab of Pattern Recognition
Computer Vision · Pattern Recognition · Machine Learning