IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Research on Vision-Language Navigation (VLN) for indoor drones remains scarce, particularly lacking benchmarks and methods for long-horizon navigation in continuous 3D space and realistic modeling of UAV flight dynamics. Method: We introduce IndoorUAV-Bench—the first dedicated VLN benchmark for indoor drones—comprising two subtasks: long-horizon navigation (IndoorUAV-VLN) and short-horizon action planning (IndoorUAV-VLA). We propose IndoorUAV-Agent, a unified agent integrating task decomposition, multimodal reasoning, and UAV-specific dynamical constraints. Our pipeline generates over 16,000 high-quality expert trajectories with automated natural-language instruction annotation. Built upon Habitat, the benchmark incorporates keyframe semantic segmentation, multi-granularity instruction generation, and a multimodal fusion network. Contribution/Results: Experiments demonstrate significant improvements in instruction-following accuracy and cross-environment generalization within complex indoor scenes, establishing a foundational framework for embodied drone navigation research.

📝 Abstract
Vision-Language Navigation (VLN) enables agents to navigate complex environments by following natural language instructions grounded in visual observations. Although most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs), indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce IndoorUAV, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics to manually collect diverse 3D navigation trajectories, further enriched through data augmentation techniques. Furthermore, we design an automated annotation pipeline to generate natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the IndoorUAV-VLN subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the IndoorUAV-VLA subset. Finally, we introduce IndoorUAV-Agent, a novel navigation model designed for our benchmark, leveraging task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.
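The keyframe-based segmentation step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `Frame` type, the per-frame `saliency` score, and the threshold value are all hypothetical stand-ins for however the paper actually scores semantic salience.

```python
# Illustrative sketch: splitting a long trajectory into short-horizon
# sub-trajectories at semantically salient keyframes (IndoorUAV-VLA style).
# The saliency score and threshold are assumptions, not the paper's method.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Frame:
    pose: Tuple[float, float, float, float]  # (x, y, z, yaw) of the UAV
    saliency: float                          # hypothetical semantic-saliency score

def segment_at_keyframes(trajectory: List[Frame],
                         threshold: float = 0.8) -> List[List[Frame]]:
    """Cut the trajectory after every frame whose saliency exceeds the threshold."""
    subs, current = [], []
    for frame in trajectory:
        current.append(frame)
        if frame.saliency >= threshold and len(current) > 1:
            subs.append(current)
            current = [frame]  # the keyframe also opens the next segment
    if len(current) > 1:
        subs.append(current)
    return subs

traj = [Frame((i, 0.0, 1.5, 0.0), s)
        for i, s in enumerate([0.1, 0.2, 0.9, 0.3, 0.95, 0.1])]
print([len(s) for s in segment_at_keyframes(traj)])  # → [3, 3, 2]
```

Each resulting sub-trajectory would then be paired with a regenerated, concise instruction; sharing the keyframe between adjacent segments keeps the segments contiguous in space.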
Problem

Research questions and friction points this paper is trying to address.

Indoor UAV vision-language navigation lacks dedicated benchmarks and methods.
Builds a benchmark covering both long- and short-horizon aerial navigation tasks.
Develops a navigation model for UAVs using multimodal reasoning in indoor spaces.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulated UAV flight dynamics in diverse 3D indoor scenes
Automated annotation pipeline for natural language instructions
Navigation model with task decomposition and multimodal reasoning
Xu Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Yu Liu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Hanshuo Qiu
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Yang Qirong
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Zhouhui Lian
Peking University
Computer Graphics · Computer Vision · AI