SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mismatch between 2D perception and 3D decision-making in vision-and-language navigation for drones operating in complex urban environments. The authors propose SpatialFly, a framework that relies solely on RGB inputs and introduces a geometry-guided 2D representation alignment mechanism to inject global geometric priors into semantic features. By integrating cross-modal attention, the method implicitly aligns 2D semantics with 3D geometry without requiring explicit 3D reconstruction. Key innovations include geometric prior injection, geometry-aware reparameterization, and gated residual fusion, which collectively enhance spatial reasoning capabilities. Experiments demonstrate that SpatialFly outperforms state-of-the-art methods in both seen and unseen scenarios, reducing path error by 4.03 meters and improving success rate by 1.27% on the Full unseen test set, while generating smoother trajectories with better path alignment.

📝 Abstract
UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV vision-and-language navigation (VLN) in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error (NE) by 4.03 m and improving success rate (SR) by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
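The abstract describes a two-stage mechanism: cross-modal attention that lets 2D semantic tokens query 3D geometric tokens, followed by gated residual fusion that folds the geometry-informed result back into the semantic stream. The paper does not publish this code, so the following is only a minimal NumPy sketch of those two generic operations under assumed token shapes and a hypothetical sigmoid gate; function names, dimensions, and the gate parameterization are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the chosen axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(sem, geo):
    """Scaled dot-product cross-attention: 2D semantic tokens act as
    queries, 3D geometric tokens as keys and values."""
    d = sem.shape[-1]
    attn = softmax(sem @ geo.T / np.sqrt(d))  # (N_sem, N_geo) weights
    return attn @ geo                         # geometry-informed tokens

def gated_residual_fusion(sem, aligned, w_gate, b_gate):
    """Fuse aligned tokens back into the semantic stream through a
    learned sigmoid gate; the residual path keeps the original
    semantics intact (the stated goal of preserving discrimination)."""
    z = np.concatenate([sem, aligned], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))           # element-wise gate in (0, 1)
    return sem + gate * aligned

# toy example with random tokens (hypothetical sizes)
rng = np.random.default_rng(0)
sem = rng.standard_normal((16, 32))   # 16 semantic tokens, dim 32
geo = rng.standard_normal((8, 32))    # 8 geometric tokens, dim 32
aligned = cross_modal_align(sem, geo)
fused = gated_residual_fusion(
    sem, aligned,
    rng.standard_normal((64, 32)) * 0.1,  # gate weights (toy init)
    np.zeros(32),                          # gate bias
)
print(fused.shape)  # same shape as the semantic tokens
```

Note the design point this sketch illustrates: because fusion is additive (`sem + gate * aligned`), a gate near zero degrades gracefully to the unmodified semantic tokens, which is one plausible way such a module can inject geometry without washing out semantic features.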
Problem

Research questions and friction points this paper is trying to address.

UAV VLN
representation mismatch
spatial reasoning
3D navigation
urban environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-guided representation
UAV vision-and-language navigation
2D-3D alignment
cross-modal attention
semantic-geometric fusion
Wen Jiang
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Kangyao Huang
Tsinghua University
Robot Learning, Aerial Robotics
Li Wang
Beijing Institute of Technology
Autonomous Driving; Robotics; Embodied Intelligence; Human-Machine Collaboration; Intelligent Perception
Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence
Wei Fan
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Jinyuan Liu
Dalian University of Technology
image processing, deep learning, image fusion
Shaoyu Liu
School of Artificial Intelligence, Xidian University, Xi’an 710071, China
Hanfang Liang
Jianghan University, Wuhan 430056, China
Hongwei Duan
Nanyang Technological University
colloids and interfaces, in vitro diagnostics, plasmonics, nanomedicine
Bin Xu
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Xiangyang Ji
Department of Automation, Tsinghua University, Beijing 100084, China