🤖 AI Summary
This work addresses the mismatch between 2D perception and 3D decision-making in vision-and-language navigation for drones operating in complex urban environments. The authors propose SpatialFly, a framework that relies solely on RGB inputs and introduces a geometry-guided 2D representation alignment mechanism to inject global geometric priors into semantic features. By integrating cross-modal attention, the method implicitly aligns 2D semantics with 3D geometry without requiring explicit 3D reconstruction. Key innovations include geometric prior injection, geometry-aware reparameterization, and gated residual fusion, which collectively enhance spatial reasoning capabilities. Experiments demonstrate that SpatialFly outperforms state-of-the-art methods in both seen and unseen scenarios, reducing path error by 4.03 meters and improving success rate by 1.27% on the Full unseen test set, while generating smoother trajectories with better path alignment.
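The two headline numbers above are a path (navigation) error in meters and a success rate in percent. As a rough illustration of how such metrics are commonly computed, here is a minimal sketch. It assumes path error is the Euclidean distance between the final UAV position and the goal, and that an episode succeeds when that distance falls within a threshold. The function names, the 20 m `success_radius_m` value, and the toy episodes are hypothetical and are not taken from the paper.

```python
import math

def navigation_error(final_pos, goal_pos):
    """Euclidean distance (meters) between where the UAV stopped and the goal."""
    return math.dist(final_pos, goal_pos)

def evaluate(episodes, success_radius_m=20.0):
    """Return (mean navigation error in m, success rate in %).

    `episodes` is a list of (final_pos, goal_pos) pairs of 3D points.
    `success_radius_m` is an assumed threshold, not the benchmark's value.
    """
    errors = [navigation_error(final, goal) for final, goal in episodes]
    mean_error = sum(errors) / len(errors)
    successes = sum(err <= success_radius_m for err in errors)
    success_rate = 100.0 * successes / len(errors)
    return mean_error, success_rate

# Toy data: one episode ends 5 m from the goal, the other 50 m away.
episodes = [
    ((0.0, 0.0, 10.0), (3.0, 4.0, 10.0)),
    ((100.0, 0.0, 20.0), (100.0, 50.0, 20.0)),
]
ne, sr = evaluate(episodes)
print(f"NE = {ne:.2f} m, SR = {sr:.1f}%")
```

Under these definitions a lower error and a higher success rate are both improvements, which is the direction of the reported changes.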
📝 Abstract
Unmanned aerial vehicles (UAVs) play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV vision-and-language navigation (VLN) in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error (NE) by 4.03 m and improving success rate (SR) by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
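The three steps named in the abstract (geometric prior injection, cross-modal attention from 2D semantic tokens to 3D geometric tokens, and gated residual fusion) can be sketched as follows. This is only an illustrative NumPy sketch under stated assumptions, not the authors' implementation: mean-pooling the geometric tokens into a global cue, adding it to each semantic token, single-head attention, and a sigmoid gate over concatenated features are all guesses at plausible forms. All function names, shapes, and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inject_geometric_prior(sem, geo, w_prior):
    """Assumed form: pool geometric tokens into one global vector,
    project it, and add it to every 2D semantic token."""
    global_cue = geo.mean(axis=0)        # (d,)
    return sem + global_cue @ w_prior    # broadcasts over (n_sem, d)

def cross_attend(sem, geo, w_q, w_k, w_v):
    """Single-head cross-attention: semantic tokens query geometric tokens."""
    q, k, v = sem @ w_q, geo @ w_k, geo @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_sem, n_geo)
    return weights @ v, weights

def gated_residual_fusion(sem, aligned, w_gate, b_gate):
    """Assumed form: a per-channel sigmoid gate decides how much aligned
    geometry is added, while the residual path keeps the semantic tokens."""
    gate = sigmoid(np.concatenate([sem, aligned], axis=-1) @ w_gate + b_gate)
    return sem + gate * aligned

rng = np.random.default_rng(0)
n_sem, n_geo, d = 6, 4, 8                 # toy token counts and width
sem = rng.normal(size=(n_sem, d))         # stand-in for 2D semantic tokens
geo = rng.normal(size=(n_geo, d))         # stand-in for 3D geometric tokens
w_prior, w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
w_gate = rng.normal(scale=0.1, size=(2 * d, d))
b_gate = np.zeros(d)

guided = inject_geometric_prior(sem, geo, w_prior)
aligned, attn = cross_attend(guided, geo, w_q, w_k, w_v)
fused = gated_residual_fusion(guided, aligned, w_gate, b_gate)
print(fused.shape)
```

In this sketch, when the gate saturates near zero the fused tokens reduce to the geometry-guided semantic tokens. That is one way to read the abstract's remark that gated residual fusion preserves semantic discrimination.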