See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper proposes a training-free, general-purpose vision-language navigation framework for unmanned aerial vehicles (UAVs) that targets 3D goal navigation across diverse environments under free-form natural language instructions. The method has three parts: (1) a vision-language model (VLM) parses the instruction and annotates 2D waypoints on the input image; (2) each 2D waypoint, together with a predicted and adaptively adjusted travel distance, is mapped to a 3D displacement vector; and (3) closed-loop control lets the UAV follow dynamic targets. The key contribution is to formulate action prediction as a 2D spatial grounding problem rather than as text generation, which avoids any policy training; the framework also generalizes across different VLMs. On the DRL simulation benchmark, the approach outperforms the previous best method by an absolute margin of 63%; in real-world experiments it outperforms strong baselines by a large margin.

📝 Abstract

We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instruction in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF shows remarkable generalization to different VLMs. Project page: https://spf-web.pages.dev
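The abstract states that a predicted 2D waypoint and a predicted traveling distance are combined into a 3D displacement vector, but it does not give the formula. The following is a minimal sketch of one plausible mapping, assuming an ideal pinhole camera with a known horizontal field of view and a camera frame with x right, y down, z forward. The function name `waypoint_to_displacement` and all of its parameters are hypothetical and are not taken from the paper.

```python
import math


def waypoint_to_displacement(u, v, distance_m, width, height, hfov_deg):
    """Map a pixel waypoint (u, v) and a travel distance to a 3D displacement.

    Assumption (not from the paper): ideal pinhole camera, square pixels,
    principal point at the image center, camera frame x right / y down / z forward.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Direction of the ray through the pixel, expressed at unit depth.
    x = (u - width / 2.0) / focal_px
    y = (v - height / 2.0) / focal_px
    z = 1.0

    # Rescale the ray so its length equals the requested travel distance.
    scale = distance_m / math.sqrt(x * x + y * y + z * z)
    return (x * scale, y * scale, z * scale)


if __name__ == "__main__":
    # A waypoint at the image center yields a purely forward displacement.
    print(waypoint_to_displacement(320, 240, 2.0, 640, 480, 90.0))
```

Under these assumptions, a waypoint at the image center maps to motion straight ahead, and waypoints toward the image edge trade forward motion for lateral or vertical motion while the total displacement length stays equal to the requested distance.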

Problem

Research questions and friction points this paper is trying to address.

How to navigate a UAV from free-form language instructions without any training

How to ground vague language commands in the image as 2D waypoints

How to follow dynamic targets in changing environments using closed-loop control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free VLM framework for aerial navigation

Converts 2D waypoints to 3D displacement vectors

Closed-loop control for dynamic target following
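The closed-loop behavior listed above can be pictured as a loop that re-observes the scene before every action, so a moving target is re-localized at each step. The sketch below is illustrative only: the camera, the VLM query, the waypoint-to-displacement mapping, and the flight command are all injected as hypothetical callables, and the stopping rule (the predictor returning `None`) is an assumption rather than the paper's termination criterion.

```python
from typing import Any, Callable, Optional, Tuple

Waypoint = Tuple[float, float, float]      # (u, v, distance): hypothetical format
Displacement = Tuple[float, float, float]  # (dx, dy, dz)


def run_closed_loop(
    get_frame: Callable[[], Any],
    predict_waypoint: Callable[[Any], Optional[Waypoint]],
    to_displacement: Callable[[Waypoint], Displacement],
    send_command: Callable[[Displacement], None],
    max_steps: int = 50,
) -> int:
    """Repeat observe -> predict -> act until the predictor signals completion.

    Returns the number of commands that were sent.
    """
    steps = 0
    for _ in range(max_steps):
        frame = get_frame()                 # fresh observation every iteration
        waypoint = predict_waypoint(frame)  # stand-in for the VLM query
        if waypoint is None:                # assumed "goal reached" signal
            break
        send_command(to_displacement(waypoint))
        steps += 1
    return steps


if __name__ == "__main__":
    # Stubbed demo: the fake predictor stops after three waypoints.
    remaining = [(320.0, 240.0, 1.0)] * 3
    sent = []
    n = run_closed_loop(
        get_frame=lambda: None,
        predict_waypoint=lambda _frame: remaining.pop() if remaining else None,
        to_displacement=lambda wp: (0.0, 0.0, wp[2]),
        send_command=sent.append,
    )
    print(n, sent)
```

Because each iteration starts from a new frame, errors from one step do not accumulate silently: the next prediction is made relative to wherever the UAV and the target actually are.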
