FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in city-scale UAV vision-language navigation—including insufficient multimodal fusion, poor cross-environment generalization, and opaque decision-making—this paper proposes a two-stage training paradigm comprising Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), integrated with a chain-of-thought (CoT) reasoning mechanism for deep vision-language model (VLM) fusion. The approach jointly enhances navigation accuracy, environmental generalizability, and decision interpretability. Evaluated on the CityNav benchmark, it achieves state-of-the-art performance, improving the success rate in unseen environments by 9.22% over the strongest baseline. This advancement delivers a more robust and trustworthy navigation framework for real-world applications such as disaster response, urban logistics delivery, and infrastructure inspection.

📝 Abstract
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) on high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.
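The GRPO stage described above can be sketched as follows. The composite reward is a weighted combination of goal-accuracy, reasoning-quality, and format-compliance scores, and GRPO normalizes each sampled response's reward within its group to obtain advantages. Note this is a minimal illustrative sketch: the weights, score inputs, and function names are assumptions, not values or code from the paper.

```python
import statistics

def composite_reward(goal_score, reasoning_score, format_score,
                     w_goal=0.6, w_reason=0.3, w_format=0.1):
    """Weighted sum of the three reward components named in the abstract.
    The weights here are illustrative assumptions, not the paper's values."""
    return w_goal * goal_score + w_reason * reasoning_score + w_format * format_score

def grpo_advantages(rewards):
    """GRPO's group-relative advantage: for a group of sampled responses
    with rewards r_1..r_G, A_i = (r_i - mean(r)) / std(r)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are normalized within each sampled group rather than against a learned value baseline, GRPO needs no separate critic network, which is part of its appeal for fine-tuning large VLMs.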
Problem

Research questions and friction points this paper is trying to address.

Improving multimodal fusion in UAV vision-language navigation
Enhancing generalization and adaptability in VLN tasks
Increasing interpretability of decision-making in UAV navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language Models for multimodal fusion
Employs two-stage training with SFT and GRPO
Introduces Chain-of-Thought reasoning for interpretability