Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic misalignment in vision-language models (VLMs) caused by target confusion, scale variation, and complex backgrounds in UAV imagery, this paper proposes AerialVP, an agent framework that enhances task prompts for aerial visual perception. It uses a three-stage mechanism (task analysis, tool selection, and enhanced prompt generation) to inject multi-dimensional auxiliary information extracted from the image into the task prompt, strengthening task-image semantic alignment. The paper also introduces AerialSense, a comprehensive benchmark covering Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding across diverse resolutions, lighting conditions, and both urban and natural scenes. Experiments show that AerialVP delivers stable and substantial performance gains on both open-source and proprietary VLMs.

📝 Abstract
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
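The three enhancement stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the keyword rules, the task-type labels, the `scene_tool` and `scale_tool` helpers, and the `image_info` dictionary are all hypothetical stand-ins. In the actual framework, task analysis and auxiliary information extraction are presumably performed by an agent and real perception tools rather than by fixed rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: every name below is a hypothetical placeholder for illustration;
# none of it is taken from the AerialVP code base.


@dataclass
class TaskAnalysis:
    task_type: str       # e.g. "grounding", "vqa", "reasoning"
    needs: List[str]     # kinds of auxiliary information to add


def analyze_task(prompt: str) -> TaskAnalysis:
    """Stage 1: infer the task type and enhancement needs (toy keyword rules)."""
    text = prompt.lower()
    if any(word in text for word in ("locate", "find", "where")):
        return TaskAnalysis("grounding", ["scale", "scene"])
    if text.rstrip().endswith("?"):
        return TaskAnalysis("vqa", ["scene"])
    return TaskAnalysis("reasoning", ["scene", "scale"])


def scene_tool(image_info: dict) -> str:
    """Hypothetical tool: describe the scene type of the image."""
    return f"Scene type: {image_info.get('scene', 'unknown')}."


def scale_tool(image_info: dict) -> str:
    """Hypothetical tool: warn the model about small object scale."""
    altitude = image_info.get("altitude_m", "unknown")
    return f"Approximate flight altitude: {altitude} m; objects may appear small."


# A tiny stand-in for the tool repository, keyed by enhancement need.
TOOL_REPOSITORY: Dict[str, Callable[[dict], str]] = {
    "scene": scene_tool,
    "scale": scale_tool,
}


def select_tools(analysis: TaskAnalysis) -> List[Callable[[dict], str]]:
    """Stage 2: pick the tools that match the identified needs."""
    return [TOOL_REPOSITORY[n] for n in analysis.needs if n in TOOL_REPOSITORY]


def enhance_prompt(prompt: str, image_info: dict) -> str:
    """Stage 3: combine the analysis and tool outputs into an enhanced prompt."""
    analysis = analyze_task(prompt)
    hints = [tool(image_info) for tool in select_tools(analysis)]
    return "\n".join([f"Task type: {analysis.task_type}", *hints, f"Task: {prompt}"])


if __name__ == "__main__":
    print(enhance_prompt("Locate the red truck.", {"scene": "urban", "altitude_m": 120}))
```

The enhanced prompt, rather than the original one-line instruction, would then be passed to the VLM together with the image, so that the textual side carries more of the context needed for alignment.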
Problem

Research questions and friction points this paper is trying to address.

Simplistic task prompts give VLMs too little guidance on complex UAV imagery.
Target confusion, scale variations, and complex backgrounds hinder alignment between visual and textual tokens.
Task prompts lack multi-dimensional auxiliary information drawn from the image itself.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent framework AerialVP enhances UAV image task prompts.
Extracts multi-dimensional auxiliary information from complex UAV imagery.
Introduces benchmark AerialSense for standardized performance evaluation.
Mingning Guo
School of Geosciences and InfoPhysics, Central South University, Changsha 410083, China
Mengwei Wu
School of Geosciences and InfoPhysics, Central South University, Changsha 410083, China
Shaoxian Li
School of Geosciences and InfoPhysics, Central South University, Changsha 410083, China
Haifeng Li
Central South University
GIS, Remote sensing, Machine learning, Sparse representation, Brain Theory
Chao Tao
School of Geosciences and InfoPhysics, Central South University, Changsha 410083, China