UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited performance on low-altitude UAV vision-language understanding tasks, primarily due to the absence of domain-specific evaluation benchmarks and high-quality training data. To address this gap, this work proposes UAVBench, the first comprehensive evaluation benchmark built on real-world low-altitude UAV imagery spanning ten distinct task categories, and UAVIT-1M, a million-scale instruction-tuning dataset covering diverse weather conditions and image resolutions, with human-verified annotations to ensure label quality. Fine-tuning open-source MLLMs on UAVIT-1M significantly improves their performance, substantially narrowing the gap with proprietary models on UAVBench and demonstrating the dataset's effectiveness for domain adaptation.

📝 Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in understanding natural images and satellite remote sensing imagery. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets focus on a few specific low-altitude visual tasks and therefore cannot fully assess the abilities of MLLMs in real-world low-altitude UAV applications. We introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction-tuning dataset, designed to evaluate and improve MLLMs' abilities on low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 image-level and region-level tasks. UAVIT-1M consists of approximately 1.24 million diverse instructions spanning 11 distinct tasks, covering 789k multi-scene images and roughly 2,000 distinct spatial resolutions. Both UAVBench and UAVIT-1M feature purely real-world imagery under rich weather conditions and undergo manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs on UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M substantially narrows this gap. Our contributions pave the way toward bridging the gap between current MLLMs and the demands of real-world low-altitude UAV applications. (Project page: https://UAVBench.github.io/)
Problem

Research questions and friction points this paper is trying to address.

UAV
Multimodal Large Language Models
low-altitude vision
vision-language understanding
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

UAVBench
UAVIT-1M
Multimodal Large Language Models
low-altitude UAV vision
instruction tuning