MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?

šŸ“… 2025-12-29
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Existing evaluation benchmarks inadequately address the unique challenges of low-altitude unmanned aerial vehicle (UAV) scenarios and lack systematic assessment of multimodal large language models' (MLLMs) cross-level intelligent capabilities: perception, cognition, and planning. Method: We introduce MM-UAVBench, the first comprehensive multimodal benchmark tailored to low-altitude UAV applications. It is built on 5.7K human-annotated questions over real-world UAV images and videos, covering 19 diverse subtasks and unifying perception, cognition, and planning evaluation for the first time. Contribution/Results: Experiments on 16 state-of-the-art MLLMs show that current models are broadly inadequate for low-altitude adaptation, and MM-UAVBench identifies critical bottlenecks, including spatial bias and weak multi-view understanding. Spatial modeling fidelity and cross-view consistency emerge as primary determinants of performance. The benchmark is fully open-sourced, establishing standardized evaluation infrastructure to advance robust, UAV-specific intelligent model development.

šŸ“ Abstract
While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks, such as spatial bias and multi-view understanding, that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' perception, cognition, and planning in low-altitude UAV scenarios
Addresses the lack of a unified benchmark for UAV-specific MLLM intelligence
Identifies bottlenecks, such as spatial bias, that limit MLLMs in real-world UAV applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MM-UAVBench, a comprehensive benchmark for low-altitude UAV scenarios
Evaluates perception, cognition, and planning capabilities across 19 sub-tasks
Uses real-world UAV images and videos with over 5.7K manually annotated questions