Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack rigorous spatial intelligence evaluation for drone navigation and exhibit limited capabilities in dynamic environment understanding and navigation decision-making. To address this, the authors introduce SpatialSky-Bench, the first aerial-scene-oriented spatial intelligence benchmark, comprising 13 fine-grained tasks across two categories: environmental perception and scene understanding. Concurrently, they release SpatialSky-Dataset, a large-scale, multi-scenario dataset containing one million high-quality annotated samples. Methodologically, the approach integrates multi-granularity spatial reasoning, modular task modeling (e.g., distance estimation, altitude analysis, landing safety assessment), and end-to-end vision-language joint training. Experimental results reveal that mainstream VLMs underperform significantly on this benchmark; in contrast, Sky-VLM, a model trained on SpatialSky-Dataset, achieves state-of-the-art performance across all tasks, substantially enhancing spatial comprehension and autonomous navigation capabilities in complex airspace.

📝 Abstract
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories (Environmental Perception and Scene Understanding) divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.
Problem

Research questions and friction points this paper is trying to address.

Evaluating spatial intelligence capabilities of Vision-Language Models for UAV navigation
Addressing performance gaps in VLMs for complex aerial navigation scenarios
Developing specialized models for spatial reasoning in unmanned aerial vehicles
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpatialSky-Bench evaluates VLM spatial intelligence for UAVs
SpatialSky-Dataset provides 1M annotated samples for training
Sky-VLM model achieves state-of-the-art UAV navigation performance
Lingfeng Zhang
PhD student at Tsinghua University (embodied AI)
Yuchen Zhang
Xiaomi EV
Hongsheng Li
Tsinghua Shenzhen International Graduate School, Tsinghua University
Haoxiang Fu
National University of Singapore
Yingbo Tang
Institute of Automation, Chinese Academy of Sciences
Hangjun Ye
Xiaomi EV
Long Chen
Xiaomi EV
Xiaojun Liang
Peng Cheng Laboratory
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI) (vision and language)
Wenbo Ding
University at Buffalo (security, machine learning)