AirSpatialBot: A Spatially Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval

📅 2026-01-04

🏛️ IEEE Transactions on Geoscience and Remote Sensing

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the limited spatial reasoning of existing remote sensing vision-language models (VLMs), which struggle to recognize and retrieve fine-grained vehicle attributes in drone imagery. To bridge this gap, we introduce AirSpatial, a dataset of over 206K spatially grounded instructions and the first remote sensing grounding dataset to provide 3D bounding boxes. We propose a two-stage training strategy (image understanding pretraining followed by spatial understanding fine-tuning) and develop AirSpatialBot, an aerial agent that combines task planning, image understanding, spatial understanding, and task execution. The work introduces two new remote sensing tasks, spatial grounding (SG) and spatial question answering (SQA), and targets fine-grained vehicle attribute recognition and retrieval. Experiments validate the approach and expose the spatial comprehension limits of current VLMs. The model, code, and datasets are to be released at the project repository.

📝 Abstract

Despite notable advancements in remote sensing (RS) vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in RS, we specifically address vehicle imagery captured by drones and introduce a spatially aware dataset, AirSpatial, which comprises over 206k instructions and introduces two novel tasks: spatial grounding (SG) and spatial question answering (SQA). It is also the first RS grounding dataset to provide a 3-D bounding box (3DBB). To effectively transfer the existing image understanding of VLMs to the spatial domain, we adopt a two-stage training strategy comprising image understanding pretraining and spatial understanding fine-tuning. Utilizing this trained spatially aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot
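
For readers unfamiliar with the term, a 3-D bounding box can be illustrated with a minimal sketch. The abstract does not describe how AirSpatial encodes its 3DBB annotations, so the parameterization below (center, length/width/height, and a yaw angle about the vertical axis) is an assumption based on a common convention, and the `Box3D` class and its field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3-D box: center, size, and yaw (not the AirSpatial format)."""
    cx: float
    cy: float
    cz: float
    length: float  # extent along the box's own x axis
    width: float   # extent along the box's own y axis
    height: float  # extent along the vertical (z) axis
    yaw: float     # rotation about the vertical axis, in radians

    def corners(self):
        """Return the 8 corner points as (x, y, z) tuples in world coordinates."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        pts = []
        for dx in (-0.5, 0.5):
            for dy in (-0.5, 0.5):
                for dz in (-0.5, 0.5):
                    # Corner offset in the box's local frame.
                    x, y, z = dx * self.length, dy * self.width, dz * self.height
                    # Rotate about the vertical axis, then translate to the center.
                    pts.append((self.cx + c * x - s * y,
                                self.cy + s * x + c * y,
                                self.cz + z))
        return pts


if __name__ == "__main__":
    # A vehicle-sized box (illustrative dimensions only), rotated 30 degrees.
    box = Box3D(10.0, 5.0, 0.75, 4.0, 2.0, 1.5, math.radians(30))
    for corner in box.corners():
        print(corner)
```

Compared with a 2-D box, this representation carries the object's physical extent and heading, which is the kind of information spatial grounding and spatial question answering would need to reason over.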

Problem

Research questions and friction points this paper is trying to address.

spatial understanding

vehicle attribute recognition

remote sensing

vision-language models

aerial imagery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially-Aware VLM

3D Bounding Box (3DBB)

Spatial Grounding