AirSpatialBot: A Spatially Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval

📅 2026-01-04
🏛️ IEEE Transactions on Geoscience and Remote Sensing
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited spatial reasoning of existing remote sensing vision-language models (VLMs), which struggle to recognize and retrieve fine-grained vehicle attributes in drone imagery. To bridge this gap, we introduce AirSpatial, the first remote sensing grounding dataset annotated with 3D bounding boxes, comprising over 206K spatially grounded instructions. We propose a two-stage training strategy (image understanding pretraining followed by spatial understanding fine-tuning) and develop AirSpatialBot, an aerial agent that combines task planning with image and spatial understanding. The work introduces two new remote sensing tasks, spatial grounding and spatial question answering, and applies the agent to fine-grained vehicle attribute recognition and retrieval. Experiments validate the approach and expose the spatial understanding limitations of current VLMs. According to the abstract, the model, code, and datasets will be released publicly.

📝 Abstract
Despite notable advancements in remote sensing (RS) vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in RS, we specifically address vehicle imagery captured by drones and introduce a spatially aware dataset, AirSpatial, which comprises over 206k instructions and introduces two novel tasks: spatial grounding (SG) and spatial question answering (SQA). It is also the first RS grounding dataset to provide a 3-D bounding box (3DBB). To effectively transfer the existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising image understanding pretraining and spatial understanding fine-tuning. Utilizing this trained spatially aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot
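As a rough illustration of how task planning could route a query to image understanding, spatial understanding, or both before execution, here is a minimal Python sketch. It is not the AirSpatialBot implementation: the `Vehicle` record, the `plan` and `retrieve` functions, and the query keys `color` and `min_length_m` are all hypothetical, and a real agent would obtain these attributes from the VLM rather than from a precomputed list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Vehicle:
    """Hypothetical record for one vehicle already described by a model."""
    vehicle_id: int
    color: str        # stand-in for an image-understanding output
    length_m: float   # stand-in for a spatial-understanding output (3D box length)


Query = Dict[str, object]


def plan(query: Query) -> List[str]:
    """Task planning: select only the reasoning steps the query needs."""
    steps: List[str] = []
    if "color" in query:
        steps.append("image")
    if "min_length_m" in query:
        steps.append("spatial")
    return steps


def match_image(vehicle: Vehicle, query: Query) -> bool:
    # Appearance check (assumed query key: "color").
    return vehicle.color == query["color"]


def match_spatial(vehicle: Vehicle, query: Query) -> bool:
    # Size check against the 3D box length (assumed query key: "min_length_m").
    return vehicle.length_m >= float(query["min_length_m"])


STEP_FUNCS: Dict[str, Callable[[Vehicle, Query], bool]] = {
    "image": match_image,
    "spatial": match_spatial,
}


def retrieve(vehicles: List[Vehicle], query: Query) -> List[int]:
    """Task execution: keep vehicles that pass every planned step."""
    steps = plan(query)
    return [
        v.vehicle_id
        for v in vehicles
        if all(STEP_FUNCS[step](v, query) for step in steps)
    ]


vehicles = [
    Vehicle(1, "red", 4.5),
    Vehicle(2, "white", 11.0),
    Vehicle(3, "red", 10.5),
]
red_ids = retrieve(vehicles, {"color": "red"})
long_red_ids = retrieve(vehicles, {"color": "red", "min_length_m": 8.0})
```

In this toy setup a color-only query runs just the image step, while a query that also constrains length adds the spatial step, which is one simple way to read "adapts to diverse query requirements".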
Problem

Research questions and friction points this paper is trying to address.

spatial understanding
vehicle attribute recognition
remote sensing
vision-language models
aerial imagery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatially Aware VLM
3D Bounding Box (3DBB)
Spatial Grounding
Two-Stage Training
Aerial Agent
Authors

Yue Zhou
Associate Professor, East China Normal University
Remote Sensing Vision-Language Model, Oriented Object Detection
R. Ding
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Xue Yang
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
Xue Jiang
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Xingzhao Liu
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China