From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

📅 2025-05-06
🤖 AI Summary
Existing open-world aerial object detection methods are constrained by small-scale, coarse-grained textual annotations (word-level only), which hinders fine-grained semantic understanding. To address this, we introduce MI-OAD, a large-scale, multi-instance language-guided aerial detection dataset comprising 163,023 images and 2 million image-text pairs, approximately 40 times larger than comparable datasets. The dataset provides three levels of text supervision (word → phrase → sentence) and is built with the OS-W2S Label Engine, an automatic annotation pipeline that combines an open-source large vision-language model with image-operation-based preprocessing and BERT-based postprocessing. Training open-set detectors from the natural image domain on MI-OAD improves their zero-shot transfer to aerial imagery: with sentence-level inputs, Grounding DINO gains 29.5 AP₅₀ and 33.7 Recall@10.

📝 Abstract
In recent years, language-guided open-world aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called Multi-instance Open-set Aerial Dataset (MI-OAD), addressing the limitations of current remote sensing grounding data and enabling effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. We also employ state-of-the-art open-set methods from the natural image domain, trained on our proposed dataset, to validate the model's open-set detection capabilities. For instance, when trained on our dataset, Grounding DINO achieves improvements of 29.5 AP₅₀ and 33.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the label engine will be released publicly.
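The two metrics reported above both rest on box overlap. The sketch below illustrates one common way to compute intersection-over-union (IoU) and a Recall@k score at the 0.5 IoU threshold that the "50" in AP₅₀ refers to. It is a minimal illustration, not the paper's evaluation code: the function names `iou` and `recall_at_k`, the `(x_min, y_min, x_max, y_max)` box format, and the greedy one-to-one matching rule are all assumptions, since this summary does not specify the exact protocol.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    preds: List[Tuple[Box, float]],  # (box, confidence score) pairs
    gts: List[Box],                  # ground-truth boxes for one caption
    k: int = 10,
    iou_thr: float = 0.5,
) -> float:
    """Fraction of ground-truth boxes matched by the k highest-scoring predictions.

    Hypothetical matching rule: each prediction is greedily assigned to the
    unmatched ground-truth box with which it overlaps most, provided the
    overlap is at least iou_thr.
    """
    if not gts:
        return 0.0
    top_k = sorted(preds, key=lambda p: p[1], reverse=True)[:k]
    matched = set()
    for box, _score in top_k:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
    return len(matched) / len(gts)
```

Because a single caption in a multi-instance setting can refer to several objects, the sketch computes recall over all ground-truth boxes tied to that caption rather than over a single target box.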
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale datasets for language-guided aerial detection
Limited fine-grained open-world detection capabilities
Need for automatic annotation of diverse aerial scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale language-guided open-set aerial dataset
OS-W2S Label Engine for automatic annotation
Integration of a large vision-language model with image-operation-based preprocessing and BERT-based postprocessing
Guoting Wei
Nanjing University of Science and Technology, Intellifusion Inc.
Yu Liu
Zhejiang Lab
Xia Yuan
Nanjing University of Science and Technology
Xizhe Xue
Technical University of Munich
AI4EO, VLM, Aerial object tracking, HSI classification, Open world vision
Linlin Guo
Beijing University of Posts and Telecommunications
Yifan Yang
Intellifusion Inc.
Chunxia Zhao
Nanjing University of Science and Technology
Zongwen Bai
Yan’an University
Haokui Zhang
Northwestern Polytechnical University
Approximate nearest neighbor search, Neural architecture search, Depth estimation, HSI classification
Rong Xiao
Intellifusion Inc.