🤖 AI Summary
Existing open-world aerial object detection methods are constrained by small-scale datasets with coarse, word-level textual annotations, which hinders fine-grained semantic understanding. To address this, we introduce MI-OAD, the first large-scale, multi-instance language-guided aerial detection dataset, comprising 163,023 images and 2 million image-text pairs, approximately 40× larger than comparable datasets. We propose three levels of text supervision (word → phrase → sentence) and design the OS-W2S Label Engine, an automated annotation pipeline that combines an open-source large vision-language model with image-operation-based preprocessing and BERT-based postprocessing. Training on MI-OAD improves open-set detectors from the natural image domain under zero-shot transfer: with sentence-level inputs, Grounding DINO gains 29.5 AP₅₀ and 33.7 Recall@10, addressing both the data-scale and semantic-granularity bottlenecks in remote sensing vision-language grounding.
📝 Abstract
In recent years, language-guided open-world aerial object detection has gained significant attention because it aligns better with real-world application needs. However, due to limited datasets, most existing language-guided methods focus primarily on word-level (vocabulary) guidance, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we construct a large-scale language-guided open-set aerial detection dataset that encompasses three levels of language guidance: words, phrases, and sentences. We present the OS-W2S Label Engine, an automatic annotation pipeline built around an open-source large vision-language model and combining image-operation-based preprocessing with BERT-based postprocessing, which can annotate diverse aerial scenes. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, the Multi-instance Open-set Aerial Dataset (MI-OAD), which addresses the limitations of current remote sensing grounding data and enables effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, making it approximately 40 times larger than comparable datasets. We also train state-of-the-art open-set methods from the natural image domain on our dataset to validate the open-set detection capabilities it enables. For instance, when trained on our dataset, Grounding DINO improves by 29.5 AP₅₀ and 33.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the label engine will be released publicly.