🤖 AI Summary
Existing referring expression detection research is largely confined to ground-level images and struggles with the challenges posed by aerial scenes, such as highly variable object scales, complex backgrounds, and dense distractors. To bridge this gap, this work introduces RefAerial, a large-scale benchmark for referring expression detection in aerial imagery, along with a scale-comprehensive and sensitive (SCS) detection framework. The SCS framework combines a mixture-of-granularity (MoG) attention with a two-stage comprehensive-to-sensitive (CtS) decoding strategy to model multi-scale targets. The work also develops REA-Engine, a human-in-the-loop semi-automatic annotation engine, to make annotation of referring pairs more efficient. Experiments indicate that SCS achieves strong performance on RefAerial and also improves performance on conventional ground-level datasets.
📝 Abstract
Referring detection aims to locate the target referred to by a natural-language expression, and it has recently attracted growing research interest. However, existing datasets are limited to ground-level images in which large objects are centered in relatively small scenes. This paper introduces RefAerial, a large-scale, challenging dataset for referring detection in aerial images. It differs from conventional ground referring detection datasets in four characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, and (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated annotation of referring pairs. Moreover, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset because of the intrinsic scale variation within or across aerial images. We therefore propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the MoG attention is developed for scale-comprehensive target understanding, while the two-stage CtS decoding strategy is designed for coarse-to-fine decoding of the referring target. The proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and also delivers a promising performance boost on conventional ground referring detection datasets.
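To make the first characteristic concrete, the sketch below computes an object-to-scene ratio for a single bounding box. The abstract does not define this quantity, so the definition used here (box area divided by image area) is an assumption, as are the function name `object_to_scene_ratio` and the example boxes and image sizes, which are invented for illustration and are not taken from RefAerial.

```python
def object_to_scene_ratio(box, image_width, image_height):
    """Return the fraction of the image area covered by a bounding box.

    Assumed definition (not from the paper): box area / image area.
    `box` is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    # Clamp to zero so a degenerate or inverted box yields a ratio of 0.
    box_w = max(0.0, x_max - x_min)
    box_h = max(0.0, y_max - y_min)
    return (box_w * box_h) / (image_width * image_height)


# Hypothetical ground-level example: a large, centered object in a small image.
ground = object_to_scene_ratio((160, 120, 480, 360), 640, 480)

# Hypothetical aerial example: a small object in a high-resolution wide scene.
aerial = object_to_scene_ratio((1000, 800, 1040, 830), 3840, 2160)

print(f"ground ratio: {ground:.4f}")
print(f"aerial ratio: {aerial:.6f}")
```

Under this assumed definition, the ground-level box covers a quarter of its image, while the aerial box covers well under a thousandth of its scene, which is the kind of gap the "low but diverse object-to-scene ratios" characteristic points to.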