🤖 AI Summary
To address the twin challenges of low detection accuracy and efficiency caused by extreme object sparsity and severe scale variation in high-resolution wide (HRW) images, this paper proposes a model-agnostic sparse Vision Transformer architecture, SparseFormer. The method introduces three key innovations: (1) a selective token activation mechanism that models only the windows likely to contain objects, drastically reducing computational redundancy; (2) cross-slice non-maximum suppression (C-NMS) to remove the duplicate detections that arise at the boundaries introduced by image slicing; and (3) coarse-to-fine attention with multi-scale feature fusion to improve small-object perception under huge scale changes. Evaluated on the PANDA and DOTA-v1.0 benchmarks, the approach improves mean Average Precision (mAP) by up to 5.8% while running up to 3x faster than state-of-the-art methods. To the best of our knowledge, this is the first work to achieve both high accuracy and high efficiency for sparse object detection in HRW scenarios.
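The selective token activation idea can be illustrated with a minimal sketch: score each window of the image, then process only the top fraction. This is not the paper's implementation; the function name, the scoring input, and the `keep_ratio` value are all illustrative assumptions.

```python
import numpy as np

def select_active_windows(scores, keep_ratio=0.25):
    """Return a boolean mask over a (H, W) grid of per-window objectness
    scores (hypothetical scoring head), keeping only the top fraction.
    Windows outside the mask would be skipped by the transformer."""
    flat = scores.ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    top = np.argsort(flat)[::-1][:k]       # indices of the k best windows
    mask = np.zeros(flat.size, dtype=bool)
    mask[top] = True
    return mask.reshape(scores.shape)

# Toy 4x4 grid of window scores: only a quarter of the windows are modeled.
scores = np.arange(16, dtype=float).reshape(4, 4)
mask = select_active_windows(scores, keep_ratio=0.25)
print(int(mask.sum()))  # 4 of 16 windows activated
```

In a sparse HRW scene most windows contain only background, so skipping the inactive windows is where the claimed efficiency gain would come from.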
📝 Abstract
Recent years have seen increasing use of gigapixel-level image and video capture systems, along with benchmarks built on high-resolution wide (HRW) shots. Unlike the close-up shots in the MS COCO dataset, however, the higher resolution and wider field of view raise unique challenges, such as extreme object sparsity and huge scale changes, which make existing close-up detectors both inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly exploit global and local attention, fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm that precisely localizes objects from noisy windows, and from a simple yet effective multi-scale strategy that further improves accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that SparseFormer significantly improves detection accuracy (by up to 5.8%) and speed (by up to 3x) over state-of-the-art approaches.
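The cross-slice duplicate problem can be sketched as follows: when a gigapixel image is cut into overlapping slices, an object near a slice border is detected twice. A minimal C-NMS-style fix is to shift every slice's boxes into global image coordinates and run one NMS pass over the union. This is a sketch under assumed names and box formats, not the paper's C-NMS algorithm.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS over boxes in (x1, y1, x2, y2) format."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection-over-union of the best box with the remaining ones.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]
    return keep

def cross_slice_nms(per_slice_boxes, per_slice_scores, slice_offsets, iou_thr=0.5):
    """Shift each slice's boxes by that slice's (ox, oy) offset into global
    coordinates, then run a single NMS pass so duplicates straddling slice
    borders are suppressed."""
    all_boxes, all_scores = [], []
    for boxes, scores, (ox, oy) in zip(per_slice_boxes, per_slice_scores, slice_offsets):
        b = np.asarray(boxes, dtype=float)
        b[:, [0, 2]] += ox
        b[:, [1, 3]] += oy
        all_boxes.append(b)
        all_scores.append(np.asarray(scores, dtype=float))
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]

# Two overlapping slices detect the same boundary object once each;
# after cross-slice NMS only the higher-scoring detection survives.
slice_a = [[90, 10, 110, 30]]   # slice at offset (0, 0)
slice_b = [[12, 11, 31, 29]]    # slice at offset (80, 0), same object
boxes, scores = cross_slice_nms([slice_a, slice_b], [[0.9], [0.8]], [(0, 0), (80, 0)])
print(len(boxes))  # 1
```

Running NMS per slice instead would miss these duplicates, since each slice sees only one of the two boxes; merging into global coordinates first is the essential step.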