AI Summary
Existing object detection and segmentation methods rely heavily on multi-scale feature pyramids, leading to architectural complexity and inefficient inference. To address this, we propose SimPLR, the first pure Transformer framework that completely eliminates feature pyramids and operates exclusively on a single-scale feature representation. Its core innovation lies in internalizing multi-scale priors directly into the attention mechanism via a novel Scale-aware Attention module, coupled with a single-scale Vision Transformer backbone and a unified head for detection, instance segmentation, and panoptic segmentation. Furthermore, SimPLR leverages self-supervised pretraining to enhance generalization. Experiments demonstrate that SimPLR consistently surpasses state-of-the-art multi-scale methods on COCO and ADE20K while achieving significantly faster inference. Notably, performance gains scale favorably with increased model capacity and pretraining data volume, empirically validating the scalability and efficiency of the single-scale architecture.
Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor in their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector, "SimPLR", whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with higher-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.
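To make the idea of moving the multi-scale bias into attention more concrete, here is a minimal toy sketch in NumPy. It is not the paper's scale-aware attention module, whose details are not given in the abstract. It assumes a much simpler stand-in: every attention head works on the same single-scale feature map, but each head may only attend within a square window of a different size. The function names `window_mask` and `multi_window_attention`, the `radii` parameter, and the omission of learned projections are all illustrative assumptions.

```python
import numpy as np


def window_mask(height, width, radius):
    """Boolean (N, N) mask: True where the key lies within `radius` cells of the query."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (N, 2) grid positions
    # Chebyshev distance between every pair of grid positions.
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=-1)
    return dist <= radius


def multi_window_attention(features, height, width, radii):
    """Toy attention in which head h only sees a square window of radius radii[h].

    features: (N, D) array with N == height * width and D divisible by len(radii).
    Returns an (N, D) array formed by concatenating the per-head outputs.
    """
    n, d = features.shape
    assert n == height * width and d % len(radii) == 0
    head_dim = d // len(radii)
    outputs = []
    for h, radius in enumerate(radii):
        # Assumption for brevity: no learned projections; each head uses its
        # own slice of the channels as query, key, and value.
        x = features[:, h * head_dim:(h + 1) * head_dim]
        scores = x @ x.T / np.sqrt(head_dim)
        # Keys outside this head's window are masked out before the softmax.
        scores = np.where(window_mask(height, width, radius), scores, -np.inf)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        outputs.append(weights @ x)
    return np.concatenate(outputs, axis=1)
```

For example, on a 4x4 feature map, `radii=[0, 1, 4]` gives one head that sees only its own position, one that sees its immediate neighbours, and one that sees the whole map, so fine and coarse context are mixed without building a feature pyramid.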