SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation

πŸ“… 2023-10-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing object detection and segmentation methods rely heavily on multi-scale feature pyramids, leading to architectural complexity and inefficient inference. To address this, we propose SimPLRβ€”the first pure Transformer framework that completely eliminates feature pyramids and operates exclusively on a single-scale feature representation. Its core innovation lies in internalizing multi-scale priors directly into the attention mechanism via a novel Scale-aware Attention module, coupled with a single-scale Vision Transformer backbone and a unified head for detection, instance segmentation, and panoptic segmentation. Furthermore, SimPLR leverages self-supervised pretraining to enhance generalization. Experiments demonstrate that SimPLR consistently surpasses state-of-the-art multi-scale methods on COCO and ADE20K while achieving significantly faster inference. Notably, performance gains scale favorably with increased model capacity and pretraining data volume, empirically validating the scalability and efficiency of the single-scale architecture.
πŸ“ Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector `SimPLR' whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with bigger-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, as well as panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.
Problem

Research questions and friction points this paper is trying to address.

Multi-scale feature pyramids add architectural complexity to object detection and segmentation models.
Can single-scale features suffice if scale awareness is moved into the attention mechanism?
Multi-scale designs slow down inference; accuracy and runtime efficiency both need improvement.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale-aware attention mechanism replaces multi-scale feature maps.
Non-hierarchical backbone and detection head operate on single-scale features.
Competitive accuracy and faster runtime with self-supervised pre-training.
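The core idea above is that scale variation is handled inside attention instead of by a feature pyramid. The paper's actual mechanism builds on box/deformable-style sampling; as a hedged illustration only, the sketch below (plain numpy, not the authors' code) shows one simple way to make attention scale-aware: each head average-pools the single-scale key/value map by a different factor, so different heads see the same map at coarser or finer granularity. All function names, the pooling factors, and the random toy projections are assumptions for illustration.

```python
# Hypothetical sketch of scale-aware attention over a single-scale map
# (illustrative only; the paper uses a box-attention-style mechanism).
import numpy as np

def avg_pool2d(x, k):
    """Average-pool an (H, W, C) map by factor k (H, W divisible by k)."""
    H, W, C = x.shape
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def scale_aware_attention(feat, q, head_scales, rng):
    """feat: (H, W, C) single-scale feature map; q: (N, C) object queries.
    head_scales: one pooling factor per head, e.g. (1, 2, 4)."""
    H, W, C = feat.shape
    d = C // len(head_scales)          # per-head dimension
    outs = []
    for s in head_scales:
        kv = avg_pool2d(feat, s).reshape(-1, C)        # tokens at this scale
        Wq = rng.standard_normal((C, d)) / np.sqrt(C)  # toy projections
        Wk = rng.standard_normal((C, d)) / np.sqrt(C)
        Wv = rng.standard_normal((C, d)) / np.sqrt(C)
        attn = softmax((q @ Wq) @ (kv @ Wk).T / np.sqrt(d))
        outs.append(attn @ (kv @ Wv))                  # (N, d) per head
    return np.concatenate(outs, axis=-1)               # (N, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 24))   # one single-scale feature map
q = rng.standard_normal((5, 24))           # 5 object queries
out = scale_aware_attention(feat, q, head_scales=(1, 2, 4), rng=rng)
print(out.shape)  # (5, 24)
```

Because every head reads from the same single-scale map, no pyramid has to be built or fused, which is the source of the runtime savings the paper reports.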
πŸ”Ž Similar Papers
No similar papers found.