AI Summary
Existing feature pyramid networks struggle to model discriminative multi-scale features for dense visual prediction and underperform on small objects in particular. This work proposes A3-FPN (Asymptotic Content-Aware Pyramid Attention Network), which asymptotically disentangles each pyramid level while enabling global feature interaction, and adds content-aware attention modules to strengthen feature representation. In the fusion stage, the method uses content from the adjacent level to predict position-wise offsets and weights for context-aware resampling; in the reassembly stage, it reassembles redundant features according to the information content and spatial variation of the feature maps. A3-FPN integrates with both CNN and Transformer-based architectures and yields gains on MS COCO, VisDrone2019-DET, and Cityscapes: paired with OneFormer and a Swin-L backbone, it achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes.
Abstract
Learning multi-scale representations is a common strategy for tackling object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to augment multi-scale feature representation via an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on the information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and a Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.
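To make the idea of resampling with position-wise offsets and weights more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the offsets and weights are already given as arrays, whereas in A3-FPN they are predicted from the content of adjacent levels. The function names `bilinear_sample` and `resample_and_fuse`, the array layouts, and the simple convex blend used for fusion are all illustrative assumptions.

```python
import numpy as np

def bilinear_sample(feat, ys, xs):
    """Sample feat (C, h, w) at fractional coordinates ys, xs (each H, W)."""
    _, h, w = feat.shape
    # Keep sampling positions inside the feature map.
    ys = np.clip(ys, 0.0, h - 1.0)
    xs = np.clip(xs, 0.0, w - 1.0)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def resample_and_fuse(fine, coarse, offsets, weights):
    """Hypothetical fusion step (an assumption, not the paper's exact module).

    fine:    (C, H, W) higher-resolution level
    coarse:  (C, h, w) adjacent lower-resolution level
    offsets: (2, H, W) per-position (dy, dx) shifts in coarse-pixel units
    weights: (H, W) per-position blend weights in [0, 1]
    """
    _, H, W = fine.shape
    _, h, w = coarse.shape
    # Base grid: where each fine position falls on the coarse map.
    gy, gx = np.meshgrid(
        np.linspace(0, h - 1, H), np.linspace(0, w - 1, W), indexing="ij"
    )
    # Shift each sampling position by its offset before interpolating.
    sampled = bilinear_sample(coarse, gy + offsets[0], gx + offsets[1])
    # Blend the resampled coarse features with the fine features.
    return weights * sampled + (1.0 - weights) * fine

# Usage: upsample a 2x2 horizontal ramp to 3x3 with zero offsets.
fine = np.zeros((1, 3, 3))
coarse = np.array([[[0.0, 1.0], [0.0, 1.0]]])
offsets = np.zeros((2, 3, 3))
weights = np.ones((3, 3))
fused = resample_and_fuse(fine, coarse, offsets, weights)
```

With zero offsets and unit weights this reduces to plain bilinear upsampling; non-zero offsets let each output position pull features from a shifted location, and weights below one keep part of the fine-level feature. How the offsets and weights are actually learned is described in the paper and the linked repository.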