🤖 AI Summary
To address the high computational cost of attention mechanisms and the difficulty of balancing spatial fidelity with efficiency under resource constraints, this paper proposes the Pyramid Sparse Transformer (PST), a lightweight, plug-and-play multi-scale feature fusion module. PST couples a coarse-to-fine dynamic token selection strategy with shared attention parameters and adopts a two-stage scheme: training uses only coarse-grained attention, while fine-grained refinement can be switched on at inference, without retraining, for an immediate accuracy gain. Designed for hardware efficiency and modularity, PST integrates easily into existing architectures. On YOLOv11 variants it improves COCO mAP by 0.4–0.9%; as a backbone enhancement for ResNet-18, -50, and -101 it boosts ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively, with negligible inference latency overhead.
📝 Abstract
Feature fusion is critical for high-performance vision models, but prevailing attention-based fusion methods incur significant computational complexity and implementation challenges, limiting their efficiency in resource-constrained environments. To address these issues, we introduce the Pyramid Sparse Transformer (PST), a lightweight, plug-and-play module that combines coarse-to-fine token selection with shared attention parameters to reduce computation while preserving spatial detail. PST can be trained using coarse attention alone and seamlessly activated at inference for further accuracy gains without retraining. When added to state-of-the-art real-time detection models such as YOLOv11-N/S/M, PST yields mAP improvements of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency impact. Likewise, embedding PST into ResNet-18/50/101 backbones boosts ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results demonstrate PST's effectiveness as a simple, hardware-friendly enhancement for both detection and classification tasks.
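The coarse-to-fine idea in the abstract can be sketched roughly as follows: a minimal single-head NumPy toy, not the authors' implementation. The grid size, pooling factor, top-k value, and the averaging used to fuse the two stages are all illustrative assumptions; only the overall pattern (coarse attention over pooled tokens, then optional fine attention restricted to the most salient regions, with the same projection weights reused in both stages) follows the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: an H x W feature map with C channels, flattened to tokens.
H = W = 8
C = 16
pool = 2      # each coarse token average-pools a 2x2 patch of fine tokens
k_top = 4     # number of coarse regions kept for the fine stage (assumption)

x = rng.standard_normal((H * W, C))

# Projection weights shared by the coarse and fine stages
Wq = rng.standard_normal((C, C)) / np.sqrt(C)
Wk = rng.standard_normal((C, C)) / np.sqrt(C)
Wv = rng.standard_normal((C, C)) / np.sqrt(C)

# Coarse tokens via average pooling of the fine grid
patches = x.reshape(H // pool, pool, W // pool, pool, C)
coarse = patches.mean(axis=(1, 3)).reshape(-1, C)        # (16, C)

q = x @ Wq                                               # queries from fine tokens
k_c, v_c = coarse @ Wk, coarse @ Wv

# Stage 1: coarse attention only -- per the paper, this is all that is
# needed during training.
attn_c = softmax(q @ k_c.T / np.sqrt(C))
out_coarse = attn_c @ v_c

# Stage 2 (activated at inference, no retraining): select the top-k most
# attended coarse regions and attend to their underlying fine tokens,
# reusing the same Wk/Wv projections.
top = np.argsort(-attn_c.mean(axis=0))[:k_top]
gw = W // pool
sel = []
for r in top:
    ci, cj = divmod(r, gw)
    for di in range(pool):
        for dj in range(pool):
            sel.append((ci * pool + di) * W + (cj * pool + dj))
sel = np.array(sel)

k_f, v_f = x[sel] @ Wk, x[sel] @ Wv
attn_f = softmax(q @ k_f.T / np.sqrt(C))
out = 0.5 * (out_coarse + attn_f @ v_f)  # simple fusion (assumption)

print(out.shape)  # one refined feature vector per fine token
```

The efficiency argument is visible in the shapes: the fine stage attends to only `k_top * pool * pool` tokens (16 here) instead of all 64, and no new parameters are introduced beyond the shared projections.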