🤖 AI Summary
To address the limited multi-scale feature extraction and dynamic fusion capabilities of Transformer-based models in chest X-ray pneumonia detection, this paper proposes a lightweight real-time detection framework. Methodologically, it introduces three key innovations: (1) a cross-gated fusion mechanism for adaptive inter-scale feature interaction; (2) novel modules—XFABlock, SPGA, and GCFC3—that jointly enhance multi-scale representation learning and efficient information aggregation; and (3) a synergistic integration of convolutional attention, Cross-Stage Partial (CSP) architecture, single-head self-attention, and structural re-parameterization across the backbone, neck, and detection head. Evaluated on the RSNA dataset, the model achieves mAP@0.5 = 82.2% (+3.7% over the baseline), mAP@[0.5:0.95] = 50.4%, and an inference speed of 48.1 FPS, striking a favorable balance between detection accuracy and real-time performance.
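The summary pairs single-head self-attention with a dynamic gating mechanism (the SPGA module). The exact formulation is not given here, so the following is only a rough sketch of that combination; all names, shapes, and the sigmoid gate are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_single_head_attention(x, wq, wk, wv, wg):
    """x: (tokens, dim). A single set of Q/K/V projections replaces
    multi-head attention; a learned gate modulates the attended output."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (tokens, tokens)
    out = attn @ v                                   # attended features
    gate = sigmoid(x @ wg)                           # per-token dynamic gate
    return gate * out

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))
wq, wk, wv, wg = (rng.standard_normal((d, d)) for _ in range(4))
y = gated_single_head_attention(x, wq, wk, wv, wg)
assert y.shape == x.shape
```

Compared with multi-head attention, a single head with gating keeps one projection set per tensor, which is consistent with the paper's lightweight, real-time design goal.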
📝 Abstract
Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with the CSP architecture. To achieve efficient feature aggregation, we propose the SPGA module, which replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, we design GCFC3 for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining a comparable inference speed of 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
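The abstract's claim that GCFC3 keeps real-time performance "via structural re-parameterization" rests on a standard identity: parallel convolution branches of different kernel sizes, trained jointly, can be algebraically merged into a single kernel at inference time. A minimal 1-D numpy sketch of that identity (branch sizes and values are illustrative, not taken from the paper) looks like this:

```python
import numpy as np

def conv1d_same(x, k):
    """1-D cross-correlation with zero padding ('same' output length)."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

# Two parallel branches, as in a multi-path convolution block
k3 = np.array([0.2, 0.5, -0.1])   # 3-tap branch
k1 = np.array([0.7])              # 1-tap branch

x = np.random.default_rng(1).standard_normal(16)
multi_branch = conv1d_same(x, k3) + conv1d_same(x, k1)

# Re-parameterize: embed the 1-tap kernel at the centre of a 3-tap kernel,
# collapsing both branches into a single convolution for inference
k_merged = k3 + np.array([0.0, k1[0], 0.0])
single_branch = conv1d_same(x, k_merged)

assert np.allclose(multi_branch, single_branch)
```

The merged model computes exactly the same function with one convolution per layer, which is why re-parameterized blocks can add training-time capacity without an inference-time FPS cost.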