🤖 AI Summary
To address the challenge of detecting distant, small, and low-contrast targets in infrared imagery against complex backgrounds, this paper proposes YOLO-MST, an end-to-end deep learning framework. Methodologically, it integrates super-resolution preprocessing with multi-scale deep learning and introduces a novel Multi-Scale Feature Adaptation (MSFA) module. The architecture reconstructs the YOLOv5 backbone and neck to enable adaptive cross-scale feature fusion and incorporates a dynamic multi-scale detection head. Experimental results demonstrate state-of-the-art performance: mAP@0.5 reaches 96.4% on SIRST and 99.5% on IRIS, significantly outperforming existing methods. YOLO-MST substantially reduces both miss-detection and false-alarm rates while exhibiting superior robustness, detection accuracy, and generalization across diverse infrared scenarios.
📝 Abstract
With the advancement of aerospace technology and the growing demands of military applications, low-false-alarm, high-precision infrared small target detection has become a key research focus worldwide. Traditional model-driven methods, however, are not robust to variations in noise, target size, and contrast, while existing deep-learning methods have limited ability to extract and fuse key features, making high-precision detection difficult in complex backgrounds or when target features are weak. To address these problems, this paper proposes a deep-learning infrared small target detection method that combines image super-resolution with multi-scale observation. First, the input infrared images are preprocessed with super-resolution and several data augmentations. Second, building on the YOLOv5 model, we propose a new deep-learning network named YOLO-MST: the SPPF module in the backbone is replaced with a self-designed MSFA module, the neck is optimized, and a multi-scale dynamic detection head is added to the prediction head. By dynamically fusing features from different scales, the detection head adapts better to complex scenes. The mAP@0.5 of this method reaches 96.4% and 99.5% on two public datasets, SIRST and IRIS, respectively, alleviating missed detections, false alarms, and low precision more effectively than existing methods.
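The abstract does not detail the internals of the MSFA module or the dynamic detection head. As a rough illustration of the underlying idea only (fusing features pooled at several receptive-field scales with normalized weights), the sketch below uses NumPy average pooling and a softmax over per-branch weights; all function names, the choice of pooling, and the weighting scheme are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def avg_pool2d(x, k):
    """Stride-1 average pooling with edge padding (output keeps input shape)."""
    h, w = x.shape
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def msfa_sketch(feat, scales=(1, 3, 5), weights=None):
    """Hypothetical multi-scale fusion: pool the feature map at several
    scales, then combine the branches with softmax-normalized weights.
    (Illustrative stand-in for an MSFA-style module, not the published one.)"""
    branches = [feat if k == 1 else avg_pool2d(feat, k) for k in scales]
    w = np.ones(len(scales)) if weights is None else np.asarray(weights, float)
    w = np.exp(w - w.max())
    w /= w.sum()  # softmax over branch weights
    return sum(wi * b for wi, b in zip(w, branches))
```

In a trained network the branch weights would be predicted from the input rather than fixed, which is what lets a dynamic head adapt its effective receptive field to each scene.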