Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

📅 2026-03-06

🤖 AI Summary
This study addresses small-object detection in remote sensing imagery, where limited object scale, weak textures, and complex backgrounds degrade the performance of general-purpose detectors. It proposes ESM-YOLO+, a lightweight visible-infrared fusion network that fuses and aligns the two modalities at the pixel level through Mask-Enhanced Attention Fusion (MEAF), and that improves feature discriminability through a Structural Representation (SR) enhancement strategy applied only during training, so it adds no inference overhead. On the VEDAI and DroneVehicle datasets, the method reports mAP scores of 84.71% and 74.0%, respectively, with 93.6% fewer parameters and 68.0% lower GFLOPs than the baseline. The authors report that it outperforms existing approaches.
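
To put the reported reductions in more familiar terms, the short calculation below converts "X% fewer" into the fraction that remains and the corresponding shrink factor. The helper name `remaining_and_factor` is purely illustrative. The summary gives no absolute parameter counts or GFLOPs, so only the ratios are computed.

```python
def remaining_and_factor(reduction_pct):
    """Convert an 'X% fewer' figure into (remaining fraction, shrink factor)."""
    remaining = 1.0 - reduction_pct / 100.0
    return remaining, 1.0 / remaining


# Percentages taken from the summary above; both are relative to the baseline.
params_left, params_factor = remaining_and_factor(93.6)
flops_left, flops_factor = remaining_and_factor(68.0)

print(f"parameters: {params_left:.1%} of baseline, about {params_factor:.1f}x smaller")
print(f"GFLOPs:     {flops_left:.1%} of baseline, about {flops_factor:.1f}x lower")
```

In other words, the model keeps roughly 6.4% of the baseline's parameters and 32% of its GFLOPs, which is about a 15.6x and a 3.1x reduction, respectively.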

๐Ÿ“ Abstract
Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, which makes high-precision detection difficult for general-purpose algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+, a lightweight visible-infrared fusion network. To enhance detection, ESM-YOLO+ includes two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, effectively aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision to preserve fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate ESM-YOLO+'s superiority. The model achieves 84.71% mAP on VEDAI and 74.0% mAP on DroneVehicle, while greatly reducing model complexity, with 93.6% fewer parameters and 68.0% lower GFLOPs than the baseline. These results confirm that ESM-YOLO+ combines strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.
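
The abstract does not spell out the internals of MEAF, so the following is only a minimal NumPy sketch of the general idea it describes: a learnable per-pixel mask decides how much of the RGB and infrared features to keep at each location, and a spatial attention map then reweights the fused result. Everything here is an assumption made for illustration, including the function name `mask_gated_fusion`, the 1x1-convolution mask, and the parameter-free attention (a simplification of CBAM-style spatial attention). It should not be read as the authors' module.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mask_gated_fusion(rgb, ir, w_mask, b_mask=0.0):
    """Fuse two (C, H, W) feature maps with a per-pixel mask and spatial attention.

    Hypothetical stand-in for a mask-plus-attention fusion block.
    w_mask: (2C,) weights of a 1x1 convolution that produces the mask logits.
    Returns (fused features of shape (C, H, W), mask of shape (H, W)).
    """
    if rgb.shape != ir.shape:
        raise ValueError("RGB and infrared feature maps must share a shape")

    stacked = np.concatenate([rgb, ir], axis=0)  # (2C, H, W)

    # A 1x1 convolution is a per-pixel dot product over the channel axis.
    logits = np.tensordot(w_mask, stacked, axes=([0], [0])) + b_mask  # (H, W)
    mask = sigmoid(logits)  # values in (0, 1), one per pixel

    # Pixel-level fusion: mask -> RGB weight, (1 - mask) -> infrared weight.
    fused = mask * rgb + (1.0 - mask) * ir  # mask broadcasts over channels

    # Simplified spatial attention from channel-wise mean and max statistics
    # (assumption: no learned convolution here, to keep the sketch short).
    attention = sigmoid(fused.mean(axis=0) + fused.max(axis=0))  # (H, W)
    return fused * attention, mask


# Usage with random stand-in features (4 channels, 8x8 spatial grid).
rng = np.random.default_rng(0)
rgb_feat = rng.normal(size=(4, 8, 8))
ir_feat = rng.normal(size=(4, 8, 8))
weights = rng.normal(size=8) * 0.1
fused_feat, fusion_mask = mask_gated_fusion(rgb_feat, ir_feat, weights)
print(fused_feat.shape, fusion_mask.shape)
```

In a trained network the mask weights would be learned end to end with the detector. Here they are random, so the sketch only demonstrates the data flow and the shapes involved.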
Problem

Research questions and friction points this paper is trying to address.

small target detection
remote sensing images
visible-infrared fusion
complex background
cross-modal misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-Enhanced Attention Fusion
Visible-Infrared Fusion
Small Target Detection
Structural Representation Enhancement
Lightweight Network
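
The "no extra inference cost" property of Structural Representation enhancement can be illustrated with a generic training-only auxiliary branch. The sketch below is an assumption-laden illustration, not the paper's SR design. The structure target (a finite-difference gradient magnitude), the single-channel 1x1 auxiliary head, the loss weight `lam`, and all function names are hypothetical choices made to show the pattern.

```python
import numpy as np


def edge_map(img):
    """Finite-difference gradient magnitude, used as a stand-in structure target."""
    img = np.asarray(img, dtype=float)
    gy = np.zeros_like(img)
    gx = np.zeros_like(img)
    gy[1:, :] = img[1:, :] - img[:-1, :]
    gx[:, 1:] = img[:, 1:] - img[:, :-1]
    return np.sqrt(gx ** 2 + gy ** 2)


def forward(features, aux_weights, training):
    """Return (detection features, auxiliary prediction or None).

    features: (C, H, W); aux_weights: (C,) weights of a 1x1 auxiliary head.
    The auxiliary branch runs only when training is True.
    """
    det_out = features  # placeholder for the real detection head
    if not training:
        return det_out, None  # auxiliary branch skipped: no inference cost
    aux_pred = np.tensordot(aux_weights, features, axes=([0], [0]))  # (H, W)
    return det_out, aux_pred


def total_loss(det_loss, aux_pred, structure_target, lam=0.1):
    """Detection loss plus a weighted mean-squared structure loss."""
    aux_loss = np.mean((aux_pred - structure_target) ** 2)
    return det_loss + lam * aux_loss


# Usage: the auxiliary term exists during training and vanishes at inference.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 4))
target = edge_map(rng.normal(size=(4, 4)))
_, aux = forward(feats, rng.normal(size=3), training=True)
print(total_loss(1.0, aux, target))
print(forward(feats, rng.normal(size=3), training=False)[1])
```

Because the auxiliary head only contributes a loss term, it can be dropped once training ends, leaving the deployed network unchanged.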