Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the detection of small objects in remote sensing imagery, where limited scale, weak textures, and interference from complex backgrounds often degrade the performance of general-purpose detection algorithms. To overcome these limitations, this work proposes ESM-YOLO+, a lightweight visible-infrared fusion network that achieves cross-modal pixel-level alignment through Mask-Enhanced Attention Fusion (MEAF) and enhances feature discriminability via a Structural Representation (SR) augmentation strategy applied during training, without introducing additional inference overhead. Evaluated on the VEDAI and DroneVehicle datasets, the proposed method achieves mAP scores of 84.71% and 74.0%, respectively, while reducing model parameters by 93.6% and computational cost by 68.0%, significantly outperforming existing approaches.

📝 Abstract
Targets in remote sensing images are usually small, weakly textured, and easily disturbed by complex backgrounds, challenging high-precision detection with general algorithms. Building on our earlier ESM-YOLO, this work presents ESM-YOLO+ as a lightweight visible-infrared fusion network. To enhance detection, ESM-YOLO+ includes two key innovations. (1) A Mask-Enhanced Attention Fusion (MEAF) module fuses features at the pixel level via learnable spatial masks and spatial attention, effectively aligning RGB and infrared features, enhancing small-target representation, and alleviating cross-modal misalignment and scale heterogeneity. (2) Training-time Structural Representation (SR) enhancement provides auxiliary supervision to preserve fine-grained spatial structures during training, boosting feature discriminability without extra inference cost. Extensive experiments on the VEDAI and DroneVehicle datasets validate ESM-YOLO+'s superiority. The model achieves 84.71% mAP on VEDAI and 74.0% mAP on DroneVehicle, while greatly reducing model complexity, with 93.6% fewer parameters and 68.0% lower GFLOPs than the baseline. These results confirm that ESM-YOLO+ combines strong performance with practicality for real-time deployment, providing an effective solution for high-performance small-target detection in complex remote sensing scenes.
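The abstract describes MEAF as fusing RGB and infrared features at the pixel level with learnable spatial masks plus spatial attention. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of that general idea, assuming a sigmoid-gated per-pixel mask that weighs the two modalities and a pooled-statistics spatial-attention map (a stand-in for the module's learned convolution); the function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meaf_fuse(rgb_feat, ir_feat, mask_logits):
    """Hypothetical sketch of mask-enhanced attention fusion.

    rgb_feat, ir_feat: (C, H, W) feature maps from the two modalities.
    mask_logits: (H, W) learnable spatial-mask logits (assumed form).
    """
    # Learnable spatial mask in [0, 1]: per-pixel trust in the RGB branch.
    m = sigmoid(mask_logits)                               # (H, W)
    fused = m[None] * rgb_feat + (1.0 - m)[None] * ir_feat  # (C, H, W)

    # Spatial attention from channel-wise average and max pooling;
    # in the real module this would pass through a learned conv layer.
    avg = fused.mean(axis=0)                               # (H, W)
    mx = fused.max(axis=0)                                 # (H, W)
    attn = sigmoid(avg + mx)                               # (H, W)

    # Reweight the fused map so salient pixels (small targets) dominate.
    return fused * attn[None]
```

With a strongly positive mask logit the output follows the RGB branch, modulated by the attention map; a strongly negative logit defers to the infrared branch. The SR enhancement is training-time-only supervision, so it would not appear in an inference path like this.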
Problem

Research questions and friction points this paper is trying to address.

small target detection
remote sensing images
visible-infrared fusion
complex background
cross-modal misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-Enhanced Attention Fusion
Visible-Infrared Fusion
Small Target Detection
Structural Representation Enhancement
Lightweight Network
Qianqian Zhang
Ph.D. Candidate, State University of New York at Binghamton
Machine Learning, Data Science, Operations Research, Artificial Intelligence, Medical Image Analytics
Xiaolong Jia
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
Ahmed M. Abdelmoniem
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
Li Zhou
Institute of Software, Chinese Academy of Sciences
Quantum Computing, Formal Verification
Junshe An
National Space Science Center, Chinese Academy of Sciences, Beijing, 100190, China; School of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing, China