🤖 AI Summary
Infrared UAV target tracking suffers from low accuracy and poor robustness because targets appear weak against complex backgrounds in thermal imagery. To address this, we propose SiamDFF, a lightweight dynamic feature fusion Siamese framework tailored to infrared UAV targets. Our contributions are threefold: (1) a selective target enhancement network (STEN) that uses intensity-aware multi-head cross-attention to adaptively highlight important regions in both the template and search branches; (2) dynamic spatial and channel feature aggregation modules (DSFAM and DCFAM) that fuse local details with global context under spatial-attention guidance and blend the STEN-enhanced template with the original one to suppress background interference; and (3) a hierarchical target-aware contextual attention knowledge distillation mechanism that transfers the teacher's target prior to the student backbone without extra inference-time cost. Evaluated on a real-world infrared UAV dataset, our method achieves real-time performance (>30 FPS) and significantly outperforms state-of-the-art approaches, improving mAP by 5.2%. Notably, it demonstrates markedly enhanced tracking stability under low-contrast and cluttered background conditions.
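The cross-attention idea above can be illustrated with a minimal sketch: search-frame tokens query template tokens, and attention logits are optionally biased by a per-token intensity prior. This is one plausible reading of "intensity-aware" cross-attention; the function name, identity projections, and the log-intensity bias are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(search_feats, template_feats, intensity=None, num_heads=4):
    """Sketch of multi-head cross-attention: search tokens attend to template tokens.

    search_feats: (N_s, d), template_feats: (N_t, d), d divisible by num_heads.
    intensity: optional (N_t,) positive prior over template tokens (hypothetical
    "intensity-aware" bias). Projections are identity here; a real module learns them.
    """
    n_s, d = search_feats.shape
    n_t, _ = template_feats.shape
    d_h = d // num_heads
    q = search_feats.reshape(n_s, num_heads, d_h).transpose(1, 0, 2)    # (h, N_s, d_h)
    k = template_feats.reshape(n_t, num_heads, d_h).transpose(1, 0, 2)  # (h, N_t, d_h)
    v = k  # values share the template projection for brevity
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)                    # (h, N_s, N_t)
    if intensity is not None:
        # Bias attention toward high-intensity (bright) template tokens.
        logits = logits + np.log(intensity + 1e-6)
    attn = softmax(logits, axis=-1)
    out = attn @ v                                                      # (h, N_s, d_h)
    return out.transpose(1, 0, 2).reshape(n_s, d)
```

With a uniform intensity prior the bias is a constant shift of the logits, so softmax leaves the attention unchanged; non-uniform priors steer attention toward bright target regions.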
📝 Abstract
Unmanned aerial vehicle (UAV) target tracking based on thermal infrared imaging is one of the key sensing technologies for anti-UAV applications. However, infrared UAV targets often exhibit weak appearance features against complex backgrounds, posing significant challenges to accurate tracking. To address these problems, we introduce SiamDFF, a novel dynamic feature fusion Siamese network that integrates feature enhancement and global contextual attention knowledge distillation for infrared UAV target (IRUT) tracking. SiamDFF incorporates a selective target enhancement network (STEN), a dynamic spatial feature aggregation module (DSFAM), and a dynamic channel feature aggregation module (DCFAM). The STEN employs intensity-aware multi-head cross-attention to adaptively enhance important regions in both the template and search branches. The DSFAM enhances multi-scale UAV target features by integrating local details with global features under spatial attention guidance within the search frame. The DCFAM fuses the mixed template generated by STEN with the original template, preventing excessive background interference in the template and thereby sharpening the emphasis on UAV target region features within the search frame. Furthermore, to strengthen the network's feature extraction for IRUTs without adding extra computational burden, we propose a novel tracking-specific target-aware contextual attention knowledge distiller. It transfers the target prior from the teacher network to the student model, significantly improving the student network's focus on informative regions at each hierarchical level of the backbone. Extensive experiments on real infrared UAV datasets demonstrate that the proposed approach outperforms state-of-the-art trackers under complex backgrounds while achieving real-time tracking speed.
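The hierarchical distillation objective can be sketched as a generic attention-transfer loss: at each backbone level, the teacher's and student's spatial attention maps are matched, with an optional target mask upweighting the UAV region (a minimal sketch of the idea; the actual distiller, its normalization, and its weighting in the paper may differ, and all names here are assumptions).

```python
import numpy as np

def spatial_attention_map(feats):
    """Collapse a (C, H, W) feature map into an L2-normalized spatial attention vector."""
    a = (feats ** 2).mean(axis=0).ravel()          # channel-wise energy per location
    return a / (np.linalg.norm(a) + 1e-8)

def distill_loss(teacher_levels, student_levels, target_masks=None):
    """Hierarchical attention-transfer loss summed over backbone levels.

    teacher_levels / student_levels: lists of (C, H, W) feature maps, one per level.
    target_masks: optional list of (H, W) masks emphasizing the UAV target region.
    """
    loss = 0.0
    for i, (t, s) in enumerate(zip(teacher_levels, student_levels)):
        diff = (spatial_attention_map(t) - spatial_attention_map(s)) ** 2
        if target_masks is not None:
            # Upweight discrepancies inside the (hypothetical) target-prior mask.
            diff = diff * (1.0 + target_masks[i].ravel())
        loss += diff.sum()
    return loss / len(teacher_levels)
```

Because the loss is computed only during training, the student pays no inference-time cost, consistent with the "no extra computational burden" claim.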