🤖 AI Summary
Existing CNN-Transformer hybrid architectures for multi-frame infrared small target detection (IRSTD) struggle to jointly model scale-sensitive local features and dynamic motion patterns. To address this, we propose LVNet—a lightweight and efficient hybrid network. Its key innovations are: (i) a multi-scale CNN-based front-end that replaces ViT’s linear patch embedding to explicitly capture low-level, scale-sensitive infrared small target features; and (ii) a U-shaped video Transformer architecture that enhances spatiotemporal dynamic perception. Evaluated on the IRDST and NUDT-MIRSDT benchmarks, LVNet achieves 5.63% and 18.36% higher nIoU than the state-of-the-art LMAFormer, respectively, while requiring only 1/221 of the parameters and 1/92 to 1/21 of the computational cost. This demonstrates LVNet’s superior balance of detection accuracy, inference efficiency, and practical deployability.
📝 Abstract
Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. Hybrid architectures combining CNNs and Transformers show great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves improvements of 5.63% and 18.36% in nIoU on the two datasets, respectively, while using only 1/221 of the parameters and 1/92 and 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
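To make the abstract's core contrast concrete, the sketch below illustrates the two front-end styles being compared: a ViT-style linear patch embedding (one learned projection per flattened patch, no explicit multi-scale locality) versus a multi-scale convolutional front-end (stacked responses at several kernel sizes, preserving local spatial bias). This is an illustrative NumPy toy, not the paper's actual LVNet implementation; all function names, kernel sizes, and the use of averaging filters as stand-ins for learned weights are assumptions for exposition.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid-mode 2D correlation via explicit loops (no SciPy dependency)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def vit_patch_embed(frame, patch=8, dim=16, seed=0):
    """ViT-style linear patch embedding: flatten each non-overlapping
    patch and project it with a single (here random) matrix. Fine detail
    inside a patch is mixed by one global linear map."""
    rng = np.random.default_rng(seed)
    H, W = frame.shape
    proj = rng.standard_normal((patch * patch, dim))
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p = frame[i:i + patch, j:j + patch].reshape(-1)
            tokens.append(p @ proj)
    return np.stack(tokens)  # (num_patches, dim)

def multiscale_cnn_frontend(frame, kernel_sizes=(3, 5, 7)):
    """Hypothetical multi-scale CNN front-end: convolve the frame with
    filters of several sizes (averaging kernels stand in for learned
    ones) and stack the responses, keeping per-pixel, per-scale local
    features explicit instead of collapsing them into patch tokens."""
    feats = []
    for k in kernel_sizes:
        kern = np.ones((k, k)) / (k * k)
        pad = k // 2
        padded = np.pad(frame, pad, mode="reflect")
        feats.append(conv2d_valid(padded, kern))  # same size as frame
    return np.stack(feats)  # (num_scales, H, W)

# A 32x32 toy infrared frame: flat background plus one 2x2 "small target".
frame = np.zeros((32, 32))
frame[15:17, 15:17] = 1.0
print(vit_patch_embed(frame).shape)        # (16, 16): 16 patch tokens
print(multiscale_cnn_frontend(frame).shape)  # (3, 32, 32): full-res maps
```

The point of the contrast: the patch embedding reduces the frame to coarse tokens, while the convolutional front-end retains full-resolution responses at multiple receptive-field sizes, which is the property the abstract argues matters for scale-sensitive small targets.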