Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CNN-Transformer hybrid architectures for multi-frame infrared small target detection (IRSTD) struggle to jointly model scale-sensitive local features and dynamic motion patterns. To address this, we propose LVNet, a lightweight and efficient hybrid network. Its key innovations are: (i) a multi-scale CNN front-end that replaces ViT's linear patch embedding to explicitly capture low-level, scale-sensitive features of infrared small targets; and (ii) a U-shaped video Transformer architecture that enhances spatiotemporal dynamic perception. Evaluated on the IRDST and NUDT-MIRSDT benchmarks, LVNet achieves 5.63% and 18.36% higher nIoU than the state-of-the-art LMAFormer, respectively, while requiring only 1/221 of the parameters and 1/92 to 1/21 of the computational cost. This demonstrates LVNet's superior balance of detection accuracy, inference efficiency, and practical deployability.

📝 Abstract
Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
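The abstract's central claim is that ViT-style linear patch embedding discards the scale-sensitive local structure that matters for small targets, whereas a multi-scale convolutional front-end preserves it. The contrast can be illustrated with a minimal pure-Python toy (single channel, identity projection); all function names here are illustrative sketches of the general idea, not the paper's actual LVNet implementation:

```python
def conv2d(img, kernel):
    """'Valid' 2-D cross-correlation on a list-of-lists image."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += img[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

def patch_embed(img, p):
    """ViT-style embedding: split into non-overlapping p x p patches and
    flatten each into a token (projection omitted). A dim target occupying
    a few pixels becomes a tiny perturbation of one long token vector."""
    h, w = len(img), len(img[0])
    return [[img[i + di][j + dj] for di in range(p) for dj in range(p)]
            for i in range(0, h, p) for j in range(0, w, p)]

def multiscale_stem(img, kernels):
    """Toy multi-scale CNN front-end: one response map per kernel size,
    so targets of different apparent scales light up in different maps."""
    return [conv2d(img, k) for k in kernels]
```

For example, a 4x4 frame with a single bright pixel (a "small target") produces a response in every 3x3 window that covers it under the conv stem, while the patch embedding buries that pixel inside one flattened token, leaving the Transformer's later layers to recover the local structure on their own.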
Problem

Research questions and friction points this paper is trying to address.

Enhance multi-frame infrared small target detection accuracy.
Improve low-level feature learning in hybrid architectures.
Reduce computational cost and parameters in detection models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines CNNs and Transformers for IRSTD
Introduces multi-scale CNN for local features
Uses U-shaped video Transformer for motion