Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving Infrared Small Target Detection

📅 2025-11-11
🤖 AI Summary
In moving infrared small target detection (IRSTD), low signal-to-noise ratio and strong background clutter impede effective spatio-temporal feature modeling. To address this, we propose TDCNet, a network that integrates temporal differencing with 3D convolution. Its key contributions are: (1) a temporal difference convolution (TDC) re-parameterization module that explicitly captures multi-scale motion cues across frames; and (2) a TDC-guided spatio-temporal attention mechanism that suppresses background interference and enhances motion-sensitive feature responses. TDCNet runs a TDC-based backbone in parallel with a 3D convolution backbone and links the two through cross-attention for joint spatio-temporal feature learning. Experiments on IRSTD-UAV and public infrared datasets show that the method achieves state-of-the-art detection performance.

📝 Abstract
Moving infrared small target detection (IRSTD) plays a critical role in practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based search systems. Moving IRSTD remains highly challenging due to weak target features and complex background interference. Accurate spatio-temporal feature modeling is crucial for moving target detection and is typically achieved through either temporal differences or spatio-temporal (3D) convolutions. Temporal differencing explicitly leverages motion cues but has limited capability in extracting spatial features, whereas 3D convolution effectively represents spatio-temporal features yet lacks explicit awareness of motion dynamics along the temporal dimension. In this paper, we propose a novel moving IRSTD network (TDCNet), which effectively extracts and enhances spatio-temporal features for accurate target detection. Specifically, we introduce a novel temporal difference convolution (TDC) re-parameterization module that comprises three parallel TDC blocks designed to capture contextual dependencies across different temporal ranges. Each TDC block fuses temporal difference and 3D convolution into a unified spatio-temporal convolution representation. This re-parameterized module can effectively capture multi-scale motion contextual features while suppressing pseudo-motion clutter in complex backgrounds, significantly improving detection performance. Moreover, we propose a TDC-guided spatio-temporal attention mechanism that performs cross-attention between the spatio-temporal features from the TDC-based backbone and those from a parallel 3D backbone. This mechanism models their global semantic dependencies to refine the current frame's features. Extensive experiments on IRSTD-UAV and public infrared datasets demonstrate that our TDCNet achieves state-of-the-art performance in moving target detection.
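The re-parameterization idea in the abstract rests on linearity: a temporal difference followed by a convolution is itself a convolution with a merged kernel. The sketch below illustrates this along the time axis only, using plain Python lists. It is not the paper's TDC formulation, which the abstract does not spell out; the function names, the forward-difference convention, and the toy values are all assumptions made for illustration.

```python
# Illustrative time-only sketch of folding a temporal difference into a kernel.
# Not the TDCNet implementation; all names and values here are hypothetical.

def temporal_difference(x):
    """Forward difference along time: d[t] = x[t+1] - x[t]."""
    return [x[t + 1] - x[t] for t in range(len(x) - 1)]


def correlate_valid(x, w):
    """'Valid' sliding dot product, the operation deep-learning conv layers compute."""
    k = len(w)
    return [sum(w[j] * x[t + j] for j in range(k)) for t in range(len(x) - k + 1)]


def merge_difference_into_kernel(w):
    """Fold the difference step into the kernel: m[j] = w[j-1] - w[j]."""
    padded = [0.0] + list(w) + [0.0]
    return [padded[j] - padded[j + 1] for j in range(len(w) + 1)]


x = [1.0, 2.0, 4.0, 7.0, 11.0]   # intensity of one pixel over five frames
w = [1.0, 2.0]                   # temporal kernel applied to the differences

# Two-step path: difference first, then convolve.
two_step = correlate_valid(temporal_difference(x), w)

# One-step path: a single convolution with the merged kernel.
one_step = correlate_valid(x, merge_difference_into_kernel(w))

print(two_step)  # [5.0, 8.0, 11.0]
print(one_step)  # [5.0, 8.0, 11.0]
```

Because both paths give the same output, a block of this kind can keep the explicit difference branch during training and collapse to a single kernel at inference, which is the usual motivation for re-parameterization. How TDCNet combines its three temporal ranges is not described here.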
Problem

Research questions and friction points this paper is trying to address.

Detecting moving infrared small targets with weak features
Overcoming complex background interference in IRSTD
Integrating motion cues and spatio-temporal feature representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal difference convolution re-parameterization module
Multi-scale motion contextual feature capture
TDC-guided spatio-temporal attention mechanism
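The TDC-guided attention item can be pictured as ordinary scaled dot-product cross-attention between two sets of feature vectors. The following is a minimal sketch under stated assumptions: it takes queries from the TDC branch and keys and values from the 3D branch (the source does not say which branch plays which role), it omits learned projections, and `tdc_tokens` and `conv3d_tokens` are hypothetical toy inputs.

```python
import math

# Minimal cross-attention sketch; not the TDCNet attention module.


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    refined = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        refined.append([
            sum(wt * v[c] for wt, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return refined


# Hypothetical token features from the two parallel backbones.
tdc_tokens = [[1.0, 0.0], [0.0, 1.0]]                  # assumed query source
conv3d_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # assumed key/value source

refined = cross_attention(tdc_tokens, conv3d_tokens, conv3d_tokens)
print(refined)
```

Each output row is a convex combination of the value vectors, so a query that matches a particular key more strongly pulls the refined feature toward that key's value.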
Houzhang Fang
School of Computer Science and Technology, Xidian University, China
Shukai Guo
School of Computer Science and Technology, Xidian University, China
Qiuhuan Chen
School of Computer Science and Technology, Xidian University, China
Yi Chang
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China
Luxin Yan
Huazhong University of Science and Technology
Computer Vision, Image Processing, Deep Learning