AI Summary
Existing sliding-window-based multi-frame infrared small target detection methods neglect global temporal dependencies, leading to information loss, computational redundancy, and performance degradation. To address this, we propose a bidirectional temporal information propagation framework that recursively fuses local and global spatio-temporal features; to our knowledge, it is the first of its kind. Specifically, we design a Local Temporal Motion Fusion (LTMF) module to model short-term dynamics and a Global Temporal Motion Fusion (GTMF) module to capture long-range temporal dependencies. Furthermore, we introduce a Spatio-Temporal Fusion (STF) loss to enable end-to-end joint optimization of the entire video clip. Our approach eliminates reliance on fixed sliding windows, significantly improving detection accuracy and robustness for weak and small targets. Extensive experiments on multiple infrared video benchmarks demonstrate state-of-the-art performance while maintaining efficient inference speed.
Abstract
Moving infrared small target detection is broadly adopted in infrared search and track systems and has attracted considerable research attention in recent years. Existing learning-based multi-frame methods mainly aggregate information from adjacent frames in a sliding-window fashion to assist the detection of the current frame. However, sliding-window-based methods do not jointly optimize the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously exploits local temporal information from adjacent frames and global temporal information from past and future frames in a recursive fashion. Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model the local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the globally propagated feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and fed into the detection head. In addition, the entire video clip is jointly optimized by the traditional detection loss and an additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves state-of-the-art performance but also offers fast inference speed.
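The propagation scheme described above can be sketched in miniature. This is not the paper's implementation: the real LTMF and GTMF modules are learned networks operating on convolutional feature maps, whereas here `ltmf` and `gtmf` are hypothetical stand-ins (simple averaging and a fixed blend) used only to show the data flow of recursive bidirectional propagation over a clip.

```python
# Minimal sketch of BIRD-style bidirectional recursive propagation.
# Frame features are plain float vectors; ltmf/gtmf are placeholder
# fusions standing in for the paper's learned LTMF/GTMF modules.

def ltmf(prev_f, cur_f, next_f):
    # Local Temporal Motion Fusion (placeholder): fuse a target frame
    # with its two adjacent frames by element-wise averaging.
    return [(a + b + c) / 3.0 for a, b, c in zip(prev_f, cur_f, next_f)]

def gtmf(propagated, local):
    # Global Temporal Motion Fusion (placeholder): blend the recursively
    # propagated global state with the local fusion feature.
    return [0.5 * p + 0.5 * l for p, l in zip(propagated, local)]

def propagate(frames, reverse=False):
    # One propagation branch: a global state is carried recursively across
    # the whole clip, so each frame sees information beyond a fixed window.
    order = range(len(frames) - 1, -1, -1) if reverse else range(len(frames))
    state, out = None, [None] * len(frames)
    for t in order:
        prev_f = frames[max(t - 1, 0)]          # clamp at clip boundaries
        next_f = frames[min(t + 1, len(frames) - 1)]
        local = ltmf(prev_f, frames[t], next_f)
        state = local if state is None else gtmf(state, local)
        out[t] = state
    return out

def bird_features(frames):
    # Fuse the forward and backward branch outputs per frame (element-wise
    # mean here); the result would feed the detection head.
    fwd = propagate(frames)
    bwd = propagate(frames, reverse=True)
    return [[(f + b) / 2.0 for f, b in zip(ff, bb)]
            for ff, bb in zip(fwd, bwd)]
```

Note how, unlike a fixed sliding window, the carried `state` lets frame `t` receive information from every earlier frame in the forward branch and every later frame in the backward branch, which is the core argument for bidirectional propagation over windowed aggregation.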