🤖 AI Summary
Infrared small target detection challenges the direct application of vision foundation models: targets emit weak radiation, carry little semantic content, and single-frame and multi-frame scenarios are typically handled inconsistently. To address these issues, this work proposes SPIRIT, a unified framework that adapts vision foundation models to both single-frame and video-based detection through lightweight physics-informed plug-ins. SPIRIT incorporates a rank-sparse decomposition module that suppresses background clutter and enhances target signatures, along with a temporal soft spatial prior that models inter-frame associations. By integrating physics-informed feature reconstruction (PIFR) and prior-guided memory attention (PGMA), the framework bridges vision foundation models with the intrinsic characteristics of infrared imaging. Extensive experiments demonstrate that SPIRIT achieves state-of-the-art performance across multiple infrared small target detection benchmarks, significantly outperforming existing methods.
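The rank-sparse decomposition idea mentioned above can be illustrated with a toy NumPy sketch: the infrared frame (or feature map) is split into a low-rank background component and a sparse component that retains small, localized targets. This is a generic alternating truncated-SVD / soft-thresholding scheme for illustration only; the function name and parameters are assumptions, not the paper's actual PIFR module.

```python
import numpy as np

def rank_sparse_split(D, rank=1, sparse_thresh=0.5, iters=10):
    """Approximate D ~ L + S, where L is a low-rank background
    and S is a sparse residual holding target-like spikes.
    Illustrative sketch, not the paper's PIFR implementation."""
    S = np.zeros_like(D)
    for _ in range(iters):
        # Low-rank step: truncated SVD of the spike-removed frame
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: soft-threshold the remaining residual,
        # zeroing weak background leakage and keeping strong spikes
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
    return L, S
```

On a smooth (rank-one) synthetic background with a single bright spike, the spike ends up almost entirely in `S` while `L` tracks the background.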
📝 Abstract
Infrared small target detection (IRSTD) is crucial for surveillance and early warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes the direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, a physics-informed feature reconstruction (PIFR) module refines features by approximating a rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, a prior-guided memory attention (PGMA) mechanism injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and state-of-the-art performance.
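The general mechanism of injecting a soft spatial prior into memory cross-attention can be sketched as an additive log-domain bias on the attention logits: memory positions favored by the prior keep their scores, while positions the prior deems implausible are strongly down-weighted. This is a minimal single-head NumPy sketch under assumed shapes and names; it is not the paper's exact PGMA design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_guided_cross_attention(Q, K, V, prior, eps=1e-6):
    """Cross-attention whose logits are additively biased by a
    soft spatial prior over memory positions (log-domain bias).
    Q: (n_query, d), K/V: (n_mem, d), prior: (n_mem,) nonnegative.
    Illustrative sketch; not the paper's PGMA implementation."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)          # (n_query, n_mem) similarity scores
    logits = logits + np.log(prior + eps)  # suppress prior-unsupported memories
    return softmax(logits, axis=-1) @ V    # attention-weighted readout
```

With a prior sharply peaked on one memory slot (e.g., a Gaussian around the previous target location, here collapsed to a near-delta for illustration), the readout collapses to that slot's value; a uniform prior recovers plain appearance-only attention.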