🤖 AI Summary
Infrared small target detection challenges the direct application of vision foundation models: targets emit weak radiation, carry little semantic content, and single-frame and multi-frame scenarios are typically handled inconsistently. To address these issues, this work proposes SPIRIT, a unified framework that adapts vision foundation models to both single-frame and video-based detection through lightweight physics-informed plug-ins. SPIRIT incorporates a rank-sparse decomposition module that suppresses background clutter and enhances target signatures, along with a temporal soft spatial prior that models inter-frame associations. By integrating physics-informed feature reconstruction (PIFR) and prior-guided memory attention (PGMA), the framework bridges vision foundation models with the intrinsic characteristics of infrared imaging. Extensive experiments demonstrate that SPIRIT achieves state-of-the-art performance across multiple infrared small target detection benchmarks, significantly outperforming existing methods.
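The rank-sparse decomposition idea mentioned above can be illustrated with a toy NumPy sketch: the infrared frame (or feature map) is split into a low-rank background component and a sparse component that retains small, localized targets. This is a generic alternating truncated-SVD / soft-thresholding scheme for illustration only; the function name and parameters are assumptions, not the paper's actual PIFR module.

```python
import numpy as np

def rank_sparse_split(D, rank=1, sparse_thresh=0.5, iters=10):
    """Approximate D ~ L + S, where L is a low-rank background
    and S is a sparse residual holding target-like spikes.
    Illustrative sketch, not the paper's PIFR implementation."""
    S = np.zeros_like(D)
    for _ in range(iters):
        # Low-rank step: truncated SVD of the spike-removed frame
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: soft-threshold the remaining residual,
        # zeroing weak background leakage and keeping strong spikes
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - sparse_thresh, 0.0)
    return L, S
```

On a smooth (rank-one) synthetic background with a single bright spike, the spike ends up almost entirely in `S` while `L` tracks the background.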
📝 Abstract
Infrared small target detection (IRSTD) is crucial for surveillance and early warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes the direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, a physics-informed feature reconstruction (PIFR) module refines features by approximating a rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, a prior-guided memory attention (PGMA) mechanism injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and state-of-the-art performance.
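The general mechanism of injecting a soft spatial prior into memory cross-attention can be sketched as an additive log-domain bias on the attention logits: memory positions favored by the prior keep their scores, while positions the prior deems implausible are strongly down-weighted. This is a minimal single-head NumPy sketch under assumed shapes and names; it is not the paper's exact PGMA design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_guided_cross_attention(Q, K, V, prior, eps=1e-6):
    """Cross-attention whose logits are additively biased by a
    soft spatial prior over memory positions (log-domain bias).
    Q: (n_query, d), K/V: (n_mem, d), prior: (n_mem,) nonnegative.
    Illustrative sketch; not the paper's PGMA implementation."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)          # (n_query, n_mem) similarity scores
    logits = logits + np.log(prior + eps)  # suppress prior-unsupported memories
    return softmax(logits, axis=-1) @ V    # attention-weighted readout
```

With a prior sharply peaked on one memory slot (e.g., a Gaussian around the previous target location, here collapsed to a near-delta for illustration), the readout collapses to that slot's value; a uniform prior recovers plain appearance-only attention.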