Leveraging Language Prior for Infrared Small Target Detection

📅 2025-07-17
🤖 AI Summary
Infrared small target detection (IRSTD) is difficult because targets are very small and sparsely distributed, and existing methods rely solely on the image modality. To address this, we propose a multimodal IRSTD framework that incorporates language priors: (1) we curate LangIR, an image-text paired infrared dataset whose descriptions of target locations are generated with the GPT-4 vision model using careful prompt engineering; (2) we derive language-guided attention weights from the language prior and use them to guide detection. Experiments on LangIR report relative percentage differences over the state-of-the-art method of 9.74% in IoU, 13.02% in nIoU, 1.25% in probability of detection (Pd), and 67.87% in false-alarm rate (Fa) on the NUAA-SIRST subset, and of 4.41%, 2.04%, 2.01%, and 113.43% on the IRSTD-1k subset. These figures are relative percentage differences, not absolute percentage-point changes.

📝 Abstract
IRSTD (InfraRed Small Target Detection) detects small targets against blurry infrared backgrounds and is essential for various applications. The task is challenging because the targets are small and sparsely distributed in infrared small target datasets. Although existing IRSTD methods and datasets have led to significant advances, they are limited by their reliance solely on the image modality. Recent advances in deep learning and large vision-language models have shown remarkable performance in various visual recognition tasks. In this work, we propose a novel multimodal IRSTD framework that incorporates language priors to guide small target detection. We use language-guided attention weights derived from the language prior to strengthen the model, combining textual information with image data. Using the state-of-the-art GPT-4 vision model, we generate text descriptions that provide the locations of small targets in infrared images, with careful prompt engineering to improve accuracy. Because no multimodal IR datasets exist, existing IRSTD methods rely solely on image data. To address this shortcoming, we have curated a multimodal infrared dataset that includes both image and text modalities for small target detection, expanding upon the popular IRSTD-1k and NUDT-SIRST datasets. We validate the effectiveness of our approach through extensive experiments and comprehensive ablation studies. The results demonstrate significant improvements over the state-of-the-art method, with relative percentage differences of 9.74%, 13.02%, 1.25%, and 67.87% in IoU, nIoU, Pd, and Fa on the NUAA-SIRST subset, and 4.41%, 2.04%, 2.01%, and 113.43% on the IRSTD-1k subset of the LangIR dataset, respectively.
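
The abstract reports "relative percentage differences" without stating the formula. The sketch below assumes the common symmetric definition (absolute difference divided by the mean of the two values); this is an assumption, not something the source confirms. It is contrasted with a plain relative change against the baseline, which for a decrease in a non-negative quantity such as Fa cannot be larger than 100% in magnitude. The function names and sample values are hypothetical.

```python
def relative_percentage_difference(baseline: float, proposed: float) -> float:
    """Symmetric relative difference, in percent, between two metric values.

    Assumed definition: |proposed - baseline| / mean(|baseline|, |proposed|) * 100.
    Under this definition the result lies between 0 and 200.
    """
    mean = (abs(baseline) + abs(proposed)) / 2.0
    if mean == 0.0:
        return 0.0  # both values are zero, so there is no difference
    return abs(proposed - baseline) / mean * 100.0


def relative_change(baseline: float, proposed: float) -> float:
    """Signed change relative to the baseline, in percent."""
    if baseline == 0.0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical false-alarm rates, not values from the paper.
    baseline_fa, proposed_fa = 40.0, 10.0
    print(relative_percentage_difference(baseline_fa, proposed_fa))
    print(relative_change(baseline_fa, proposed_fa))
```

With these hypothetical inputs the symmetric measure is above 100 while the plain relative change is a 75% decrease, so how a figure such as 113.43% should be read depends on which definition the authors used.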

Problem

Research questions and friction points this paper is trying to address.

Detecting small targets against blurry infrared backgrounds
Overcoming limitations of image-only modality in IRSTD
Creating multimodal dataset for language-guided IRSTD

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal IRSTD framework with language priors
Language-guided attention weights enhance target detection
GPT-4 vision model generates text descriptions of target locations
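
The page does not describe how the language-guided attention weights are computed. As a rough illustration of the general idea only, the sketch below scores each spatial location of a visual feature map against a single text embedding, normalises the scores with a softmax, and uses the result to reweight the visual features. The function name, the scaled dot-product scoring, and the residual `1 + weight` reweighting are all assumptions made for this example, not the authors' method.

```python
import math
from typing import List, Tuple

Vector = List[float]


def language_guided_attention(
    visual_feats: List[Vector], text_embedding: Vector
) -> Tuple[List[float], List[Vector]]:
    """Reweight per-location visual features by similarity to a text embedding.

    visual_feats: N feature vectors, one per spatial location, each of length d.
    text_embedding: one vector of length d summarising the text description.
    Returns (attention weights over locations, reweighted features).
    """
    d = len(text_embedding)
    scale = math.sqrt(d)

    # Scaled dot-product score between each location and the text embedding.
    scores = [
        sum(v * t for v, t in zip(feat, text_embedding)) / scale
        for feat in visual_feats
    ]

    # Numerically stable softmax over spatial locations.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Residual reweighting: locations matching the text are amplified,
    # and no location is suppressed below its original value.
    reweighted = [
        [(1.0 + w) * v for v in feat] for feat, w in zip(visual_feats, weights)
    ]
    return weights, reweighted


if __name__ == "__main__":
    # Three hypothetical locations with 2-dimensional features.
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    text = [2.0, 0.0]
    weights, out = language_guided_attention(feats, text)
    print(weights)
    print(out)
```

In a real detector the text embedding would come from a language encoder and the weights would be learned jointly with the visual backbone; here both inputs are fixed toy vectors so that the weighting step can be read in isolation.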