Leveraging Language Prior for Infrared Small Target Detection

📅 2025-07-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Infrared small target detection (IRSTD) is limited by extremely small target sizes, the sparse distribution of targets, and reliance on visual features alone. To address this, we propose a novel multimodal IRSTD framework that integrates language priors: (1) we construct LangIR, an image-text paired infrared dataset whose descriptions of target locations are generated with the GPT-4 vision model and careful prompt engineering; (2) we design a language-guided attention mechanism that combines textual information with image features. Our key contribution is bringing language priors into IRSTD, which has so far relied on the image modality only. Experiments on LangIR report relative percentage differences over the state-of-the-art method on the NUAA-SIRST and IRSTD-1k subsets of 9.74% and 4.41% in IoU, 13.02% and 2.04% in nIoU, 1.25% and 2.01% in probability of detection (Pd), and 67.87% and 113.43% in false alarm rate (Fa), respectively. These figures are relative differences rather than absolute changes, which is why the Fa value for IRSTD-1k can exceed 100%.
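As a rough illustration of two of the metrics named above, the sketch below computes pixel-level IoU and a pixel-level false alarm rate on binary masks. It assumes the definitions commonly used in IRSTD work (IoU as intersecting over united target pixels, Fa as falsely predicted pixels over all image pixels); the summary does not spell out the paper's exact formulas, and the function names and toy masks are hypothetical. Pd is omitted because it requires target-level matching of predicted and true targets.

```python
def pixel_iou(pred, truth):
    """Intersection over union of two binary masks (lists of rows of 0/1)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1 (an assumption).
    return inter / union if union else 1.0


def false_alarm_rate(pred, truth):
    """Fraction of all pixels predicted as target where the truth is background."""
    false_pixels = total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            false_pixels += 1 if (p and not t) else 0
            total += 1
    return false_pixels / total


# Hypothetical 3x3 masks: one true target pixel, three predicted pixels.
truth = [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 0, 0],
        [1, 0, 0]]

iou = pixel_iou(pred, truth)
fa = false_alarm_rate(pred, truth)
print(f"IoU = {iou:.3f}, Fa = {fa:.3f}")
```

Because the targets occupy only a few pixels, a single stray prediction changes both values noticeably, which is one reason small shifts in these metrics are reported so carefully.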

📝 Abstract

Infrared small target detection (IRSTD) locates small targets against blurry infrared backgrounds and is essential for various applications. The task is challenging because the targets are small and sparsely distributed in infrared small target datasets. Although existing IRSTD methods and datasets have led to significant advancements, they rely solely on the image modality. Recent advances in deep learning and large vision-language models have shown remarkable performance in various visual recognition tasks. In this work, we propose a novel multimodal IRSTD framework that incorporates language priors to guide small target detection. We use language-guided attention weights derived from the language prior to strengthen the model's detection ability, combining textual information with image data. Using the state-of-the-art GPT-4 vision model, we generate text descriptions that provide the locations of small targets in infrared images, employing careful prompt engineering to improve their accuracy. Because no multimodal IR datasets exist, existing IRSTD methods have had only image data to work with. To address this shortcoming, we have curated a multimodal infrared dataset, LangIR, that includes both image and text modalities for small target detection, expanding upon the popular IRSTD-1k and NUDT-SIRST datasets. We validate the effectiveness of our approach through extensive experiments and comprehensive ablation studies. The results demonstrate significant improvements over the state-of-the-art method, with relative percentage differences of 9.74%, 13.02%, 1.25%, and 67.87% in IoU, nIoU, Pd, and Fa on the NUAA-SIRST subset, and 4.41%, 2.04%, 2.01%, and 113.43% on the IRSTD-1k subset of the LangIR dataset, respectively.
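The abstract reports "relative percentage differences" without giving a formula. One common definition divides the absolute difference by the mean of the two values, and under that definition a result above 100% is possible, whereas a plain percentage reduction from a baseline cannot exceed 100%. The sketch below contrasts the two; it assumes this symmetric definition, which may not be the one the authors used, and the input values are hypothetical rather than taken from the paper.

```python
def relative_percentage_difference(a, b):
    """Symmetric difference: |a - b| divided by the mean magnitude, in percent."""
    mean = (abs(a) + abs(b)) / 2
    if mean == 0:
        return 0.0  # both values are zero, so there is no difference
    return abs(a - b) / mean * 100


def percent_change(baseline, new):
    """Signed change of `new` relative to a nonzero `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Hypothetical false alarm values (not from the paper): baseline 40, new method 10.
baseline_fa, new_fa = 40.0, 10.0

rpd = relative_percentage_difference(baseline_fa, new_fa)  # roughly 120
change = percent_change(baseline_fa, new_fa)               # roughly -75
print(f"relative percentage difference: {rpd:.2f}%")
print(f"percent change from baseline:   {change:.2f}%")
```

With these toy numbers the same improvement reads as about 120% under the symmetric definition and as a 75% reduction from the baseline, so the choice of formula matters when comparing reported gains.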

Problem

Research questions and friction points this paper is trying to address.

Detecting small targets against blurry infrared backgrounds

Overcoming the limitations of the image-only modality in IRSTD

Creating a multimodal dataset for language-guided IRSTD

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal IRSTD framework with language priors

Language-guided attention weights, derived from the language prior, enhance target detection

The GPT-4 vision model generates text descriptions of target locations
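The idea of language-guided attention weights can be sketched in a few lines. The example below scores each image location's feature vector against a single text embedding with a scaled dot product, turns the scores into weights with a softmax, and rescales the visual features by those weights. This is a generic attention sketch under stated assumptions, not the architecture from the paper: the function names, the single-vector text embedding, and the toy inputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_weights(visual_feats, text_embedding):
    """Score each spatial location against the text embedding (scaled dot product)."""
    scale = math.sqrt(len(text_embedding))
    scores = [
        sum(v * t for v, t in zip(feat, text_embedding)) / scale
        for feat in visual_feats
    ]
    return softmax(scores)


def apply_weights(visual_feats, weights):
    """Rescale each location's feature vector by its attention weight."""
    return [[w * v for v in feat] for feat, w in zip(visual_feats, weights)]


# Hypothetical toy inputs: three image locations with 2-D features,
# and one 2-D embedding standing in for an encoded text description.
visual_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_embedding = [0.0, 2.0]

weights = language_guided_weights(visual_feats, text_embedding)
guided = apply_weights(visual_feats, weights)
print(weights)
```

In this toy setup the second location is the one most aligned with the text embedding, so it receives the largest weight. A real model would learn the projections that map image features and text into a shared space instead of comparing raw vectors.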
