Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection

๐Ÿ“… 2025-11-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Infrared small target detection suffers from weak feature representation and severe background clutter, while existing text-guided methods rely heavily on manual annotations and suffer from inaccurate textual descriptions. To address these issues, this paper proposes DGSPNet, an end-to-end language-prompt-driven network. Its core contributions are: (1) a novel dual-granularity semantic prompting mechanism that jointly leverages coarse-grained textual priors and fine-grained visionโ€“language alignment, enabling annotation-free text guidance; and (2) text-guided channel attention (TGCA) and text-guided spatial attention (TGSA) modules that collaboratively enhance target sensitivity in both low-level texture and high-level semantic features. Extensive experiments demonstrate that DGSPNet achieves state-of-the-art performance on three benchmark datasets, significantly improving detection accuracy and robustness under complex background conditions.

๐Ÿ“ Abstract
Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language-prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but can also inherently leverage language prompts during inference without any annotation requirements. By fully exploiting the precision and conciseness of text descriptions, we further introduce text-guided channel attention (TGCA) and text-guided spatial attention (TGSA) mechanisms that enhance the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
Problem

Research questions and friction points this paper is trying to address.

Improving infrared small target detection accuracy
Overcoming inaccurate text descriptions in detection methods
Reducing reliance on manual annotations for target detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-granularity semantic prompts for language guidance
Text-guided channel and spatial attention mechanisms
Visual-to-textual mapping for fine-grained semantic descriptions
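The text-guided channel attention idea can be illustrated with a minimal sketch: a text prompt embedding is projected to per-channel gates that re-weight the visual feature map, steering the network toward text-indicated targets. This is an assumption-level illustration, not the paper's implementation; the function name, the projection matrix `W`, and all shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def text_guided_channel_attention(feat, text_emb, W):
    """Hypothetical TGCA-style block (illustration only).

    feat:     visual feature map, shape (C, H, W_sp)
    text_emb: text prompt embedding, shape (D,)
    W:        learned text-to-channel projection, shape (C, D)
    """
    # Project the text embedding to one gate per channel, squashed to (0, 1).
    gates = sigmoid(W @ text_emb)             # shape (C,)
    # Re-weight each visual channel by its text-derived gate.
    return feat * gates[:, None, None]        # shape (C, H, W_sp)

# Toy example: 4 channels, 8x8 spatial features, 16-dim text embedding.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
text_emb = rng.standard_normal(16)
W = rng.standard_normal((4, 16)) * 0.1
out = text_guided_channel_attention(feat, text_emb, W)
print(out.shape)  # (4, 8, 8)
```

A TGSA counterpart would follow the same pattern but produce an (H, W) spatial map instead of per-channel gates, so the two mechanisms modulate complementary axes of the feature tensor.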
๐Ÿ”Ž Similar Papers
No similar papers found.
Zixuan Wang
School of Automation, Northwestern Polytechnical University
Haoran Sun
School of Automation, Northwestern Polytechnical University
Jiaming Lu
School of Automation, Northwestern Polytechnical University
Wenxuan Wang
School of Automation, Northwestern Polytechnical University
Zhongling Huang
School of Automation, Northwestern Polytechnical University
Dingwen Zhang
School of Automation, Northwestern Polytechnical University
Xuelin Qian
Northwestern Polytechnical University
computer vision · machine learning · multimedia
Junwei Han
School of Automation, Northwestern Polytechnical University