🤖 AI Summary
This work addresses noisy pseudo-labels and unstable training in point-supervised infrared small target detection, both of which stem from the limited semantic representation of lightweight CNNs. The authors propose a hierarchical knowledge distillation framework driven by a Vision Foundation Model (VFM) that stays frozen and is used during training. The approach formulates point supervision as a bilevel optimization process: an inner loop adapts a VFM-embedded teacher on reweighted training samples, while an outer loop transfers knowledge to a lightweight student under validation guidance. A Semantic-Conditioned Affine Modulation (SCAM) module injects VFM semantics into CNN features at multiple layers. Combined with dynamic collaborative learning and cluster-level sample reweighting, the method mitigates pseudo-label noise and training-set bias, and it consistently improves detection accuracy and training stability across multiple backbone architectures and challenging scenarios.
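One way to read the bilevel structure described above is the generic form below. It is only a sketch: the symbols $\theta_T$ (teacher parameters), $\theta_S$ (student parameters), $w_{c(i)}$ (weight of the cluster that sample $i$ belongs to), $\tilde{y}_i$ (pseudo-mask evolved from the point label), $\ell$ (per-sample loss) and $\mathcal{L}_{\mathrm{val}}$ (validation-guided transfer objective) are all assumed notation, and the summary does not specify how the weights are computed or what the losses look like.

```latex
\begin{aligned}
\text{outer:}\quad & \min_{\theta_S}\ \mathcal{L}_{\mathrm{val}}\bigl(\theta_S,\ \theta_T^{*}(w)\bigr) \\
\text{inner:}\quad & \theta_T^{*}(w) = \arg\min_{\theta_T} \sum_{i=1}^{N} w_{c(i)}\, \ell\bigl(f_{\theta_T}(x_i),\ \tilde{y}_i\bigr)
\end{aligned}
```

In this reading, the inner problem fits the teacher to pseudo-masks whose influence is scaled per cluster, and the outer problem updates the student against a validation-guided objective that depends on the adapted teacher.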
📝 Abstract
Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.
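The abstract does not give the internal form of SCAM, so the following is only a minimal sketch of a generic semantic-conditioned affine modulation in the style of FiLM, written with NumPy. It assumes the VFM semantics are summarized as one vector per image and mapped by linear layers to a per-channel scale and shift; the function name `scam_modulate`, the parameter names, and the `(1 + gamma)` parameterization are hypothetical and not taken from the paper or its code.

```python
import numpy as np


def scam_modulate(cnn_feat, sem_vec, w_gamma, b_gamma, w_beta, b_beta):
    """Apply a per-channel affine modulation conditioned on a semantic vector.

    cnn_feat: (C, H, W) CNN feature map.
    sem_vec:  (D,) semantic descriptor (assumed to come from a frozen VFM).
    w_gamma, w_beta: (C, D) linear maps; b_gamma, b_beta: (C,) biases.
    """
    gamma = w_gamma @ sem_vec + b_gamma  # per-channel scale offset, shape (C,)
    beta = w_beta @ sem_vec + b_beta     # per-channel shift, shape (C,)
    # (1 + gamma) makes zero-initialized parameters act as the identity.
    return (1.0 + gamma)[:, None, None] * cnn_feat + beta[:, None, None]


# Inject the same semantic vector into feature maps from two layers
# with different channel counts (all values here are synthetic).
rng = np.random.default_rng(0)
sem = rng.standard_normal(8)
feats = [rng.standard_normal((c, 16, 16)) for c in (4, 8)]

modulated = []
for feat in feats:
    c = feat.shape[0]
    # Zero-initialized parameters: the modulation starts as a no-op.
    params = (np.zeros((c, 8)), np.zeros(c), np.zeros((c, 8)), np.zeros(c))
    modulated.append(scam_modulate(feat, sem, *params))
```

Each layer gets its own scale and shift parameters, so the same semantic vector can affect shallow and deep features differently. Starting from the identity is a common choice for modulation layers because it leaves the backbone's behavior unchanged at initialization.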