🤖 AI Summary
Existing open-vocabulary object detection methods struggle to generalize to thermal imaging scenarios characterized by sparse textures and distinct radiometric properties. This work proposes the first open-vocabulary object detection framework tailored for thermal imagery, leveraging synthetically generated million-scale thermal image–text pairs for training. The approach employs a frozen RGB teacher model to provide cross-modal pseudo-supervision, jointly optimizing detection, image captioning, and cross-modal distillation objectives. It introduces a novel thermal–text alignment head and a modality-fused cross-attention mechanism, enabling language-guided thermal detection and cross-modal knowledge transfer without manual annotations. Evaluated on public benchmarks, the method consistently improves average precision by 2–4% over current open-vocabulary detectors, establishing a foundation for scalable, language-driven thermal perception.
📝 Abstract
Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM)-supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent AP gains of 2–4% over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
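The caption-filtering step mentioned above can be illustrated with a minimal sketch. The abstract does not specify which terms are removed or how, so the term list `RGB_TERMS` and the function `strip_rgb_terms` below are hypothetical and only show one plausible way to drop color words that carry no meaning in thermal imagery.

```python
import re

# Hypothetical list of RGB-specific terms; the actual vocabulary used to
# build the thermal dataset is not given in the source.
RGB_TERMS = (
    "red", "green", "blue", "yellow", "orange",
    "purple", "pink", "brown", "colorful", "colored",
)

# Match whole words only, case-insensitively, so "Blue" is removed but
# "multicolored" or "redo" are left untouched.
_RGB_PATTERN = re.compile(
    r"\b(?:" + "|".join(RGB_TERMS) + r")\b", flags=re.IGNORECASE
)


def strip_rgb_terms(caption: str) -> str:
    """Remove RGB-specific words from a caption and tidy the whitespace."""
    cleaned = _RGB_PATTERN.sub("", caption)
    cleaned = re.sub(r"\s+", " ", cleaned)              # collapse repeated spaces
    cleaned = re.sub(r"\s+([,.;:!?])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip()


if __name__ == "__main__":
    print(strip_rgb_terms("A red car parked near a green fence."))
```

A plain word list like this is deliberately naive: it would also remove "orange" when it names a fruit, so a real pipeline would likely need context-aware filtering.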