🤖 AI Summary
This work addresses the limited performance of existing vision-language models under low-light conditions, where thermal infrared (TIR) information remains underutilized. We propose Thermo-VL, a wavelength-aware multimodal fusion architecture that pairs a trainable TIR encoder with a text-guided dual-attention fusion module to add thermal perception to the frozen Molmo-7B model. This design enables prompt-conditioned multispectral reasoning while preserving the original RGB-language interface. To support this research, we introduce a pixel-aligned RGB-TIR instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-TIR VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging TIR-only and RGB+TIR reasoning tasks. The code and dataset are publicly released.
📝 Abstract
Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: https://thusharakart.github.io/Thermo-VL
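The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released Thermo-VL implementation: the function names (`cross_attend`, `fuse`), the single-head attention without learned projections, the additive combination of the two attention outputs, and the scalar `tanh` gate are all assumptions made for illustration. It shows only the general pattern: thermal tokens attend to the prompt and to the RGB tokens, and the result is added to the RGB stream through a gate, so that a closed gate leaves the RGB tokens unchanged.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    # Single-head scaled dot-product attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def fuse(rgb_tokens, thermal_tokens, prompt_emb, gate_alpha):
    # "Dual attention" here means two cross-attention passes (an assumption):
    # thermal tokens attend to the prompt, and thermal tokens attend to RGB.
    text_ctx = cross_attend(thermal_tokens, prompt_emb)
    rgb_ctx = cross_attend(thermal_tokens, rgb_tokens)
    conditioned = thermal_tokens + text_ctx + rgb_ctx

    # Hypothetical scalar gate: tanh(0) = 0, so a zero-initialized gate
    # passes the RGB tokens through untouched.
    gate = np.tanh(gate_alpha)
    return rgb_tokens + gate * conditioned


# Toy inputs: 6 pixel-aligned token positions, 3 prompt tokens, width 8.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(6, 8))
thermal = rng.normal(size=(6, 8))
prompt = rng.normal(size=(3, 8))

out_closed = fuse(rgb, thermal, prompt, gate_alpha=0.0)  # gate closed
out_open = fuse(rgb, thermal, prompt, gate_alpha=0.5)    # gate partly open
```

With the gate closed, the fused output equals the RGB tokens exactly, which is one simple way a residual design can avoid disturbing a frozen backbone at the start of training. Whether Thermo-VL initializes or parameterizes its gate this way is not stated in the abstract.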