🤖 AI Summary
This work addresses the limited generalization of existing vision-language models (VLMs) to thermal imagery and the absence of evaluation benchmarks tailored to the unique characteristics of thermal images. To this end, we introduce ThermEval-B, the first vision-language understanding benchmark specifically designed for thermal imaging, comprising approximately 55,000 thermal visual question answering pairs. We also present a new dataset, ThermEval-D, the first to provide pixel-level temperature maps together with semantic body-part annotations. Using this benchmark, we systematically evaluate 25 prominent VLMs, revealing significant deficiencies in temperature-grounded reasoning and a marked lack of robustness to colormap transformations. Our findings underscore the necessity of domain-specific evaluation frameworks for thermal imaging and challenge the prevailing RGB-centric assessment paradigm.
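To make the ThermEval-D description concrete, here is a minimal sketch of what a sample pairing a per-pixel temperature map with a body-part segmentation mask might look like, and how a part-level temperature statistic could be computed for temperature-grounded questions. The field names, resolution, and part taxonomy are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np

# Hypothetical ThermEval-D-style sample; all fields are assumptions
# for illustration, not the published schema.
H, W = 288, 384  # an assumed thermal-camera resolution
rng = np.random.default_rng(0)

sample = {
    # per-pixel temperature in degrees Celsius
    "temperature_c": (20.0 + 15.0 * rng.random((H, W))).astype(np.float32),
    # integer body-part labels, 0 = background
    "body_part_mask": rng.integers(0, 5, size=(H, W), dtype=np.uint8),
    "part_names": {1: "head", 2: "torso", 3: "arm", 4: "leg"},
}

def mean_part_temperature(sample, part_id):
    """Mean temperature (deg C) over all pixels labeled with part_id."""
    mask = sample["body_part_mask"] == part_id
    return float(sample["temperature_c"][mask].mean())

print(f"mean head temperature: {mean_part_temperature(sample, 1):.1f} C")
```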
📝 Abstract
Vision-language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision-language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision-language modeling.
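The colormap-robustness finding is straightforward to probe: render the same underlying temperature map under several palettes and ask a VLM the same question about each rendering. Below is a minimal sketch using matplotlib colormaps; the min-max normalization and palette choices are assumptions for illustration, not the benchmark's actual protocol.

```python
import numpy as np
from matplotlib import colormaps

def render_thermal(temperature_c, cmap_name):
    """Map a per-pixel temperature array (deg C) to an RGB uint8 image
    under a named matplotlib colormap, after min-max normalization.
    (Normalization choice is an assumption, not ThermEval's protocol.)"""
    t = np.asarray(temperature_c, dtype=np.float32)
    norm = (t - t.min()) / (t.max() - t.min() + 1e-8)
    rgba = colormaps[cmap_name](norm)  # (H, W, 4) floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)

# The same underlying temperatures rendered under three common palettes:
# a colormap-robust VLM should answer a fixed question identically on all three.
temps = 20.0 + 15.0 * np.random.default_rng(0).random((288, 384))
renders = {name: render_thermal(temps, name) for name in ("gray", "inferno", "jet")}
for name, img in renders.items():
    print(name, img.shape, img.dtype)  # (288, 384, 3) uint8
```

Because only the palette changes while the temperatures stay fixed, any answer that varies across the three renderings reflects sensitivity to color appearance rather than to the thermal signal itself.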