🤖 AI Summary
Existing RGB-to-TIR image translation methods often disregard thermophysical principles, resulting in distorted thermal distributions and limited controllability at both the scene and object levels. To address this, this work proposes TherA, a novel framework that introduces, for the first time, a thermal-aware vision-language prompting mechanism. By integrating a thermal-aware vision-language model (TherA-VLM) with a latent diffusion model, TherA translates user-provided prompts into thermal-aware embeddings encoding scene context, object identity, material properties, and thermal radiation characteristics. These embeddings condition the diffusion process to enable fine-grained control over synthesis dimensions such as time of day, weather conditions, and object states. Evaluated in zero-shot translation settings, TherA achieves an average performance gain of 33% over existing approaches, establishing new state-of-the-art results in both thermodynamic plausibility and generation diversity.
📝 Abstract
Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches rely heavily on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object levels. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared with existing baselines, TherA achieves state-of-the-art results, improving zero-shot translation performance by up to 33% averaged across all metrics.
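The abstract describes conditioning a latent diffusion model on a prompt-derived thermal-aware embedding. The sketch below illustrates the general mechanism usually used for such conditioning, a cross-attention step in which latent image patches attend to a conditioning vector. All names here (`thermal_aware_embedding`, `cross_attention`, the dimensions) are hypothetical illustrations, not the paper's actual TherA-VLM or diffusion architecture.

```python
import numpy as np

def thermal_aware_embedding(prompt_tokens, vocab_size=32, embed_dim=16, seed=0):
    # Hypothetical stand-in for a VLM encoder: maps prompt tokens
    # (e.g. ids for "night", "rain", "engine on") to one pooled embedding.
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(vocab_size, embed_dim))
    return table[prompt_tokens].mean(axis=0)          # (embed_dim,)

def cross_attention(latent, cond, seed=1):
    # One cross-attention step: latent patches attend to the conditioning
    # embedding -- the standard way text-derived embeddings steer a
    # latent diffusion model's denoiser. Random projections for brevity.
    rng = np.random.default_rng(seed)
    d = latent.shape[-1]
    Wq = rng.normal(size=(d, d))
    Wk = rng.normal(size=(cond.shape[-1], d))
    Wv = rng.normal(size=(cond.shape[-1], d))
    q = latent @ Wq                                   # (n_patches, d)
    k = cond[None, :] @ Wk                            # (1, d)
    v = cond[None, :] @ Wv                            # (1, d)
    scores = q @ k.T / np.sqrt(d)                     # (n_patches, 1)
    gate = 1.0 / (1.0 + np.exp(-scores))              # single key -> sigmoid gate
    return latent + gate * v                          # conditioned latent, residual form

latents = np.zeros((4, 16))                           # 4 latent patches, dim 16
cond = thermal_aware_embedding([3, 7, 11])            # illustrative token ids
out = cross_attention(latents, cond)
print(out.shape)                                      # (4, 16)
```

In a real system the conditioning vector would come from the trained vision-language model and be injected at every cross-attention layer of the denoising U-Net, so changing the prompt (e.g. "night" vs. "noon") changes the synthesized heat distribution.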