🤖 AI Summary
Thermal infrared (TIR) images lack color and texture, which causes visual fatigue and hinders downstream tasks. Existing single-band colorization methods have too little spectral information to work with, often producing color distortion and semantic ambiguity. To address this, we propose the first cascaded Transformer-based colorization framework specifically designed for multi-band TIR imagery. Our method introduces a novel Spatial-Spectral Attention Residual Block (SARB) and a dual-domain (spatial-frequency) wavelet alignment mechanism to jointly enforce spectral consistency and fine-detail fidelity. By integrating multi-head self-attention, a U-shaped Transformer (STformer), multi-scale wavelet blocks (MSWB), and spectral tokenization, the framework achieves significant improvements on multi-band infrared datasets: +3.2 dB in PSNR and +0.11 in SSIM, indicating enhanced visual realism and semantic accuracy.
📝 Abstract
Thermal infrared (TIR) images, acquired through thermal radiation imaging, are unaffected by variations in lighting conditions and atmospheric haze. However, TIR images inherently lack color and texture information, which limits downstream tasks and can cause visual fatigue. Existing colorization methods primarily rely on single-band images with limited spectral information and insufficient feature extraction capabilities, which often results in image distortion and semantic ambiguity. In contrast, multi-band infrared imagery provides richer spectral data, facilitating the preservation of finer details and enhancing semantic accuracy. In this paper, we propose a generative adversarial network (GAN)-based framework that integrates spectral information to enhance the colorization of infrared images. The framework employs a multi-stage spectral self-attention Transformer network (MTSIC) as the generator. Each spectral feature is treated as a token for self-attention computation, and a multi-head self-attention mechanism forms a spatial-spectral attention residual block (SARB), achieving multi-band feature mapping and reducing semantic confusion. Multiple SARB units are integrated into a Transformer-based single-stage network (STformer), which uses a U-shaped architecture to extract contextual information and combines it with multi-scale wavelet blocks (MSWB) to align semantic information in the spatial-frequency dual domain. Multiple STformer modules are cascaded to form MTSIC, progressively refining the reconstruction quality. Experimental results demonstrate that the proposed method significantly outperforms traditional techniques and effectively enhances the visual quality of infrared images.
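
The idea of treating each spectral feature as a token can be illustrated with a minimal, dependency-free sketch. This is not the authors' SARB implementation: it is a single-head toy version, and the function name `spectral_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the `1/sqrt(N)` scaling, and the plain additive residual are all assumptions made for illustration. The point it shows is that attention is computed between bands (a C x C map) rather than between pixels.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spectral_self_attention(pixels, w_q, w_k, w_v):
    """Single-head attention across spectral bands (hypothetical sketch).

    pixels: N x C matrix (N flattened pixels, C spectral bands).
    w_q, w_k, w_v: C x C projection matrices (assumed, normally learned).
    Returns (output N x C with a residual connection, attention map C x C).
    """
    # After projection, transpose so that each row is one band token of length N.
    q = transpose(matmul(pixels, w_q))  # C x N
    k = transpose(matmul(pixels, w_k))  # C x N
    v = transpose(matmul(pixels, w_v))  # C x N

    # Band-to-band similarity; the 1/sqrt(N) scale is an assumption.
    scale = 1.0 / math.sqrt(len(pixels))
    scores = matmul(q, transpose(k))  # C x C
    attention = [softmax([s * scale for s in row]) for row in scores]

    # Each output band is a weighted mix of all value bands.
    mixed = transpose(matmul(attention, v))  # back to N x C

    # Residual connection: add the input to the attended features.
    output = [[p + m for p, m in zip(p_row, m_row)]
              for p_row, m_row in zip(pixels, mixed)]
    return output, attention


# Toy input: 2 pixels, 3 identical bands, identity projections.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_pixels = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
toy_output, toy_attention = spectral_self_attention(
    toy_pixels, identity, identity, identity
)
```

Because the attention map has size C x C, its cost grows with the number of bands rather than the number of pixels, which is one reason a band-as-token design is practical for full-resolution images. A multi-head variant would split the bands into groups and run the same computation per group.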