🤖 AI Summary
This work addresses high-fidelity spectral translation from RGB to near-infrared (NIR) images—formulated as spectral-aware semantic reconstruction rather than conventional domain translation. We introduce the first application of vision foundation models (VFMs) to RGB-NIR translation and propose a cross-attention-enhanced encoder-decoder architecture that jointly enforces global semantic consistency and local spectral fidelity. To further improve photorealism and structural accuracy, we integrate a multi-scale PatchGAN discriminator with a composite loss function combining global contextual constraints and local feature alignment. Quantitative evaluation on the RANUS and IDD-AW benchmarks demonstrates significant FID improvements over state-of-the-art methods. Moreover, the synthesized NIR images substantially enhance downstream object detection performance, enabling NIR data augmentation without additional hardware or acquisition cost.
📝 Abstract
This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our method leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder–decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We conducted experiments on the RANUS and IDD-AW datasets to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, substantially improving the FID score over existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed method enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.
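The cross-attention fusion described above can be sketched as scaled dot-product attention in which decoder features act as queries and VFM features supply keys and values, with a residual connection back into the decoder stream. The sketch below is illustrative only: the token counts, feature dimensions, and random (untrained) projection weights are assumptions for the example, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(decoder_feats, vfm_feats, d_k=64, seed=0):
    """Fuse decoder tokens (queries) with VFM tokens (keys/values) via
    scaled dot-product cross-attention; weights here are random stand-ins
    for learned projections."""
    rng = np.random.default_rng(seed)
    d_dec = decoder_feats.shape[-1]
    d_vfm = vfm_feats.shape[-1]
    W_q = rng.standard_normal((d_dec, d_k)) / np.sqrt(d_dec)
    W_k = rng.standard_normal((d_vfm, d_k)) / np.sqrt(d_vfm)
    W_v = rng.standard_normal((d_vfm, d_dec)) / np.sqrt(d_vfm)
    Q = decoder_feats @ W_q                      # (N_dec, d_k)
    K = vfm_feats @ W_k                          # (N_vfm, d_k)
    V = vfm_feats @ W_v                          # (N_vfm, d_dec)
    attn = softmax(Q @ K.T / np.sqrt(d_k))       # each row sums to 1
    return decoder_feats + attn @ V              # residual fusion, shape preserved

# Toy shapes: 256 decoder tokens (e.g. a 16x16 grid, dim 128)
# attending over 196 VFM tokens (e.g. a 14x14 ViT grid, dim 768).
dec = np.random.default_rng(1).standard_normal((256, 128))
vfm = np.random.default_rng(2).standard_normal((196, 768))
out = cross_attention_fuse(dec, vfm)
print(out.shape)  # (256, 128)
```

The residual form keeps the decoder's spatial layout intact while letting each decoder location draw semantic context from every VFM token, which is how cross-attention lets global VFM representations inform local spectral reconstruction.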