Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation

📅 2024-09-25
🏛️ Technologies
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses high-fidelity spectral translation from RGB to near-infrared (NIR) images, formulated as spectral-aware semantic reconstruction rather than conventional domain translation. We introduce the first application of vision foundation models (VFMs) to RGB-NIR translation and propose a cross-attention-enhanced encoder-decoder architecture that jointly enforces global semantic consistency and local spectral fidelity. To further improve photorealism and structural accuracy, we integrate a multi-scale PatchGAN discriminator with a composite loss function combining global contextual constraints and local feature alignment. Quantitative evaluation on the RANUS and IDD-AW benchmarks demonstrates significant FID improvements over state-of-the-art methods. Moreover, the synthesized NIR images substantially enhance downstream object detection performance, enabling NIR data augmentation without additional hardware or acquisition overhead.

📝 Abstract
This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our method leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder–decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. Experiments on the RANUS and IDD-AW datasets demonstrate Pix2Next's advantages in quantitative metrics and visual quality, substantially improving the FID score over existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed method enables the scaling up of NIR datasets without additional data acquisition or annotation effort, potentially accelerating advancements in NIR-based computer vision applications.
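The cross-attention fusion the abstract describes can be illustrated with a minimal sketch: decoder tokens act as queries against the VFM's feature tokens, so each decoder location pulls in globally relevant context. This single-head NumPy version is a hedged illustration of the mechanism, not the paper's actual implementation (the token counts, dimensions, and function name here are hypothetical):

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative sketch).

    queries: (N, d) decoder-path tokens
    keys, values: (M, d) tokens from the vision foundation model
    returns: (N, d) decoder tokens enriched with VFM context
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (N, M) token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over VFM tokens
    return weights @ values                         # weighted mix of VFM features

# Hypothetical shapes: 4 decoder tokens attend over 6 VFM tokens, dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
kv = rng.standard_normal((6, 8))
fused = cross_attention(q, kv, kv)
print(fused.shape)  # (4, 8)
```

In the full architecture this fusion would sit inside an encoder–decoder, typically with multiple heads, residual connections, and normalization around it.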
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality NIR images from RGB inputs
Enhancing feature integration with cross-attention mechanisms
Improving downstream tasks with augmented NIR datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Vision Foundation Model for RGB-NIR translation
Uses cross-attention for enhanced feature integration
Multi-scale PatchGAN ensures realistic image generation
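The multi-scale PatchGAN idea from the bullets above can be sketched as scoring patches of an image pyramid: coarse scales judge global layout while fine scales judge texture. The toy "discriminator" below is just a patch-mean stand-in for the small convolutional network a real PatchGAN would use; function names and parameters are illustrative assumptions, not the paper's network:

```python
import numpy as np

def patch_scores(img, patch=16):
    """Toy per-patch score map (stand-in for a conv PatchGAN head).

    Averages each non-overlapping patch to show the shape of the
    patch-level real/fake output map, (H//patch, W//patch).
    """
    h, w = img.shape[:2]
    ph, pw = h // patch, w // patch
    cropped = img[:ph * patch, :pw * patch]
    return cropped.reshape(ph, patch, pw, patch, -1).mean(axis=(1, 3, 4))

def multiscale_scores(img, scales=(1, 2, 4), patch=16):
    """Score patch maps on an image pyramid, as a multi-scale
    discriminator would: one map per resolution."""
    maps = []
    for s in scales:
        down = img[::s, ::s]  # naive stride-s downsampling for illustration
        maps.append(patch_scores(down, patch))
    return maps

img = np.random.default_rng(1).random((128, 128, 3))
for m in multiscale_scores(img):
    print(m.shape)  # (8, 8) then (4, 4) then (2, 2)
```

Training would sum an adversarial loss over all three maps, which is what lets the generator be penalized for both global and local artifacts.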
Youngwan Jin
School of Integrated Technology, Yonsei University, Incheon, 21983, Republic of Korea.
Incheol Park
School of Integrated Technology, Yonsei University, Incheon, 21983, Republic of Korea.
Hanbin Song
School of Integrated Technology, Yonsei University, Incheon, 21983, Republic of Korea.
Hyeongjin Ju
School of Integrated Technology, Yonsei University, Incheon, 21983, Republic of Korea.
Yagiz Nalcakan
School of Integrated Technology, Yonsei University, Incheon, 21983, Republic of Korea.
Shiho Kim
School of Integrated Technology, Yonsei University
Intelligent semiconductors · Intelligent Vehicles · Artificial Intelligence · QML