UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

📅 2024-04-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RGB-IR semantic analysis methods suffer from weak generalizability and poor scalability, primarily because no large-scale pre-trained foundation models exist for the infrared (IR) modality; they therefore rely on task-specific architectures and full cross-modal fine-tuning. To address this, the authors propose a lightweight adaptation framework for unified visible-infrared semantic analysis. The approach introduces a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module, enabling efficient IR contextual modeling and cross-modal knowledge transfer while keeping the ViT-Base backbone frozen. With only a small number of trainable parameters, the method improves robustness under low-light and adverse weather conditions and achieves state-of-the-art performance across multiple RGB-IR tasks, including semantic segmentation and object detection, establishing a scalable and generalizable paradigm for multimodal visual understanding.

📝 Abstract
Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention for its enhanced accuracy and robustness under challenging conditions such as low illumination and adverse weather. However, because no foundation models have been pre-trained on large-scale infrared image datasets, existing methods tend to design task-specific frameworks and directly fine-tune them from RGB pre-trained foundation models on their RGB-IR task datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate as an adapter to complement the ViT features with contextual multi-scale features. During training, we freeze the entire foundation model to inherit its prior knowledge and optimize only the MFP and SFI modules. To verify the effectiveness of our framework, we use ViT-Base as the pre-trained foundation model and conduct extensive experiments. Results on various RGB-IR semantic tasks demonstrate that our method achieves state-of-the-art performance. The source code and results are available at https://github.com/PoTsui99/UniRGB-IR.git.
Problem

Research questions and friction points this paper is trying to address.

Lack of pre-trained models for infrared image analysis
Poor scalability in existing RGB-IR semantic frameworks
Limited generalization in current RGB-IR task methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapter mechanism for RGB-IR feature fusion
Frozen ViT with optimized MFP and SFI modules
Multi-modal Feature Pool for contextual enhancement
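The core training recipe named above (a frozen pre-trained backbone with only small side modules optimized) can be sketched as follows. This is a minimal illustration of the general adapter-tuning pattern, not the paper's actual MFP/SFI implementation; the `ToyBackbone` and `Adapter` names, dimensions, and bottleneck design are assumptions for the example.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative bottleneck adapter: down-project, non-linearity, up-project,
    injected back into the frozen features via a residual connection."""
    def __init__(self, dim: int, hidden: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for a pre-trained foundation model (e.g. a ViT encoder).
backbone = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
adapter = Adapter(32)

# Freeze the backbone so it only contributes prior knowledge.
for p in backbone.parameters():
    p.requires_grad = False

# Only the adapter's parameters are handed to the optimizer.
trainable = [p for p in adapter.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(4, 32)
out = adapter(backbone(x))
print(out.shape)  # torch.Size([4, 32])
```

In this pattern the optimizer never sees the backbone's weights, so the trainable parameter count stays small while the frozen features are supplemented by the adapter's residual output.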