🤖 AI Summary
Existing visible (RGB) and infrared pre-trained models achieve strong performance on single-modal tasks but generalize poorly in multimodal collaborative perception (e.g., autonomous driving under adverse weather). To address this, we propose UNIV, a unified vision foundation model for joint infrared-visible understanding. The method introduces Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework inspired by the lateral inhibition of retinal horizontal cells, which aligns features across modalities and remains compatible with any Transformer-based architecture. A complementary dual-knowledge preservation mechanism, modeled on the signal routing of retinal bipolar cells, combines LoRA adapters (about 2% additional parameters) with synchronous distillation to mitigate catastrophic forgetting. We additionally construct MVIP, a large-scale, precisely aligned visible-infrared dataset of 98,992 image pairs. Experiments demonstrate gains of +1.7 mIoU on infrared semantic segmentation and +0.7 mAP on infrared object detection, while preserving over 99% of baseline performance on visible-light tasks.
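To make the patch-level cross-modal alignment idea concrete, here is a minimal, hypothetical sketch of a patch-wise contrastive (InfoNCE) loss between visible and infrared patch tokens. The function name, shapes, and the assumption that patches are spatially aligned are illustrative only; the paper's actual PCCL additionally uses attention-guided distillation, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def patchwise_contrastive_loss(vis_tokens, ir_tokens, temperature=0.07):
    """Illustrative patch-level InfoNCE loss across two modalities.

    vis_tokens, ir_tokens: (B, N, D) patch embeddings from the visible and
    infrared branches of a ViT-style encoder, assumed spatially aligned so
    that patch i in one modality corresponds to patch i in the other.
    """
    B, N, D = vis_tokens.shape
    v = F.normalize(vis_tokens.reshape(B * N, D), dim=-1)
    r = F.normalize(ir_tokens.reshape(B * N, D), dim=-1)

    # Similarity of every visible patch to every infrared patch.
    logits = v @ r.t() / temperature                  # (B*N, B*N)
    targets = torch.arange(B * N, device=logits.device)

    # Symmetric InfoNCE: aligned patches are positives, all others negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```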
📝 Abstract
The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing, combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
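The dual-knowledge preservation idea (frozen pre-trained weights plus a small trainable update) can be illustrated with a standard LoRA wrapper around a linear layer. This is a generic sketch of the LoRA technique, not the UNIV implementation; the class name, rank, and scaling are assumptions, and the synchronous-distillation part is not shown.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: a frozen pre-trained linear layer plus a
    low-rank update that is the only trainable part, which is how a small
    parameter budget (on the order of 2%) can adapt a backbone while the
    original visible-light knowledge stays untouched."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```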