🤖 AI Summary
Existing visible (RGB) and infrared pre-trained models achieve strong performance on single-modal tasks but generalize poorly in multimodal collaborative perception (e.g., autonomous driving under adverse weather). To address this, we propose UNIV, a unified vision foundation model for joint infrared-visible understanding. The method introduces Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework inspired by the lateral inhibition of retinal horizontal cells, which aligns features across modalities and remains compatible with any Transformer-based architecture. A complementary dual-knowledge preservation mechanism, modeled on the signal routing of retinal bipolar cells, combines LoRA adapters (about 2% additional parameters) with synchronous distillation to mitigate catastrophic forgetting. We additionally construct MVIP, a large-scale, precisely aligned visible-infrared dataset of 98,992 image pairs. Experiments demonstrate gains of +1.7 mIoU on infrared semantic segmentation and +0.7 mAP on infrared object detection, while preserving over 99% of baseline performance on visible-light tasks.
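To make the patch-level cross-modal alignment idea concrete, here is a minimal, hypothetical sketch of a patch-wise contrastive (InfoNCE) loss between visible and infrared patch tokens. The function name, shapes, and the assumption that patches are spatially aligned are illustrative only; the paper's actual PCCL additionally uses attention-guided distillation, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def patchwise_contrastive_loss(vis_tokens, ir_tokens, temperature=0.07):
    """Illustrative patch-level InfoNCE loss across two modalities.

    vis_tokens, ir_tokens: (B, N, D) patch embeddings from the visible and
    infrared branches of a ViT-style encoder, assumed spatially aligned so
    that patch i in one modality corresponds to patch i in the other.
    """
    B, N, D = vis_tokens.shape
    v = F.normalize(vis_tokens.reshape(B * N, D), dim=-1)
    r = F.normalize(ir_tokens.reshape(B * N, D), dim=-1)

    # Similarity of every visible patch to every infrared patch.
    logits = v @ r.t() / temperature                  # (B*N, B*N)
    targets = torch.arange(B * N, device=logits.device)

    # Symmetric InfoNCE: aligned patches are positives, all others negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```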
📝 Abstract
The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing, combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
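The dual-knowledge preservation idea (frozen pre-trained weights plus a small trainable update) can be illustrated with a standard LoRA wrapper around a linear layer. This is a generic sketch of the LoRA technique, not the UNIV implementation; the class name, rank, and scaling are assumptions, and the synchronous-distillation part is not shown.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: a frozen pre-trained linear layer plus a
    low-rank update that is the only trainable part, which is how a small
    parameter budget (on the order of 2%) can adapt a backbone while the
    original visible-light knowledge stays untouched."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():    # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```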