🤖 AI Summary
Spectral imaging suffers from poor generalization and cross-device transferability of existing AI models due to substantial inter-camera variations in spectral channel count and wavelength response. To address this, we propose CARL, a camera-agnostic representation learning framework enabling, for the first time, unified modeling of RGB, multispectral, and hyperspectral images. Our key innovations are: (1) wavelength positional encoding, which explicitly incorporates spectral physical priors; (2) a query-based joint self- and cross-attention compression mechanism for efficient spectral-spatial information fusion; and (3) a JEPA-inspired spectral-spatial self-supervised pretraining paradigm. CARL achieves state-of-the-art performance on medical imaging, autonomous driving, and satellite remote sensing tasks. It demonstrates strong robustness under simulated and real-world cross-camera spectral variations and supports plug-and-play downstream adaptation. CARL establishes the first foundational spectral representation model with cross-modal and cross-camera generalization capability.
📄 Abstract
Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce $\textbf{CARL}$, a model for $\textbf{C}$amera-$\textbf{A}$gnostic $\textbf{R}$epresentation $\textbf{L}$earning across RGB, multispectral, and hyperspectral imaging modalities. To convert a spectral image with any channel dimensionality into a camera-agnostic embedding, we introduce wavelength positional encoding together with a self-attention-cross-attention mechanism that compresses spectral information into learned query representations. Spectral-spatial pre-training is achieved with a novel spectral self-supervised JEPA-inspired strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model's unique robustness to spectral heterogeneity, outperforming prior methods on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
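To make the core idea concrete, the following is a minimal PyTorch sketch of the two mechanisms named above: a sinusoidal positional encoding over per-channel center wavelengths (a physical prior that is independent of the camera's channel count), and a compressor that runs self-attention over channel tokens and then cross-attends a fixed set of learned queries to them, yielding an embedding of constant size regardless of how many spectral channels the camera has. All names, dimensions, and the specific sinusoidal formulation are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn


def wavelength_positional_encoding(wavelengths_nm: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of physical center wavelengths (in nm), one row per
    spectral channel. Hypothetical variant; the paper's formulation may differ."""
    # Standard transformer-style frequency schedule over `dim` dimensions.
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    angles = wavelengths_nm.float().unsqueeze(-1) * freqs  # (C, dim // 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (C, dim)


class SpectralCompressor(nn.Module):
    """Compress a variable number C of spectral-channel tokens into a fixed
    number Q of learned queries: self-attention over channels, then
    cross-attention from the queries to the channel tokens."""

    def __init__(self, dim: int = 64, num_queries: int = 8, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, channel_tokens: torch.Tensor) -> torch.Tensor:
        # channel_tokens: (B, C, dim), one token per spectral channel,
        # assumed to already include the wavelength positional encoding.
        x, _ = self.self_attn(channel_tokens, channel_tokens, channel_tokens)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, Q, dim)
        out, _ = self.cross_attn(q, x, x)  # (B, Q, dim): camera-agnostic output
        return out


# Usage: an RGB camera (C=3) and a hyperspectral camera (C=31) both map to
# the same fixed-size representation.
compressor = SpectralCompressor(dim=64, num_queries=8)
rgb_tokens = torch.randn(2, 3, 64) + wavelength_positional_encoding(
    torch.tensor([450.0, 550.0, 650.0]), 64
)
hsi_tokens = torch.randn(2, 31, 64) + wavelength_positional_encoding(
    torch.linspace(400.0, 700.0, 31), 64
)
rgb_out = compressor(rgb_tokens)  # shape (2, 8, 64)
hsi_out = compressor(hsi_tokens)  # shape (2, 8, 64)
```

Because the queries, not the channels, set the output size, downstream spatial layers see the same tensor shape for every camera, which is what makes plug-and-play cross-camera adaptation possible in this design.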