🤖 AI Summary
To address efficiency and accuracy bottlenecks in remote sensing content-based image retrieval (RS-CBIR) caused by the explosive growth and multi-source heterogeneity of remote sensing imagery, this paper proposes REJEPA, a self-supervised joint-embedding predictive architecture for unimodal retrieval. REJEPA abandons pixel-level reconstruction and instead predicts target-token embeddings directly in feature space from spatially distributed context tokens; it also incorporates VICReg regularization to prevent encoder collapse. This feature-space prediction paradigm yields sensor-agnostic representations, high retrieval accuracy, and low computational overhead, cutting computational complexity by 40–60% compared with MAE. On the BEN-14K (S1, S2), FMoW-RGB, and FMoW-Sentinel benchmarks, REJEPA outperforms state-of-the-art methods such as CSMAE-SESD by 5.1–10.1% in retrieval accuracy, generalizing across sensor modalities and remaining robust in complex scenes.
📝 Abstract
The rapid expansion of remote sensing image archives demands robust and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), a self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to predict abstract representations of target tokens, capturing high-level semantic features while discarding unnecessary pixel-level detail. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA operates in feature space, reducing computational complexity by 40-60% compared with pixel-reconstruction baselines such as Masked Autoencoders (MAE). To ensure robust and diverse representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. On the extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB, and FMoW-Sentinel, the method improves retrieval accuracy by an estimated 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared with prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++. By generalising effectively across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges such as varying resolutions, high object density, and complex backgrounds.
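To make the collapse-prevention idea concrete, the following is a minimal NumPy sketch of a VICReg-style loss applied to predicted and target embeddings. It is an illustration under stated assumptions, not REJEPA's actual implementation: the function name `vicreg_loss`, the weights (25, 25, 1, the defaults from the original VICReg formulation), and the batch shapes are all hypothetical, and the abstract does not state how REJEPA weights or applies these terms.

```python
import numpy as np

def vicreg_loss(z_pred, z_tgt, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss on two (batch, dim) embedding matrices.

    Assumption: weights follow the original VICReg defaults, not
    values confirmed for REJEPA.
    """
    n, d = z_pred.shape

    # Invariance: mean squared error between predicted and target embeddings.
    invariance = np.mean((z_pred - z_tgt) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalises any
        # dimension whose spread over the batch falls below 1.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by dim:
        # penalises redundancy between embedding dimensions.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_pred) + variance_term(z_tgt)
    covariance = covariance_term(z_pred) + covariance_term(z_tgt)
    return inv_w * invariance + var_w * variance + cov_w * covariance

# Usage: a collapsed encoder (identical embedding for every sample) is
# penalised far more heavily than a diverse one.
rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 16))
collapsed = np.ones((256, 16))
loss_healthy = vicreg_loss(healthy, healthy + 0.1 * rng.normal(size=(256, 16)))
loss_collapsed = vicreg_loss(collapsed, collapsed)
```

In this sketch the invariance term alone would be minimised by a constant output, which is exactly the collapse the abstract refers to; the variance hinge makes that solution expensive, and the covariance term discourages dimensions from carrying duplicate information.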