FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing

📅 2025-01-14
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
To address the challenge of simultaneously mitigating visual representation degradation and achieving effective cross-modal alignment in remote sensing multimodal pretraining, this paper proposes FLAVARS, a unified framework that integrates contrastive image-text learning, masked autoencoding (MAE), and contrastive geospatial location encoding. Trained on satellite image-text pairs, the method jointly optimizes visual reconstruction fidelity, cross-modal semantic alignment, and geolocation awareness, thereby supporting both strong vision-only downstream performance and zero-shot transfer. It achieves a +6% mIoU gain on SpaceNet1 semantic segmentation and significantly outperforms SkyCLIP on k-nearest-neighbor classification. Crucially, it addresses a long-standing limitation of MAE-based models, namely poor cross-modal generalization, by retaining competitive zero-shot classification accuracy while substantially improving visual representation quality.
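The summary above describes three jointly optimized objectives. As a rough, non-authoritative sketch of how such a combined loss can be wired together (the paper's actual architecture, loss weights, and encoder details may differ; all function and parameter names below are illustrative), consider the following PyTorch fragment:

```python
# Illustrative sketch: combining a CLIP-style contrastive loss, an MAE
# reconstruction loss, and a location-contrastive loss in one objective.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between L2-normalized embedding batches."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def mae_loss(pred_patches, true_patches, mask):
    """Mean squared error computed on masked patches only, as in MAE."""
    per_patch = ((pred_patches - true_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()

def combined_pretraining_loss(img_emb, txt_emb, loc_emb,
                              pred_patches, true_patches, mask,
                              w_clip=1.0, w_mae=1.0, w_loc=1.0):
    """Weighted sum of the three objectives; the weights are illustrative."""
    return (w_clip * clip_loss(img_emb, txt_emb)
            + w_mae * mae_loss(pred_patches, true_patches, mask)
            + w_loc * clip_loss(img_emb, loc_emb))  # image-location alignment
```

In this framing, the geolocation term simply reuses the same InfoNCE loss, contrasting each image embedding against an embedding of its capture coordinates, which matches the abstract's phrase "geospatial alignment via contrastive location encoding."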

📝 Abstract
Remote sensing imagery is dense with objects and contextual visual information. A recent trend combines paired satellite images and text captions to pretrain performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification, vision-only downstream performance tends to degrade compared to image-only pretraining such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a SkyCLIP baseline on vision-only tasks such as KNN classification and semantic segmentation (+6% mIoU on SpaceNet1), while retaining the ability to perform zero-shot classification, unlike MAE-pretrained methods.
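Since the abstract emphasizes retained zero-shot ability, it may help to recall the standard CLIP-style zero-shot protocol such a model would be evaluated with. The sketch below is generic, not the authors' evaluation code; the prompt template, encoder interfaces, and tokenizer are assumptions:

```python
# Generic CLIP-style zero-shot classification sketch (illustrative only).
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(img_emb, txt_encoder, tokenizer, class_names):
    """Assign each image embedding to the class whose text-prompt
    embedding is most similar under cosine similarity."""
    prompts = [f"a satellite image of {c}" for c in class_names]
    txt_emb = F.normalize(txt_encoder(tokenizer(prompts)), dim=-1)  # (C, D)
    img_emb = F.normalize(img_emb, dim=-1)                          # (B, D)
    return (img_emb @ txt_emb.t()).argmax(dim=-1)                   # (B,) class ids
```

MAE-only models have no text tower to produce `txt_emb`, which is why the abstract contrasts FLAVARS' zero-shot ability with MAE-pretrained methods.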
Problem

Research questions and friction points this paper is trying to address.

Multimodal Fusion
Image Recognition
Zero-shot Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Learning
Image Masked Prediction
Geolocation Integration (see the location-encoding sketch after this list)
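The "Geolocation Integration" tag refers to the contrastive location encoding named in the abstract. A common way to build the location branch for such a scheme is a multi-scale sinusoidal embedding of coordinates followed by a small MLP; the sketch below is one plausible design under that assumption, not FLAVARS' published encoder:

```python
# Illustrative location encoder: multi-scale sinusoidal features of
# (lat, lon) followed by an MLP; frequencies and dims are assumptions.
import math
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """Maps (lat, lon) in degrees to an embedding comparable with
    image embeddings (a common design; the paper's encoder may differ)."""
    def __init__(self, num_freqs=16, dim=512):
        super().__init__()
        # Geometric ladder of frequencies: 1, 2, 4, ..., 2^(num_freqs-1).
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, latlon_deg):
        rad = latlon_deg * math.pi / 180.0                        # (B, 2)
        scaled = rad.unsqueeze(-1) * self.freqs                   # (B, 2, F)
        feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1)   # (B, 2, 2F)
        return self.mlp(feats.flatten(1))                         # (B, dim)
```

The resulting `(batch, dim)` embeddings can then be fed into the same InfoNCE loss shown earlier, pairing each image with its capture location.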