FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing remote sensing foundation models struggle to accommodate the multi-source, multi-scale, and multimodal observations typical of ecological applications. To address this limitation, this work proposes FLORO, a multimodal geospatial foundation model pretrained with masked autoencoding on a small yet highly heterogeneous dataset that integrates Sentinel-1, Sentinel-2, SkySAT, and UAV imagery with elevation data. Availability-aware inputs flag which spectral bands and auxiliary modalities are present in each sample, giving diverse sensor configurations a unified input space. Evaluated on the PANGAEA benchmark under a frozen-encoder protocol, FLORO obtains the second-best average segmentation performance across six benchmarks, remains competitive on scene classification, and is robust in regression, while qualitative results show better preservation of spatial structure. In a separate experiment on EuroSAT-MS, geo-positional encoding improves classification relative to absolute positional encoding.
📝 Abstract
Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations. This limits their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images. It remained competitive on scene classification and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
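The availability-aware inputs described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the channel layout `UNIFIED_CHANNELS`, the helper `build_availability_aware_input`, and the zero-fill choice for missing channels are all assumptions introduced here. The idea is that every sample is mapped onto the same fixed set of channel slots, and a binary vector records which slots hold real data.

```python
from typing import Dict, List, Tuple

# Hypothetical unified channel layout (an assumption for illustration;
# the channel set and ordering used by FLORO are not given in this text).
UNIFIED_CHANNELS: List[str] = [
    "blue", "green", "red", "nir", "sar_vv", "sar_vh", "elevation",
]

Image = List[List[float]]  # one channel as H x W nested lists


def build_availability_aware_input(
    sample: Dict[str, Image], height: int, width: int
) -> Tuple[List[Image], List[int]]:
    """Place the channels a sample has into fixed slots and zero-fill the rest.

    Returns (channels, availability), where availability[i] is 1 if slot i
    holds real data and 0 if it was zero-filled (zero-fill is an assumption).
    """
    channels: List[Image] = []
    availability: List[int] = []
    for name in UNIFIED_CHANNELS:
        if name in sample:
            channels.append(sample[name])
            availability.append(1)
        else:
            channels.append([[0.0] * width for _ in range(height)])
            availability.append(0)
    return channels, availability


# A UAV-like RGB sample with no NIR, SAR, or elevation channels.
rgb_sample = {c: [[0.5, 0.5], [0.5, 0.5]] for c in ("blue", "green", "red")}
channels, availability = build_availability_aware_input(rgb_sample, 2, 2)
print(availability)  # [1, 1, 1, 0, 0, 0, 0]
```

With a layout like this, an RGB-only UAV tile and a Sentinel-1/2 tile with elevation share one input shape, and the availability vector lets a model tell a missing channel apart from a channel whose values happen to be zero.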
Problem

Research questions and friction points this paper is trying to address.

foundation model
remote sensing
multimodal
ecological applications
sensor variability
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation model
availability-aware input
heterogeneous remote sensing
masked autoencoding
cross-sensor transfer
Jorge L. Rodriguez
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Victor Angulo Morales
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Areej Alwahas
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Mariana Elias Lara
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Fida Mohammad Thoker
University of Amsterdam
Computer Vision, Action Recognition, Deep Learning
Kasper Johansen
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Bernard Ghanem
Professor, King Abdullah University of Science and Technology
computer vision, machine learning
Fernando T. Maestre
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Matthew F. McCabe
Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia