🤖 AI Summary
This study addresses the design of efficient, general-purpose, and scalable embedding representations for Earth observation (EO) tasks by systematically evaluating key architectural and training choices when using Geospatial Foundation Models (GeoFMs) as feature extractors. These choices include backbone architecture, pretraining strategy, representation depth, spatial aggregation, and multi-objective fusion. Experiments on the NeuCo-Bench benchmark demonstrate that a Transformer backbone combined with mean pooling establishes a strong baseline; intermediate ResNet layers outperform final-layer features; self-supervised objectives offer task-specific advantages; and multi-objective fusion substantially enhances robustness. The resulting embeddings compress the original input by over 500× while maintaining high performance across diverse downstream EO tasks, underscoring the critical role of thoughtful embedding design in enabling scalable EO workflows.
📝 Abstract
Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, allowing representations to be computed once and reused across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect both downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500× smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
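The mean-pooling aggregation described above can be sketched in a few lines. This is a minimal illustration with hypothetical shapes (a ViT-style backbone emitting 196 patch tokens of dimension 768 for a 4-band 224×224 tile), not the paper's actual pipeline or models:

```python
import numpy as np

# Hypothetical example: stand-in for patch tokens produced by a
# transformer backbone — 196 tokens, each a 768-dim feature vector.
rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((196, 768)).astype(np.float32)

# Mean pooling: average over the spatial (token) axis to obtain a
# fixed-size, task-agnostic embedding independent of token count.
embedding = patch_tokens.mean(axis=0)    # shape: (768,)

# Compression relative to the raw pixel values of this tile
# (ratios depend entirely on the assumed shapes above).
raw_values = 4 * 224 * 224
compression = raw_values / embedding.size
print(embedding.shape, compression)
```

Once computed, such embeddings can be stored and reused across downstream tasks in place of the raw imagery, which is the workflow pattern the abstract describes.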