🤖 AI Summary
This study addresses the lack of standardized evaluation protocols, training procedures, and model weight release practices in geospatial foundation model (GFM) research, which leaves published results impossible to compare or rank. Through an audit of 152 GFM papers, we identify pervasive inconsistencies in evaluation setups, pretraining data configurations, and model sharing practices, which together amount to a coordination failure in the field rather than a fault of any individual lab. The audit finds 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94 of the 126 papers with extractable pretraining data use a configuration that no other paper uses; and 39% of GFM papers release no model weights. To remedy these issues, we propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls.
📝 Abstract
Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work on these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be remedied. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to advance GFMs.
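To make the "cross-paper disagreement" statistic concrete, the following is a minimal sketch of how such cases could be flagged, assuming each reported number has been extracted as a (paper, model, benchmark, protocol, score) record. The record layout, the function name `find_disagreements`, and all values in `REPORTS` are hypothetical illustrations; they are not the audit's actual data or procedure.

```python
from collections import defaultdict

# Hypothetical records: (paper, model, benchmark, protocol, score).
# These values are invented for illustration and are NOT from the audit.
REPORTS = [
    ("paper_A", "model_X", "bench_1", "linear_probe", 71.2),
    ("paper_B", "model_X", "bench_1", "linear_probe", 58.9),
    ("paper_C", "model_X", "bench_1", "fine_tune", 80.0),
    ("paper_A", "model_Y", "bench_1", "linear_probe", 65.0),
    ("paper_B", "model_Y", "bench_1", "linear_probe", 66.5),
]


def find_disagreements(reports, threshold=10.0):
    """Return {(model, benchmark, protocol): spread} for every key whose
    scores, as reported by at least two different papers, differ by at
    least `threshold` points."""
    grouped = defaultdict(dict)
    for paper, model, benchmark, protocol, score in reports:
        # If one paper reports the same key twice, the last value wins.
        grouped[(model, benchmark, protocol)][paper] = score

    flagged = {}
    for key, by_paper in grouped.items():
        if len(by_paper) < 2:
            continue  # a single paper cannot disagree with itself
        spread = max(by_paper.values()) - min(by_paper.values())
        if spread >= threshold:
            flagged[key] = round(spread, 2)
    return flagged


disagreements = find_disagreements(REPORTS)
print(disagreements)
```

In this toy input only `model_X` on `bench_1` under `linear_probe` is flagged: the fine-tuning result is a different protocol and so is not compared against the linear-probe numbers, which is the distinction the "same model, benchmark, and protocol" wording draws.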