🤖 AI Summary
Conventional models for predicting climate-driven land-surface dynamics suffer from poor spatial generalization and degraded performance in data-scarce regions, while existing vision foundation models incur prohibitive computational costs and lack explicit mechanisms for modeling spatiotemporal geophysical processes. Method: We propose StefaLand, a geoscience-oriented foundation model for land-surface modeling, integrating a masked autoencoder backbone, attribute-based representation learning, a location-aware architecture, and residual fine-tuning adapters to explicitly couple static geographic priors with multi-source temporal observations. Contribution/Results: Evaluated on three tasks across four benchmark datasets (streamflow, soil moisture, and soil composition), StefaLand achieves significant improvements over state-of-the-art methods, especially in data-scarce regions. Crucially, it can be pretrained and adapted to downstream tasks using only academic-scale compute, overcoming a key applicability bottleneck of generic vision models in land-surface dynamics modeling.
📝 Abstract
Stewarding natural resources, mitigating floods, droughts, wildfires, and landslides, and meeting growing demands all require models that predict climate-driven land-surface responses and human feedback with high accuracy. Traditional impact models, whether process-based, statistical, or machine learning, struggle with spatial generalization due to limited observations and concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute and are ill-suited to dynamic land-surface prediction. We introduce StefaLand, a generative spatiotemporal earth foundation model centered on landscape interactions. StefaLand improves predictions over prior state-of-the-art on three tasks across four datasets: streamflow, soil moisture, and soil composition. Results highlight its ability to generalize across diverse, data-scarce regions and to support broad land-surface applications. The model builds on a masked autoencoder backbone that learns deep joint representations of landscape attributes, with a location-aware architecture that fuses static and time-series inputs, attribute-based representations that drastically reduce compute, and residual fine-tuning adapters that enhance transfer. While its components are inspired by prior methods, their alignment with geoscience and their integration in one model enable robust performance on dynamic land-surface tasks. StefaLand can be pretrained and fine-tuned on academic compute yet outperforms state-of-the-art baselines and even fine-tuned vision foundation models. To our knowledge, this is the first geoscience land-surface foundation model that demonstrably improves dynamic land-surface interaction predictions and supports diverse downstream applications.
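To make the two architectural ideas above concrete, here is a minimal NumPy sketch of (a) fusing static landscape attributes with a time-series input at one location, and (b) a residual fine-tuning adapter added on top of a frozen pretrained path. All dimensions, weight names, and the low-rank adapter form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): d_attr static landscape
# attributes, d_ts time-series features per step, d_h hidden width.
d_attr, d_ts, d_h = 8, 4, 16

# Frozen stand-ins for pretrained encoder weights.
W_attr = rng.standard_normal((d_attr, d_h)) * 0.1  # static-attribute path
W_ts = rng.standard_normal((d_ts, d_h)) * 0.1      # time-series path

# Residual adapter: a small trainable low-rank correction added to the
# frozen path, so fine-tuning updates only A and B.
r = 2
A = rng.standard_normal((d_h, r)) * 0.01
B = np.zeros((r, d_h))  # zero-init: the adapter starts as a no-op residual

def encode(attrs, ts_step):
    """Location-aware fusion of static attributes with one time step,
    plus the residual adapter correction."""
    h = attrs @ W_attr + ts_step @ W_ts  # frozen pretrained fusion
    return h + h @ A @ B                 # residual adapter term

attrs = rng.standard_normal(d_attr)
ts_step = rng.standard_normal(d_ts)
h = encode(attrs, ts_step)
print(h.shape)  # (16,)
```

Because `B` is zero-initialized, the adapter initially leaves the pretrained representation unchanged; fine-tuning then learns a small task-specific residual without touching the frozen weights, which is what makes downstream adaptation cheap.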