🤖 AI Summary
Independently verifying the physical credibility of machine learning–driven Earth system models (ML-ESMs) is a critical challenge, because these models must represent future coupled regimes for which no historical observations exist.
Method: We propose the first systematic five-dimensional evaluation framework—assessing physical consistency, counterfactual robustness, multi-scale interpretability, cross-task generalizability, and independent third-party validation—integrating physics-constrained diagnostics, counterfactual sensitivity analysis, multi-source observational synergy, eXplainable AI (XAI), and standardized benchmarking protocols.
Contribution/Results: The paper puts forward five recommendations for comprehensive, standardized, and independent evaluation of ML-ESMs, intended to strengthen their credibility, promote their wider use, and move evaluation beyond conventional weather-forecasting paradigms.
📝 Abstract
Machine learning (ML) is a revolutionary technology with demonstrable applications across multiple disciplines. Within the Earth science community, ML has been most visible in weather forecasting, producing forecasts that rival those of modern physics-based models. Given the importance of deepening our understanding and improving predictions of the Earth system on all time scales, efforts are now underway to develop forecasting models into Earth-system models (ESMs), capable of representing all components of the coupled Earth system (or their aggregated behavior) and their response to external changes. Modeling the Earth system is a much more difficult problem than weather forecasting, not least because the model must represent alternate (e.g., future) coupled states of the system for which there are no historical observations. Because the physical principles that enable predictions about the response of the Earth system are often not explicitly coded in these ML-based models, demonstrating the credibility of ML-based ESMs requires us to build evidence of their consistency with the physical system. To this end, this paper puts forward five recommendations to enhance comprehensive, standardized, and independent evaluation of ML-based ESMs to strengthen their credibility and promote their wider use.