The impact of internal variability on benchmarking deep learning climate emulators

📅 2024-08-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Internal climate variability systematically biases benchmark evaluations of AI-based climate emulators, particularly in low-sample-size settings (e.g., 3-member ClimateBench), where high internal variability causes simple linear models to outperform sophisticated large models (e.g., ClimaX) on nonlinear variables such as precipitation. Method: We propose a target reconstruction paradigm grounded in multi-member ensemble averaging—using a 50-member MPI-ESM simulation—to shift evaluation emphasis from single-trajectory fidelity to statistical reproducibility. Contribution/Results: Under this revised benchmark, ClimaX significantly surpasses linear baselines in precipitation simulation, confirming its superior nonlinear modeling capacity; temperature prediction remains dominated by linear methods. This work provides the first quantitative assessment of internal variability’s biasing effect on emulator evaluation, establishes a new standard for climate AI benchmarking, and publicly releases all code, data, and interactive tutorials.
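The target-reconstruction idea can be sketched in a few lines: average many ensemble members of the same emission pathway so the benchmark target reflects the forced response rather than one noisy trajectory. The shapes and signal below are illustrative stand-ins, not the paper's actual MPI-ESM data.

```python
import numpy as np

# Illustrative stand-in for a (members, time, lat, lon) field from one pathway.
rng = np.random.default_rng(0)
forced_signal = np.sin(np.linspace(0, 2 * np.pi, 120))[None, :, None, None]
noise = rng.normal(0.0, 1.0, size=(50, 120, 4, 8))  # internal variability
ensemble = forced_signal + noise                     # 50-member "simulation"

# Original-style target: a small (3-member) ensemble mean.
target_3 = ensemble[:3].mean(axis=0)
# Revised target: the 50-member ensemble mean, suppressing internal variability.
target_50 = ensemble.mean(axis=0)

# Residual internal variability in each target (RMS around the forced signal);
# averaging N members shrinks the noise roughly as 1/sqrt(N).
rms_3 = np.sqrt(((target_3 - forced_signal[0]) ** 2).mean())
rms_50 = np.sqrt(((target_50 - forced_signal[0]) ** 2).mean())
```

With 50 members the residual noise is roughly four times smaller than with 3, which is why the revised targets can separate genuine nonlinear skill from luck.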

📝 Abstract
Full-complexity Earth system models (ESMs) are computationally very expensive, limiting their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and datasets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We implement a linear regression-based emulator, akin to pattern scaling, and find that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally-resolved surface-level climate variables. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. We identify that this outcome is a result of high levels of internal variability in the benchmark targets. To address internal variability, we update the benchmark targets with ensemble averages from the MPI-ESM1.2-LR model that contain 50 instead of 3 climate simulations per emission pathway. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based model for emulating precipitation. We publish our code, data, and an interactive tutorial at github.com/blutjens/climate-emulator.
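The linear baseline described in the abstract is "akin to pattern scaling": fit an independent linear regression per grid cell against a single global predictor, then scale the fitted pattern for new scenarios. The sketch below uses synthetic data and a hypothetical global-mean predictor; it illustrates the technique, not the paper's exact implementation.

```python
import numpy as np

# Synthetic training data: each grid cell responds linearly (plus noise) to a
# single global predictor, e.g. global-mean warming or cumulative emissions.
rng = np.random.default_rng(1)
n_years, n_lat, n_lon = 100, 4, 8
global_predictor = np.linspace(0.0, 2.0, n_years)         # e.g. warming in K
true_pattern = rng.normal(1.0, 0.3, size=(n_lat, n_lon))  # per-cell scaling
y = (global_predictor[:, None, None] * true_pattern
     + rng.normal(0.0, 0.1, size=(n_years, n_lat, n_lon)))

# Fit slope and intercept per grid cell with one least-squares solve.
X = np.stack([global_predictor, np.ones(n_years)], axis=1)    # (years, 2)
coef, *_ = np.linalg.lstsq(X, y.reshape(n_years, -1), rcond=None)
slopes = coef[0].reshape(n_lat, n_lon)
intercepts = coef[1].reshape(n_lat, n_lon)

# Emulate a new scenario by scaling the fitted pattern, e.g. at +1.5 K.
emulated = 1.5 * slopes + intercepts
```

Pattern scaling has essentially no capacity to overfit noise, which is part of why it looks deceptively strong against noisy 3-member targets.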
Problem

Research questions and friction points this paper is trying to address.

Evaluating deep learning emulators' accuracy in climate prediction
Mitigating evaluation bias caused by internal-variability noise in benchmark targets
Comparing linear and deep learning methods for climate variable emulation
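The evaluation-bias point in the list above can be made concrete: even a perfect emulator of the forced response scores a nonzero RMSE against noisy targets, with an error floor set by the residual internal variability in the target average. This toy calculation (all values illustrative) shows how that floor drops from a 3-member to a 50-member target.

```python
import numpy as np

# A "perfect" emulator predicts the true forced response exactly (here: zero).
rng = np.random.default_rng(2)
sigma = 1.0                       # per-member internal-variability std. dev.
forced = np.zeros(10_000)         # true forced response = perfect prediction
target_3 = rng.normal(forced, sigma / np.sqrt(3))    # 3-member-mean target
target_50 = rng.normal(forced, sigma / np.sqrt(50))  # 50-member-mean target

# The perfect emulator's RMSE equals the noise left in the target:
# about sigma/sqrt(3) ~ 0.58 vs. sigma/sqrt(50) ~ 0.14.
rmse_3 = np.sqrt(np.mean((forced - target_3) ** 2))
rmse_50 = np.sqrt(np.mean((forced - target_50) ** 2))
```

Below that floor, RMSE differences between models reflect how well they fit noise, not the forced response — which is how a noisy benchmark can rank a linear model above a nonlinear one.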
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep learning emulators benchmarked against linear regression
Larger ensembles (50 vs. 3 members) average out internal-variability noise in benchmark targets
Deep learning outperforms linear baselines on precipitation under the revised targets
Björn Lütjens
Department of Earth, Atmospheric, and Planetary Sciences, Massachusetts Institute of Technology
Raffaele Ferrari
Department of Earth, Atmospheric, and Planetary Sciences, Massachusetts Institute of Technology
Duncan Watson-Parris
University of California San Diego
Atmospheric Physics, Clouds, Aerosols
N. Selin
Institute for Data, Systems and Society, Massachusetts Institute of Technology