Towards Foundation Models: Evaluation of Geoscience Artificial Intelligence with Uncertainty

📅 2025-01-15
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Seismology lacks a robust deep learning model (DLM) evaluation framework that jointly accounts for performance uncertainty, learning efficiency, and data overlap effects. Method: We propose the first three-dimensional evaluation paradigm for Earth science foundation models (FMs), integrating uncertainty quantification, data-splitting robustness, and learning efficiency analysis; we introduce a geoscience-feature-clustering–based data partitioning strategy to expose performance inflation caused by train-test data overlap. Contribution/Results: Using multi-random-seed and multi-sample Monte Carlo evaluation, PhaseNet benchmarking, and training-budget sensitivity analysis, we demonstrate that state-of-the-art phase-picking models exhibit performance fluctuations up to ±12.7%, and quantify that data overlap can artificially inflate FM metrics by up to 34%. This framework substantially enhances the reliability of model selection and the accuracy of performance forecasting.
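The clustering-based partitioning described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `grouped_split` assigns whole groups to train/val/test so that similar records never straddle a split boundary, and the coarse lat/lon grid-cell key stands in for the paper's geoscience-feature clustering.

```python
# Sketch of overlap-aware data splitting: whole "clusters" (here, proxy
# groups keyed by a coarse lat/lon grid cell) are assigned to a single
# split, so near-duplicate records cannot leak from train into test.
# All names here are illustrative, not from the paper.
from collections import defaultdict
import random

def grouped_split(records, key, fracs=(0.7, 0.15, 0.15), seed=0):
    """Assign whole groups to train/val/test splits."""
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[key(rec)].append(i)
    order = list(groups)
    random.Random(seed).shuffle(order)  # randomize group-to-split assignment
    n = len(records)
    bounds = (fracs[0] * n, (fracs[0] + fracs[1]) * n)
    splits = {"train": [], "val": [], "test": []}
    assigned = 0
    for g in order:
        name = ("train" if assigned < bounds[0]
                else "val" if assigned < bounds[1]
                else "test")
        splits[name].extend(groups[g])
        assigned += len(groups[g])
    return splits

# Example: group synthetic station records by a ~1-degree grid cell.
records = [{"lat": 35.1 + 0.01 * i, "lon": -106.6, "id": i} for i in range(100)]
splits = grouped_split(records, key=lambda r: (round(r["lat"]), round(r["lon"])))
```

Because assignment happens at the group level, two waveforms from the same grid cell (a stand-in for "similar geoscience features") always land in the same split, which is the property the paper uses to expose overlap-inflated metrics.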

📝 Abstract
Artificial intelligence (AI) has transformed the geoscience community with deep learning models (DLMs) that are trained to complete specific tasks within workflows. This success has led to the development of geoscience foundation models (FMs), which promise to accomplish multiple tasks within a workflow or replace the workflow altogether. However, lack of robust evaluation frameworks, even for traditional DLMs, leaves the geoscience community ill-prepared for the inevitable adoption of FMs. We address this gap by designing an evaluation framework that jointly incorporates three crucial aspects of current DLMs and future FMs: performance uncertainty, learning efficiency, and overlapping training-test data splits. To target the three aspects, we meticulously construct the training, validation, and test splits using clustering methods tailored to geoscience data and enact an expansive training design to segregate performance uncertainty arising from stochastic training processes and random data sampling. The framework's ability to guard against misleading declarations of model superiority is demonstrated through evaluation of PhaseNet, a popular seismic phase picking DLM, under three training approaches. Furthermore, we show how the performance gains due to overlapping training-test data can lead to biased FM evaluation. Our framework helps practitioners choose the best model for their problem and set performance expectations by explicitly analyzing model performance at varying budgets of training data.
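The abstract's "expansive training design" crosses training seeds with training-data samples so the two sources of uncertainty can be separated. A minimal sketch of that idea follows; `evaluate` is a synthetic placeholder (in real use it would train PhaseNet with `train_seed` on a subset drawn with `sample_seed` and return a test score), so the numbers it produces are illustrative only.

```python
# Sketch of a crossed multi-seed x multi-sample Monte Carlo evaluation,
# attributing score spread to training stochasticity vs data sampling.
# `evaluate` is a placeholder for "train a model and score it on test".
import random
import statistics

def evaluate(train_seed, sample_seed):
    """Placeholder for training with train_seed on a training set
    resampled with sample_seed, then scoring on a fixed test set."""
    rng = random.Random(train_seed * 1000 + sample_seed)
    return 0.85 + rng.gauss(0, 0.02)  # synthetic F1-like score

# Full cross of 5 training seeds x 5 data samples.
scores = {(t, s): evaluate(t, s) for t in range(5) for s in range(5)}

# Average over one factor to isolate the spread due to the other.
by_train = [statistics.mean(scores[t, s] for s in range(5)) for t in range(5)]
by_sample = [statistics.mean(scores[t, s] for t in range(5)) for s in range(5)]
train_var = statistics.variance(by_train)    # spread from training stochasticity
sample_var = statistics.variance(by_sample)  # spread from data sampling
```

Reporting both variance components, rather than a single best score, is what lets a practitioner decide whether an apparent gap between two models exceeds the run-to-run fluctuation the paper documents.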
Problem

Research questions and friction points this paper is trying to address.

Seismology lacks robust evaluation frameworks for deep learning models
Performance uncertainty and learning efficiency need to be assessed jointly
A principled framework prevents misleading claims of model superiority
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluation framework that quantifies performance uncertainty
Clustering-based method for constructing seismic data splits
Expansive training design for fair model comparison
Samuel Myren
Los Alamos National Laboratory | Virginia Tech
Nidhi Parikh
Los Alamos National Laboratory
R. Rael
Los Alamos National Laboratory
Garrison Flynn
Los Alamos National Laboratory
Dave Higdon
Virginia Tech
Statistics, Uncertainty Quantification
Emily Casleton
Scientist, Los Alamos National Laboratory
Statistics, Bayesian non-parametrics, AI testing and evaluation, AI research