OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI

📅 2026-05-19
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses the limited development of generative AI in seismic inversion due to the scarcity of real velocity model data, much of which remains proprietary within oil and gas companies. Leveraging publicly available well-log and seismic data from the UK National Data Repository, this work presents the first open-source, large-scale, and realistic joint dataset tailored for generative AI. An automated pipeline is employed to establish time–depth relationships using checkshot surveys, interpolate velocity models, and perform time-to-depth conversion of seismic data, while simultaneously executing data cleaning and standardization. The resulting dataset enables the training of generative models capable of synthesizing statistically consistent subsurface velocity models that effectively capture geological uncertainty, thereby providing reliable prior information for seismic inversion.
📝 Abstract
The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.
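The checkshot-based conversion described above can be illustrated with a minimal sketch. It is not the OpenSeisML pipeline itself: it assumes a single well, checkshot pairs given as two-way time in seconds and depth in metres (sorted, strictly increasing), and simple piecewise-linear interpolation. The function names `interval_velocity` and `trace_time_to_depth` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def interval_velocity(twt_s, depth_m):
    """Interval velocity (m/s) between consecutive checkshot levels.

    Assumes two-way time, so v = 2 * dz / dt.
    """
    t = np.asarray(twt_s, dtype=float)
    z = np.asarray(depth_m, dtype=float)
    return 2.0 * np.diff(z) / np.diff(t)


def trace_time_to_depth(trace, dt_s, twt_s, depth_m, dz_m):
    """Resample one time-domain trace onto a regular depth axis.

    Assumes the checkshot arrays are sorted and strictly increasing.
    np.interp clamps outside the checkshot range, so samples beyond the
    deepest checkshot are not extrapolated.
    """
    trace = np.asarray(trace, dtype=float)
    t = np.asarray(twt_s, dtype=float)
    z = np.asarray(depth_m, dtype=float)

    # Time axis of the input trace, starting at t = 0.
    t_axis = np.arange(len(trace)) * dt_s

    # Depth reached at the last time sample, then a regular depth axis.
    z_max = np.interp(t_axis[-1], t, z)
    z_axis = np.arange(0.0, z_max + 0.5 * dz_m, dz_m)

    # Depth -> time through the inverted time-depth curve,
    # then sample the trace at those times.
    t_of_z = np.interp(z_axis, z, t)
    return z_axis, np.interp(t_of_z, t_axis, trace)


# Hypothetical checkshot survey: (two-way time, depth) pairs.
twt = [0.0, 1.0, 2.0]
depth = [0.0, 1000.0, 3000.0]

# Toy trace whose amplitude equals its own time, sampled every 0.5 s.
toy_trace = np.arange(5) * 0.5
z_axis, depth_trace = trace_time_to_depth(toy_trace, 0.5, twt, depth, 500.0)
```

In a volume-scale workflow the same idea would be applied trace by trace with a velocity model interpolated between wells; that spatial interpolation step is not shown here.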
Problem

Research questions and friction points this paper is trying to address.

seismic inversion
generative AI
velocity model
data scarcity
well-log dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenSeisML
generative AI
seismic inversion
time-to-depth conversion
automated data curation