Metadata practices for simulation workflows

📅 2024-08-30
🏛️ Scientific Data
🤖 AI Summary
Large volumes of heterogeneous metadata in scientific simulations impede reproducibility and cross-team sharing. To address this, we propose hardware- and software-agnostic, user-adaptable metadata practices in two stages: (1) recording and storing raw metadata, and (2) selecting and structuring that metadata on demand. Decoupling acquisition from structuring allows the practices to be applied to existing HPC simulation workflows without substantial restructuring. As a proof of concept, the Python tool Archivist supports the second stage. The practices are applied to high-performance computing use cases from neuroscience and hydrology, where they support replicability, reproducibility, data exploration, and data sharing.

📝 Abstract
Computer simulations are an essential pillar of knowledge generation in science. Exploring, understanding, reproducing, and sharing the results of simulations relies on tracking and organizing the metadata describing the numerical experiments. The models used to understand real-world systems, and the computational machinery required to simulate them, are typically complex, and produce large amounts of heterogeneous metadata. Here, we present general practices for acquiring and handling metadata that are agnostic to software and hardware, and highly flexible for the user. These consist of two steps: 1) recording and storing raw metadata, and 2) selecting and structuring metadata. As a proof of concept, we develop the Archivist, a Python tool to help with the second step, and use it to apply our practices to distinct high-performance computing use cases from neuroscience and hydrology. Our practices and the Archivist can readily be applied to existing workflows without the need for substantial restructuring. They support sustainable numerical workflows, fostering replicability, reproducibility, data exploration, and data sharing in simulation-based research.
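
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the Archivist's API: the function names, the `raw_metadata` directory, the file names, and the `selection` mapping format are all hypothetical, chosen only to show raw recording kept separate from later selection and structuring.

```python
import json
import platform
import sys
import tempfile
from pathlib import Path


def record_raw_metadata(run_dir: Path, parameters: dict) -> None:
    """Step 1 (hypothetical helper): store raw metadata as plain files."""
    raw = run_dir / "raw_metadata"  # directory name is an assumption
    raw.mkdir(parents=True, exist_ok=True)
    (raw / "parameters.json").write_text(json.dumps(parameters))
    (raw / "platform.txt").write_text(
        f"system={platform.system()}\n"
        f"machine={platform.machine()}\n"
        f"python={sys.version.split()[0]}\n"
    )


def parse_key_value(text: str) -> dict:
    """Parse lines of the form 'key=value' into a dictionary."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key.strip(): value.strip() for key, value in pairs}


# Hypothetical registry: one parser per raw file type.
PARSERS = {".json": json.loads, ".txt": parse_key_value}


def structure_metadata(run_dir: Path, selection: dict) -> dict:
    """Step 2 (hypothetical helper): select entries and build one record.

    `selection` maps an output key to a (file name, key inside file) pair.
    Entries that are absent from the raw file are returned as None.
    """
    raw = run_dir / "raw_metadata"
    parsed = {}  # cache so each raw file is read and parsed once
    record = {}
    for out_key, (file_name, in_key) in selection.items():
        if file_name not in parsed:
            path = raw / file_name
            parsed[file_name] = PARSERS[path.suffix](path.read_text())
        record[out_key] = parsed[file_name].get(in_key)
    return record


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        record_raw_metadata(run_dir, {"dt": 0.1, "n_neurons": 1000})
        selection = {
            "time_step": ("parameters.json", "dt"),
            "python_version": ("platform.txt", "python"),
        }
        print(structure_metadata(run_dir, selection))
```

Because the raw files are stored untouched, the selection mapping can be changed later and the structured record regenerated without rerunning the simulation.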
Problem

Research questions and friction points this paper is trying to address.

Tracking and organizing metadata for simulation workflows
Handling heterogeneous metadata from complex computational models
Ensuring replicability and data sharing in simulation research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agnostic metadata handling for simulations
Two-step metadata recording and structuring
Python tool Archivist for workflow integration
Jose Villamar
Institute for Advanced Simulation (IAS-6), Jülich Research Centre, Jülich, Germany; RWTH Aachen University, Aachen, Germany
M. Kelbling
Department of Computational Hydrosystems, Helmholtz-Centre for Environmental Research, Leipzig, Germany
Heather L. More
Institute for Advanced Simulation (IAS-9), Jülich Research Centre, Jülich, Germany
Michael Denker
Institute for Advanced Simulation (IAS-6), Jülich Research Centre, Jülich, Germany
Tom Tetzlaff
Jülich Research Centre, Germany
Computational Neuroscience
Johanna Senk
University of Sussex & Forschungszentrum Juelich
Stephan Thober
Department of Computational Hydrosystems, Helmholtz-Centre for Environmental Research, Leipzig, Germany