A General-Purpose Data Harmonization Framework: Supporting Reproducible and Scalable Data Integration in the RADx Data Hub

📅 2025-03-03
🤖 AI Summary
Multi-source, heterogeneous scientific data, such as the RADx COVID-19 response data, are hard to make interoperable and reusable in the FAIR sense at scale. This paper proposes a general-purpose, reproducible, and extensible data harmonization framework built on parameterized primitive operations and automated execution tracing. The framework combines a customizable data representation model, a configurable operation library, and mechanisms for transformation logging and dependency tracking, so that harmonization protocols can be reproduced, audited, and reused. A proof-of-concept application in the RADx Data Hub illustrates how the framework is intended to lower the barrier to entry for domain experts, make harmonization more transparent, and support cross-study analyses. The authors position the approach as a transferable pathway for FAIR-compliant data integration in other scientific domains.

📝 Abstract
In the age of big data, it is important for primary research data to follow the FAIR principles of findability, accessibility, interoperability, and reusability. Data harmonization enhances interoperability and reusability by aligning heterogeneous data under standardized representations, benefiting both repository curators responsible for upholding data quality standards and consumers who require unified datasets. However, data harmonization is difficult in practice, requiring significant domain and technical expertise. We present a software framework to facilitate principled and reproducible harmonization protocols. Our framework implements a novel strategy of building harmonization transformations from parameterizable primitive operations and automated bookkeeping for executed transformations. We establish our data representation model and harmonization strategy and then present a proof-of-concept application in the context of the RADx Data Hub for COVID-19 pandemic response data. We believe that our framework offers a powerful solution for data scientists and curators who value transparency and reproducibility in data harmonization.

Problem

Research questions and friction points this paper is trying to address.

Facilitates data harmonization for FAIR principles compliance
Supports reproducible and scalable data integration processes
Addresses challenges in aligning heterogeneous data sets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameterizable primitive operations for harmonization
Automated bookkeeping of transformation executions
Proof-of-concept in RADx Data Hub application
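
The two ideas listed above, parameterizable primitive operations and automated bookkeeping, can be illustrated with a small sketch. This is not the framework's actual API: the names `PRIMITIVES`, `Step`, `Harmonizer`, and `replay`, the three example primitives, and the data element names are all hypothetical and chosen only for illustration. The sketch assumes a transformation is an ordered list of primitives, each configured by parameters, and that every execution is recorded so it can be re-run and checked later.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical primitive library: each entry takes parameters and returns
# a function that transforms a single value.
PRIMITIVES: dict[str, Callable[..., Callable[[Any], Any]]] = {
    "scale": lambda factor: lambda v: v * factor,
    "round": lambda digits: lambda v: round(v, digits),
    "map_values": lambda mapping: lambda v: mapping[v],
}


@dataclass
class Step:
    """One primitive operation plus the parameters that configure it."""
    op: str
    params: dict


def run_steps(steps: list[Step], value: Any) -> Any:
    """Apply the configured primitives to a value, in order."""
    for step in steps:
        value = PRIMITIVES[step.op](**step.params)(value)
    return value


@dataclass
class Harmonizer:
    """Runs transformations and records each execution in a log."""
    log: list[dict] = field(default_factory=list)

    def apply(self, source: str, target: str, steps: list[Step], value: Any) -> Any:
        result = run_steps(steps, value)
        # Bookkeeping: keep everything needed to audit or repeat this run.
        self.log.append({
            "source": source,
            "target": target,
            "steps": list(steps),
            "input": value,
            "output": result,
        })
        return result


def replay(entry: dict) -> bool:
    """Re-run a logged transformation and check it reproduces its output."""
    return run_steps(entry["steps"], entry["input"]) == entry["output"]


harmonizer = Harmonizer()

# Hypothetical data elements: pounds to kilograms, then a Y/N recode.
kg = harmonizer.apply(
    "weight_lb", "weight_kg",
    [Step("scale", {"factor": 0.45359237}), Step("round", {"digits": 1})],
    150,
)
coded = harmonizer.apply(
    "vaccinated", "vaccinated_code",
    [Step("map_values", {"mapping": {"Y": 1, "N": 0}})],
    "Y",
)

print(kg, coded)
print(all(replay(entry) for entry in harmonizer.log))
```

Keeping the parameters in the log, rather than only the resulting values, is what would let a curator re-apply the same protocol to a new study or inspect exactly how a harmonized value was derived.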

Authors

Jimmy K. Yu
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
Marcos Martínez-Romero
Technical Director, Stanford University
Semantic technology, Knowledge management, Semantic Web, Ontologies, Metadata
M. Horridge
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
M. U. Akdogan
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
M. Musen
Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA