🤖 AI Summary
This study addresses the fragmentation of mass spectrometry data, annotations, and metadata in untargeted metabolomics, which hinders traceable and reusable knowledge generation. To overcome this limitation, the authors propose MetaboKG, an analysis-centric knowledge graph framework that integrates GNPS molecular network results, public repository metadata, and multiple ontologies (including MS, ChEBI, and NCBITaxon) into a unified semantic model grounded in PROV-O and SIO. The framework introduces a Universal Annotation Identifier that extends the Universal Spectrum Identifier (USI) with workflow-specific components to enable late binding and linking across analyses. Demonstrated on 680 GNPS molecular networking results, MetaboKG supports competency-question queries on biochemical enrichment, environmental specificity, and cross-instrument analytical variation, thereby facilitating traceable annotation reuse and reproducible SPARQL exploration.
📝 Abstract
Untargeted metabolomics generates large volumes of tandem mass spectrometry (MS/MS) data and computational annotations that can reveal molecular mechanisms across organisms and environments. Public reuse has improved through harmonized repository metadata and access infrastructures such as Pan-ReDU, and through metabolomics knowledge graphs such as ENPKG and METRIN-KG. Yet the analytical layer remains fragmented: spectra, features, workflow outputs, annotations, confidence evidence, and contextual metadata are still scattered across repositories and tabular artifacts. We present MetaboKG, an analysis-centric knowledge graph framework for engineering reusable metabolomics knowledge from public repositories, metadata, and GNPS molecular network results. MetaboKG contributes a transformation workflow that preserves links between repository exports, analytical files, spectra, features, and annotation results; a semantic model grounded in PROV-O and SIO and aligned with the Mass Spectrometry ontology (MS), ChEBI, NCBITaxon, ENVO, and NCIT to represent provenance, analytical evidence, metadata attributes, and controlled vocabulary terms; and a Universal Annotation Identifier strategy extending the Universal Spectrum Identifier (USI) with workflow-specific components for late binding, incremental ingestion, and post hoc linkage across analyses. We demonstrate MetaboKG at public-repository scale on 680 GNPS molecular networking results and evaluate it through competency questions covering biochemical enrichment, environmental specificity, and cross-instrument analytical variation. Results show that graph-based integration supports traceable annotation reuse and reproducible SPARQL exploration of biochemical relationships that remain fragmented across repository-native resources.
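The abstract does not spell out the layout of the Universal Annotation Identifier, so the following is only a minimal sketch of the general idea: take the core USI fields (`mzspec:<collection>:<msRun>:<indexType>:<indexNumber>`) and append workflow-specific components so that an annotation can be named before the rest of its context is ingested. The `;workflow=...;task=...` suffix, the helper names, and the placeholder accession `MSV000000000` are all hypothetical and are not taken from MetaboKG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpectrumRef:
    """Core fields of a Universal Spectrum Identifier (USI)."""
    collection: str   # dataset accession (placeholder values used below)
    ms_run: str       # run or file name
    index_type: str   # e.g. "scan", "index", or "nativeId"
    index: str        # spectrum index within the run

    def usi(self) -> str:
        return f"mzspec:{self.collection}:{self.ms_run}:{self.index_type}:{self.index}"


def parse_usi(usi: str) -> SpectrumRef:
    """Parse a USI; assumes no optional interpretation suffix is present."""
    parts = usi.split(":")
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI: {usi!r}")
    # Index type and number are the last two fields; anything between the
    # collection and those two fields is treated as the run name.
    return SpectrumRef(parts[1], ":".join(parts[2:-2]), parts[-2], parts[-1])


def annotation_id(ref: SpectrumRef, workflow: str, task: str) -> str:
    """Append workflow components to a USI (hypothetical suffix layout)."""
    return f"{ref.usi()};workflow={workflow};task={task}"


def split_annotation_id(aid: str) -> tuple[SpectrumRef, dict[str, str]]:
    """Recover the spectrum reference and the workflow components."""
    usi, *pairs = aid.split(";")
    extra = dict(pair.split("=", 1) for pair in pairs)
    return parse_usi(usi), extra


if __name__ == "__main__":
    ref = parse_usi("mzspec:MSV000000000:example_run:scan:42")
    aid = annotation_id(ref, "molecular-networking", "task123")
    print(aid)
    print(split_annotation_id(aid))
```

Because the identifier is derived only from fields known at annotation time, two analyses of the same spectrum yield identifiers that share the USI prefix. That shared prefix is one plausible way to support the late binding and post hoc linkage the abstract mentions.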