From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience

📅 2025-06-19
🤖 AI Summary
Machine learning in collaborative eScience suffers from poor reproducibility and opaque decision traceability, largely because of fragmented workflows, informal data sharing, and loosely coupled toolchains. To address this, the paper proposes a lifecycle-aware, data-centric framework built on six structured artifacts (Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary) that formalize and version the relationships among data, code, and decisions across the ML lifecycle. The approach integrates data versioning, semantic metadata modeling, workflow orchestration, controlled vocabulary management, and experiment lineage tracking. Demonstrated on a clinical glaucoma detection use case, the framework supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions over time. It aims to provide a foundation for ML-driven collaborative research that is reproducible, auditable, and able to evolve.

📝 Abstract
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
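
As a rough illustration of how the six artifact types might relate to one another, the sketch below models them as versioned references and records an Execution that links a Workflow to its inputs and outputs. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names `ArtifactRef` and `ExecutionRecord`, the function `trace_lineage`, and all artifact names and versions (`fundus_images`, `train_classifier`, and so on) are hypothetical and chosen only to echo the glaucoma use case.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The six artifact types named in the abstract.
KINDS = {"Dataset", "Feature", "Workflow", "Execution", "Asset", "ControlledVocabulary"}


@dataclass(frozen=True)
class ArtifactRef:
    """A versioned pointer to one artifact (hypothetical structure)."""
    kind: str
    name: str
    version: str

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown artifact kind: {self.kind}")


@dataclass(frozen=True)
class ExecutionRecord:
    """One run of a workflow: what it consumed and what it produced."""
    ref: ArtifactRef                    # the Execution artifact itself
    workflow: ArtifactRef               # the code that was run
    inputs: Tuple[ArtifactRef, ...]     # artifacts read by the run
    outputs: Tuple[ArtifactRef, ...]    # artifacts written by the run


def trace_lineage(target: ArtifactRef, executions: List[ExecutionRecord]) -> List[ArtifactRef]:
    """Return every artifact that `target` transitively depends on."""
    produced_by = {out: ex for ex in executions for out in ex.outputs}
    seen: List[ArtifactRef] = []
    stack = [target]
    while stack:
        current = stack.pop()
        ex = produced_by.get(current)
        if ex is None:
            continue  # a source artifact: nothing recorded as producing it
        for dep in (ex.ref, ex.workflow, *ex.inputs):
            if dep not in seen:
                seen.append(dep)
                stack.append(dep)
    return seen


# Hypothetical example loosely modeled on the glaucoma detection use case.
fundus = ArtifactRef("Dataset", "fundus_images", "1.0")
labels = ArtifactRef("Feature", "glaucoma_label", "1.0")
train = ArtifactRef("Workflow", "train_classifier", "0.3")
run = ArtifactRef("Execution", "run_001", "1")
model = ArtifactRef("Asset", "model_weights", "1")

record = ExecutionRecord(run, train, (fundus, labels), (model,))
lineage = trace_lineage(model, [record])
for ref in lineage:
    print(ref.kind, ref.name, ref.version)
```

Because every reference carries an explicit version and each Execution names both its code and its data, the question "which dataset, feature, and workflow versions produced this model?" becomes a graph traversal rather than a search through scripts and shared folders.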
Problem

Research questions and friction points this paper is trying to address.

Addressing reproducibility challenges in collaborative ML projects
Overcoming fragmented workflows in machine learning experiments
Enhancing traceability in data-centric ML lifecycle management
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric framework with six structured artifacts
Versioned, traceable ML experiments across the lifecycle
Clinical ML use case for glaucoma detection
Zhiwei Li
Dept. of Industrial and Systems Engineering, University of Southern California, Los Angeles, USA
Carl Kesselman
Professor, University of Southern California Information Sciences Institute
computer science, medicine, bioinformatics
Tran Huy Nguyen
Dept. of Computer Science, University of Southern California, Los Angeles, USA
Benjamin Yixing Xu
Dept. of Ophthalmology, University of Southern California, Los Angeles, USA
Kyle Bolo
Dept. of Ophthalmology, University of Southern California, Los Angeles, USA
Kimberley Yu
Dept. of Ophthalmology, University of Southern California, Los Angeles, USA