🤖 AI Summary
Machine learning in collaborative eScience suffers from poor reproducibility and opaque decision traceability, primarily because of fragmented workflows, informal data sharing, and loosely coupled toolchains. To address this, the authors propose a lifecycle-aware data infrastructure for collaborative eScience built on six structured core artifacts (Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary) that formally model and version the relationships among data, code, and decisions across the ML lifecycle. The approach integrates data versioning, semantic metadata modeling, workflow orchestration, controlled vocabulary management, and experiment lineage tracking. Demonstrated on a clinical glaucoma detection use case, the infrastructure supports iterative exploration, improves experiment reproducibility, and preserves the provenance of collaborative decisions over long-running, iterative development. It offers a foundation for ML-driven collaborative research that is reproducible, auditable, and able to evolve over time.
📝 Abstract
Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
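To make the artifact model more concrete, the following is a minimal sketch of how versioned artifacts and execution lineage might be represented. It is an illustration under stated assumptions, not the paper's implementation: the class names `Artifact`, `Execution`, and `Catalog`, their fields, and the example artifact names are all hypothetical, and only the six artifact types come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The six artifact types named in the abstract.
ARTIFACT_KINDS = {
    "Dataset", "Feature", "Workflow", "Execution", "Asset", "Controlled Vocabulary",
}


@dataclass(frozen=True)
class Artifact:
    """Hypothetical base record: a typed, named, versioned artifact."""
    kind: str
    name: str
    version: str

    def __post_init__(self) -> None:
        if self.kind not in ARTIFACT_KINDS:
            raise ValueError(f"unknown artifact kind: {self.kind!r}")

    @property
    def key(self) -> Tuple[str, str, str]:
        return (self.kind, self.name, self.version)


@dataclass
class Execution:
    """Hypothetical run record: which workflow ran, what it read and wrote."""
    artifact: Artifact              # the Execution artifact itself
    workflow: Artifact              # the Workflow version that was run
    inputs: List[Artifact] = field(default_factory=list)
    outputs: List[Artifact] = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory index from each output to its producing run."""

    def __init__(self) -> None:
        self._produced_by: Dict[Tuple[str, str, str], Execution] = {}

    def record(self, execution: Execution) -> None:
        for out in execution.outputs:
            self._produced_by[out.key] = execution

    def lineage(self, artifact: Artifact) -> List[Artifact]:
        """Collect every upstream artifact reachable through recorded runs."""
        seen: List[Artifact] = []
        stack = [artifact]
        while stack:
            current = stack.pop()
            execution = self._produced_by.get(current.key)
            if execution is None:
                continue  # a source artifact: nothing recorded upstream
            for upstream in [execution.artifact, execution.workflow, *execution.inputs]:
                if upstream not in seen:
                    seen.append(upstream)
                    stack.append(upstream)
        return seen


# Usage with made-up artifact names: one run turns a dataset into a model asset.
dataset = Artifact("Dataset", "example-dataset", "1.0.0")
workflow = Artifact("Workflow", "example-training", "0.3.0")
run = Artifact("Execution", "run-001", "1")
model = Artifact("Asset", "example-model", "1")

catalog = Catalog()
catalog.record(Execution(run, workflow, inputs=[dataset], outputs=[model]))
upstream_kinds = [a.kind for a in catalog.lineage(model)]
```

Keying every artifact by kind, name, and version is one simple way to keep a result traceable to the exact dataset and workflow versions that produced it; a real system would persist these records and attach richer metadata.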