From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience

📅 2025-06-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Machine learning in collaborative eScience suffers from poor reproducibility and opaque decision traceability, largely because of fragmented workflows, informal data sharing, and loosely coupled toolchains. The paper proposes a lifecycle-aware data infrastructure built around six structured artifacts (Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary) that formally model and version the relationships among data, code, and decisions across the ML lifecycle. The approach combines data versioning, semantic metadata modeling, workflow orchestration, controlled vocabulary management, and experiment lineage tracking. Demonstrated on a clinical glaucoma detection use case, the infrastructure supports iterative exploration, improves experiment reproducibility, and preserves the provenance of collaborative decisions. The authors position it as a foundation for ML-driven collaborative research that is reproducible, auditable, and able to evolve over time.

📝 Abstract

Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.
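One way to picture how six artifact types could "formalize the relationships between data, code, and decisions" is as versioned records linked by derivation edges. The sketch below is illustrative only and is not the paper's implementation: the `Artifact` and `Catalog` classes, the `register` and `lineage` methods, and every artifact name (`fundus_images`, `image_quality`, and so on) are hypothetical, chosen just to show how an output can be traced back to everything it depends on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, str, str]  # (kind, name, version)


@dataclass(frozen=True)
class Artifact:
    """A versioned record. `kind` is one of the six artifact types."""
    kind: str      # Dataset, Feature, Workflow, Execution, Asset, Vocabulary
    name: str      # hypothetical names, for illustration only
    version: str

    @property
    def key(self) -> Key:
        return (self.kind, self.name, self.version)


@dataclass
class Catalog:
    """Hypothetical in-memory store of artifacts and their derivation edges."""
    artifacts: Dict[Key, Artifact] = field(default_factory=dict)
    parents: Dict[Key, List[Key]] = field(default_factory=dict)

    def register(self, artifact: Artifact,
                 derived_from: Sequence[Artifact] = ()) -> Artifact:
        # Refuse edges to unknown artifacts so lineage is never dangling.
        for parent in derived_from:
            if parent.key not in self.artifacts:
                raise KeyError(f"unregistered parent: {parent.key}")
        self.artifacts[artifact.key] = artifact
        self.parents[artifact.key] = [p.key for p in derived_from]
        return artifact

    def lineage(self, artifact: Artifact) -> List[Artifact]:
        """Return every upstream artifact once, found by depth-first search."""
        seen: List[Key] = []
        stack = list(self.parents[artifact.key])
        while stack:
            key = stack.pop()
            if key not in seen:
                seen.append(key)
                stack.extend(self.parents[key])
        return [self.artifacts[k] for k in seen]


# Assumed example: a model asset produced by one execution.
catalog = Catalog()
vocab = catalog.register(Artifact("Vocabulary", "diagnosis_labels", "1"))
images = catalog.register(Artifact("Dataset", "fundus_images", "1"), [vocab])
feature = catalog.register(Artifact("Feature", "image_quality", "1"), [images])
workflow = catalog.register(Artifact("Workflow", "train_classifier", "1"))
run = catalog.register(Artifact("Execution", "run_001", "1"),
                       [images, feature, workflow])
model = catalog.register(Artifact("Asset", "model_weights", "1"), [run])

for upstream in catalog.lineage(model):
    print(upstream.kind, upstream.name, upstream.version)
```

Because each record carries an explicit version and each edge is stored at registration time, asking "which data, features, and code produced this model?" becomes a graph traversal rather than a search through scripts and shared folders. A real system would persist these records and edges instead of keeping them in memory.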

Problem

Research questions and friction points this paper is trying to address.

Addressing reproducibility challenges in collaborative ML projects

Overcoming fragmented workflows in machine learning experiments

Enhancing traceability in data-centric ML lifecycle management

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric framework with six structured artifacts

Versioned, traceable ML experiments across the lifecycle

Clinical ML use case for glaucoma detection

🔎 Similar Papers

Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers