🤖 AI Summary
Problem: Unsupervised machine learning for scientific discovery lacks standardized, verifiable, and reproducible workflows. Method: This paper introduces an end-to-end workflow for reproducible scientific discovery, comprising (i) formulating scientific questions that can be validated, (ii) robust exploratory data analysis, (iii) modeling with multiple methods (HDBSCAN, UMAP, t-SNE), (iv) stability assessment via bootstrap resampling and consensus clustering, and (v) structured documentation of results. The workflow shifts the emphasis from conventional pattern mining to validation-driven inference. Contribution/Results: Applied to the chemical classification of Milky Way globular clusters, the workflow is reported to improve the physical interpretability and cross-dataset generalizability of the clustering outcomes, supporting verifiable, reproducible discoveries in astrochemistry rather than purely heuristic exploration.
📝 Abstract
Unsupervised machine learning is widely used to mine large, unlabeled datasets and make data-driven discoveries in critical domains such as climate science, biomedicine, astronomy, and chemistry. Despite this widespread use, unsupervised learning workflows for making reliable and reproducible scientific discoveries are not standardized. In this paper, we present a structured workflow for using unsupervised learning techniques in science. We highlight and discuss best practices at each stage: formulating validatable scientific questions, conducting robust data preparation and exploration, using a range of modeling techniques, performing rigorous validation by evaluating the stability and generalizability of unsupervised learning conclusions, and communicating and documenting results effectively to ensure reproducible scientific discoveries. To illustrate the proposed workflow, we present a case study from astronomy that seeks to refine globular clusters of Milky Way stars based upon their chemical composition. The case study highlights the importance of validation and illustrates how a carefully designed unsupervised learning workflow can advance scientific discovery.
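The stability evaluation mentioned above can be made concrete with a small sketch. It is not the authors' implementation: it assumes synthetic 2-D data in place of stellar abundances, a plain NumPy k-means in place of the paper's clustering methods, and the "proportion of ambiguous pairs" as one possible stability summary. The function names (`kmeans`, `consensus_matrix`, `proportion_ambiguous`) and all parameter values are hypothetical.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns an integer label per row of X."""
    # Initialise centres from k distinct rows of X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def consensus_matrix(X, k, n_boot, rng):
    """Fraction of bootstrap runs in which each pair of points co-clusters,
    counted only over runs where both points were drawn."""
    n = len(X)
    co_cluster = np.zeros((n, n))
    co_sampled = np.zeros((n, n))
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap resample
        labels = kmeans(X[idx], k, rng)
        # Duplicated draws of a point share a label, so keep one copy each.
        uniq, first = np.unique(idx, return_index=True)
        lab = labels[first]
        same = lab[:, None] == lab[None, :]
        co_cluster[np.ix_(uniq, uniq)] += same
        co_sampled[np.ix_(uniq, uniq)] += 1
    return np.divide(co_cluster, co_sampled,
                     out=np.zeros_like(co_cluster), where=co_sampled > 0)


def proportion_ambiguous(consensus, low=0.1, high=0.9):
    """Share of point pairs whose co-clustering rate is neither near 0 nor 1.
    Lower values indicate a more stable partition."""
    off_diag = consensus[np.triu_indices_from(consensus, k=1)]
    return float(np.mean((off_diag > low) & (off_diag < high)))


# Synthetic demo data (an assumption, not the paper's dataset):
# three well-separated Gaussian blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in blob_centers])

C = consensus_matrix(X, k=3, n_boot=50, rng=rng)
print("proportion of ambiguous pairs:", proportion_ambiguous(C))
```

Repeating the same computation for several candidate values of `k`, or for a different clustering method, and comparing the resulting ambiguity scores is one way to check whether a reported grouping is a stable feature of the data or an artifact of a single run.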