🤖 AI Summary
Much of the irreproducibility in data science stems from uncertainty introduced by human judgment calls throughout the data science life cycle (DSLC), which traditional statistical frameworks often fail to account for. To address this, the paper presents an updated, streamlined workflow grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science, tailored for practitioners and enhanced with guided use of generative AI. A running example shows the PCS workflow in action, and a related case study shows how judgment calls in the data cleaning stage propagate into uncertainty in downstream predictions. The workflow is offered as a practical approach to reproducible data science in the AI era.
📝 Abstract
Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to show the PCS framework in action and conduct a related case study that showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.
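To make the idea of cleaning-stage judgment calls concrete, the following is a minimal sketch and not the paper's case study or code. It assumes synthetic data, a one-variable least-squares model, and two hypothetical cleaning decisions (how to handle missing feature values, and whether to trim the largest responses). All function names and thresholds are illustrative assumptions. The sketch refits the same model under each combination of choices and reports how far the prediction at a single point moves.

```python
import itertools
import random
import statistics


def make_raw_data(n=200, seed=0):
    """Synthetic (x, y) pairs with some missing x values and a few extreme y values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 2.0 * x + 1.0 + rng.gauss(0, 1)
        if rng.random() < 0.10:
            x = None      # missing feature value
        if rng.random() < 0.03:
            y += 40.0     # gross outlier in the response
        rows.append((x, y))
    return rows


def clean(rows, impute, outliers):
    """Apply one combination of (hypothetical) cleaning judgment calls."""
    observed = [x for x, _ in rows if x is not None]
    if impute == "drop":
        rows = [(x, y) for x, y in rows if x is not None]
    else:
        if impute == "mean":
            fill = statistics.mean(observed)
        else:  # "median"
            fill = statistics.median(observed)
        rows = [(fill if x is None else x, y) for x, y in rows]
    if outliers == "trim":
        # Deliberately crude rule: discard responses above roughly the 95th percentile.
        ys = sorted(y for _, y in rows)
        cutoff = ys[int(0.95 * (len(ys) - 1))]
        rows = [(x, y) for x, y in rows if y <= cutoff]
    return rows


def fit_line(rows):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return my - b * mx, b


def prediction_spread(rows, x_new):
    """Predict at x_new under every cleaning choice; returns {choice: prediction}."""
    preds = {}
    choices = itertools.product(["drop", "mean", "median"], ["keep", "trim"])
    for impute, outliers in choices:
        a, b = fit_line(clean(rows, impute, outliers))
        preds[(impute, outliers)] = a + b * x_new
    return preds


preds = prediction_spread(make_raw_data(), x_new=8.0)
for choice, value in sorted(preds.items()):
    print(choice, round(value, 2))
spread = max(preds.values()) - min(preds.values())
print("range across cleaning choices:", round(spread, 2))
```

The printed range is one simple way to summarize how much a downstream prediction depends on upstream cleaning decisions. A fuller analysis in the spirit of the abstract would presumably cover more choices and more than one model, which this sketch does not attempt.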