🤖 AI Summary
This study addresses the poor reproducibility and limited scalability of analyses of sparse healthcare datasets, such as NHANES, when investigating hospitalization risk factors (e.g., diabetes, obesity, cardiovascular disease). We propose the first modular, generalizable, cloud-native DevOps analytics framework. Methodologically, it integrates CI/CD pipelines (GitHub Actions/Jenkins), cloud platforms (AWS/Azure), the Python scientific stack, data version control, and synthetic data generation to enable automated NHANES data updates, hybrid modeling, and end-to-end analysis. Key contributions include: (1) substantially improved analytical reliability and reproducibility; (2) seamless co-scaling of real and synthetic data; and (3) empirical validation of cross-domain transferability to other sparse-data domains, including environmental science and cybersecurity, establishing a general-purpose infrastructure for multi-domain health data analytics.
📝 Abstract
Analyzing National Health and Nutrition Examination Survey (NHANES) data efficiently to understand hospital utilization risk factors requires a scalable and reliable system. This study investigates the integration of continuous integration and deployment (CI/CD) practices into data science workflows, focusing on the analysis of NHANES data to identify the prevalence of diabetes, obesity, and cardiovascular disease. We propose an end-to-end, cloud-based DevOps framework for data analysis that examines risk factors associated with hospital utilization and evaluates key hospital utilization metrics. We also highlight the framework's modular structure, which can be generalized to domains beyond healthcare. The framework provides an online data update method that can be extended to use both real and synthetic data. It can therefore be especially useful in domains with sparse datasets, such as environmental science, robotics, cybersecurity, and cultural heritage and the arts.
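To make the idea of an online update over real and synthetic data more concrete, the following is a minimal Python sketch, not the authors' implementation. The `Record` fields, the `synthesize` and `update` helpers, and the independent-sampling strategy for synthetic records are all assumptions introduced here for illustration; the abstract does not specify how synthetic data is generated or how NHANES variables are encoded.

```python
import random
from dataclasses import dataclass

# Hypothetical condition flags; real NHANES variables are coded differently.
CONDITIONS = ("diabetes", "obesity", "cvd")


@dataclass(frozen=True)
class Record:
    """One survey respondent, reduced to three illustrative flags."""
    diabetes: bool
    obesity: bool
    cvd: bool
    synthetic: bool = False  # marks records that were generated, not observed


def prevalence(records, condition):
    """Share of records in which the named condition flag is true."""
    if not records:
        raise ValueError("no records")
    return sum(getattr(r, condition) for r in records) / len(records)


def synthesize(real, n, seed=0):
    """Generate n synthetic records by sampling each flag independently
    at its observed rate (an assumed, deliberately naive strategy)."""
    rng = random.Random(seed)
    rates = {c: prevalence(real, c) for c in CONDITIONS}
    return [
        Record(**{c: rng.random() < rates[c] for c in CONDITIONS}, synthetic=True)
        for _ in range(n)
    ]


def update(store, new_batch):
    """Online update: append a new batch to the store and return
    the refreshed prevalence of every condition."""
    store.extend(new_batch)
    return {c: prevalence(store, c) for c in CONDITIONS}
```

In a CI/CD setting, a step like `update` could run whenever a new data release is detected, with the synthetic batch kept distinguishable through the `synthetic` flag so that results on real data alone remain reproducible.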