🤖 AI Summary
Cardiovascular disease (CVD) risk prediction requires models that are simultaneously accurate and interpretable. To address this, we propose a weighted ensemble framework integrating LightGBM, XGBoost, and a one-dimensional convolutional neural network (1D-CNN), augmented by dual-path interpretability: SHAP-based feature attribution and surrogate decision tree approximation. We further incorporate domain-informed feature engineering and class-balanced weighting to mitigate data imbalance. Evaluated on 229,781 real-world patient records, the model achieves a test AUC of 0.8371 (p = 0.003) and a recall of 80.0%, significantly outperforming every individual base model. To our knowledge, this is the first interpretable deep-learning–tree ensemble for CVD risk prediction; it preserves high screening performance while meeting clinical requirements for transparency and trustworthiness, enabling scalable deployment in public health settings.
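The core idea of the weighted ensemble is to blend the per-model predicted probabilities before thresholding. A minimal sketch of that blending step, using hypothetical probabilities and illustrative weights (the summary does not publish the actual weight values or decision threshold):

```python
import numpy as np

# Hypothetical held-out probabilities from the three base models
# (in the actual study these would come from LightGBM, XGBoost, and a 1D-CNN).
p_lgbm = np.array([0.62, 0.10, 0.85])
p_xgb = np.array([0.58, 0.15, 0.80])
p_cnn = np.array([0.70, 0.05, 0.90])

# Illustrative ensemble weights summing to 1 (assumed, not from the paper).
weights = {"lgbm": 0.40, "xgb": 0.35, "cnn": 0.25}

# Weighted average of the base-model probabilities.
p_ensemble = (weights["lgbm"] * p_lgbm
              + weights["xgb"] * p_xgb
              + weights["cnn"] * p_cnn)

# For screening, the threshold is typically tuned for high recall
# rather than left at 0.5; 0.3 here is purely illustrative.
y_pred = (p_ensemble >= 0.3).astype(int)
print(p_ensemble)  # blended risk scores
print(y_pred)      # screening decisions
```

In practice the weights themselves would be tuned on a validation split (e.g. to maximize AUC), which is what "strategically weighted" suggests.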
📝 Abstract
Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis of the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble that combines tree-based methods (LightGBM, XGBoost) with a convolutional neural network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients; the inherent class imbalance was managed through strategic class weighting, and feature engineering expanded the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a test AUC of 0.8371 (p = 0.003), and its high recall of 80.0% makes it particularly well suited for screening. To provide transparency and clinical interpretability, the ensemble is explained with surrogate decision trees and SHapley Additive exPlanations (SHAP). By blending diverse learning architectures with built-in explainability, the proposed model combines robust predictive performance with clinical transparency, making it a strong candidate for real-world deployment in public health screening.
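The surrogate decision tree idea mentioned above can be sketched as follows: train a shallow, human-readable tree to mimic the black-box ensemble's *predictions* (not the true labels), then report its fidelity to the ensemble. This sketch uses synthetic data and a random forest as a stand-in for the actual LightGBM/XGBoost/CNN ensemble, which is an assumption for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the patient data (the real study uses 25 engineered features).
X, y = make_classification(n_samples=2000, n_features=25,
                           n_informative=8, random_state=0)

# Stand-in "black box" playing the role of the weighted ensemble.
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
y_bb = black_box.predict(X)

# Surrogate: a shallow tree fit to the black box's outputs, not the true labels,
# so its splits approximate the ensemble's decision logic.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)

# Fidelity: fraction of inputs on which the surrogate reproduces the black box.
fidelity = (surrogate.predict(X) == y_bb).mean()
print(f"surrogate fidelity: {fidelity:.3f}")

# The tree itself is the clinician-facing explanation.
print(export_text(surrogate, max_depth=2))
```

A high fidelity score indicates the simple tree is a faithful global approximation of the ensemble; SHAP would complement this with per-patient feature attributions.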