🤖 AI Summary
Cardiovascular disease (CVD) risk prediction requires models that are simultaneously accurate and interpretable. To address this, we propose a weighted ensemble framework integrating LightGBM, XGBoost, and a one-dimensional convolutional neural network (1D-CNN), augmented by dual-path interpretability: SHAP-based feature attribution and surrogate decision tree approximation. We further incorporate domain-informed feature engineering and class-balanced weighting to mitigate data imbalance. Evaluated on 229,781 real-world patient records, the model achieves a test AUC of 0.8371 (p = 0.003) and a recall of 80.0%, significantly outperforming every individual base model. To our knowledge, this is the first interpretable deep-learning–tree ensemble for CVD risk prediction; it preserves high screening performance while meeting clinical requirements for transparency and trustworthiness, enabling scalable deployment in public health settings.
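The core idea of the weighted ensemble is to blend the per-model predicted probabilities before thresholding. A minimal sketch of that blending step, using hypothetical probabilities and illustrative weights (the summary does not publish the actual weight values or decision threshold):

```python
import numpy as np

# Hypothetical held-out probabilities from the three base models
# (in the actual study these would come from LightGBM, XGBoost, and a 1D-CNN).
p_lgbm = np.array([0.62, 0.10, 0.85])
p_xgb = np.array([0.58, 0.15, 0.80])
p_cnn = np.array([0.70, 0.05, 0.90])

# Illustrative ensemble weights summing to 1 (assumed, not from the paper).
weights = {"lgbm": 0.40, "xgb": 0.35, "cnn": 0.25}

# Weighted average of the base-model probabilities.
p_ensemble = (weights["lgbm"] * p_lgbm
              + weights["xgb"] * p_xgb
              + weights["cnn"] * p_cnn)

# For screening, the threshold is typically tuned for high recall
# rather than left at 0.5; 0.3 here is purely illustrative.
y_pred = (p_ensemble >= 0.3).astype(int)
print(p_ensemble)  # blended risk scores
print(y_pred)      # screening decisions
```

In practice the weights themselves would be tuned on a validation split (e.g. to maximize AUC), which is what "strategically weighted" suggests.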
📝 Abstract
Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis of the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble that combines tree-based methods (LightGBM, XGBoost) with a convolutional neural network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients; the inherent class imbalance was managed through strategic class weighting, and feature engineering expanded the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a test AUC of 0.8371 (p = 0.003), and its high recall of 80.0% makes it particularly well suited for screening. To provide transparency and clinical interpretability, the ensemble is explained with surrogate decision trees and SHapley Additive exPlanations (SHAP). By blending diverse learning architectures with built-in explainability, the proposed model combines robust predictive performance with clinical transparency, making it a strong candidate for real-world deployment in public health screening.
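The surrogate decision tree idea mentioned above can be sketched as follows: train a shallow, human-readable tree to mimic the black-box ensemble's *predictions* (not the true labels), then report its fidelity to the ensemble. This sketch uses synthetic data and a random forest as a stand-in for the actual LightGBM/XGBoost/CNN ensemble, which is an assumption for illustration only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the patient data (the real study uses 25 engineered features).
X, y = make_classification(n_samples=2000, n_features=25,
                           n_informative=8, random_state=0)

# Stand-in "black box" playing the role of the weighted ensemble.
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
y_bb = black_box.predict(X)

# Surrogate: a shallow tree fit to the black box's outputs, not the true labels,
# so its splits approximate the ensemble's decision logic.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)

# Fidelity: fraction of inputs on which the surrogate reproduces the black box.
fidelity = (surrogate.predict(X) == y_bb).mean()
print(f"surrogate fidelity: {fidelity:.3f}")

# The tree itself is the clinician-facing explanation.
print(export_text(surrogate, max_depth=2))
```

A high fidelity score indicates the simple tree is a faithful global approximation of the ensemble; SHAP would complement this with per-patient feature attributions.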