🤖 AI Summary
Cardiac rehabilitation research is hindered by the scarcity, high missingness, and heterogeneity of real-world clinical data, which limits the performance of risk prediction models. To address this, we propose the first Conditional Variational Autoencoder (CVAE) framework tailored to cardiac rehabilitation process modeling. It integrates clinical temporal feature representation with a missingness-aware enhancement mechanism to generate high-fidelity synthetic data that are pathologically coherent, temporally plausible, and compatible across heterogeneous sources. The method significantly improves downstream model robustness under low-data and high-missingness regimes: multiple risk classifiers achieve an average accuracy gain of 7.2%, outperforming state-of-the-art generative approaches. It also mitigates dataset bias and enables reliable risk stratification without requiring exercise stress testing.
📝 Abstract
Cardiac rehabilitation is a structured clinical process involving multiple interdependent phases, individualized medical decisions, and the coordinated participation of diverse healthcare professionals. Because the program is sequential and adaptive, it can be modeled as a business process, which facilitates its analysis. Studies in this context nevertheless face significant limitations inherent to real-world medical databases: data are often scarce because collection is costly and slow; many existing records are unsuitable for specific analytical purposes; and missing values are highly prevalent, since not all patients undergo the same diagnostic tests. To address these limitations, this work proposes an architecture based on a Conditional Variational Autoencoder (CVAE) for synthesizing realistic clinical records that are coherent with real-world observations. The primary objective is to increase the size and diversity of the available datasets in order to improve the performance of cardiac risk prediction models and to reduce the need for potentially hazardous diagnostic procedures, such as exercise stress testing. The results show that the proposed architecture generates coherent and realistic synthetic data, and that using these data improves the accuracy of the various classifiers employed for cardiac risk detection, outperforming state-of-the-art deep learning approaches for synthetic data generation.
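To make the idea of training a CVAE on records with missing values more concrete, the following is a minimal sketch of one common loss formulation, not the architecture or objective used in this work. It assumes a standard normal prior on the latent variable, a squared-error reconstruction term restricted to observed entries through a binary mask, and a weighting factor `beta`; the function name `masked_cvae_loss` and all parameter names are hypothetical. The conditioning variable (for example, a risk label) would enter through the encoder and decoder networks, which are omitted here.

```python
import numpy as np

def masked_cvae_loss(x, x_hat, mask, mu, logvar, beta=1.0):
    """Illustrative CVAE loss that ignores missing entries in the reconstruction.

    x, x_hat, mask: arrays of shape (batch, n_features); mask is 1 where observed.
    mu, logvar: arrays of shape (batch, latent_dim) produced by an encoder.
    """
    # Replace unobserved values (possibly NaN) so they cannot propagate.
    x_filled = np.where(mask == 1, x, 0.0)
    sq_err = mask * (x_filled - x_hat) ** 2
    # Mean squared error over observed entries only.
    recon = sq_err.sum() / max(mask.sum(), 1)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I), batch-averaged.
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1).mean()
    return recon + beta * kl, recon, kl

# One record with a missing second feature (hypothetical values).
x = np.array([[1.0, np.nan, 3.0]])
mask = (~np.isnan(x)).astype(float)
x_hat = np.array([[1.0, 5.0, 2.0]])
mu = np.zeros((1, 2))
logvar = np.zeros((1, 2))
total, recon, kl = masked_cvae_loss(x, x_hat, mask, mu, logvar)
```

Masking the reconstruction term means the decoder is never penalized for its output at positions where no measurement exists, which is one simple way to train on records from patients who did not undergo every diagnostic test.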