🤖 AI Summary
This study addresses the clinical challenges of time-consuming manual left ventricular ejection fraction (LVEF) assessment and substantial inter-observer variability in echocardiography. We propose a video-based deep learning framework for automatic LVEF estimation. Three architectural paradigms—3D Inception, two-stream networks, and CNN-RNN hybrids—are systematically compared. Furthermore, we conduct an in-depth analysis of how model capacity and critical hyperparameters—including kernel size and normalization strategy—affect generalization performance. Evaluated on the EchoNet-Dynamic dataset (10,030 echocardiographic videos), our lightweight, optimized 3D Inception variant achieves a state-of-the-art root mean squared error (RMSE) of 6.79% for video-level LVEF regression. These results demonstrate that structural simplification combined with careful hyperparameter tuning effectively mitigates overfitting and enhances clinical deployability.
📝 Abstract
Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
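To make the reported metric concrete, the sketch below shows how the video-level RMSE used throughout this study is computed from per-video LVEF values. The example values are purely hypothetical and NumPy is assumed; this is an illustration of the metric, not the study's evaluation code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted LVEF values (in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical per-video LVEF values (%): sonographer reference vs. model output
true_lvef = [55.0, 62.0, 40.0, 70.0]
pred_lvef = [52.0, 65.0, 47.0, 68.0]

print(f"RMSE: {rmse(true_lvef, pred_lvef):.2f}%")
```

Because LVEF is itself expressed as a percentage, the RMSE of 6.79% reported above is in the same units as the quantity being predicted, i.e. an average error of roughly 6.8 ejection-fraction points per video.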