🤖 AI Summary
This study addresses three challenges of multi-label electrocardiogram (ECG) classification (disease co-occurrence, class imbalance, and long-range temporal dependencies) and critically examines whether recurrent architectures are necessary, a question that has lacked systematic validation. Through controlled experiments on the PTB-XL dataset, the authors evaluate hybrid models combining CNNs with various recurrent structures (LSTM, GRU, BiLSTM, and their stacked variants). They show that increasing recurrent depth yields diminishing returns and risks overfitting. In contrast, a lightweight CNN coupled with a single-layer BiLSTM better aligns with the intrinsic temporal dynamics of ECG signals and outperforms deeper recurrent models on the key metrics: Hamming loss (0.0338), macro-AUPRC (0.4715), micro-F1 (0.6979), and subset accuracy (0.5723). These results support the efficacy and clinical plausibility of parsimonious temporal modeling.
📝 Abstract
Accurate multi-label classification of electrocardiogram (ECG) signals remains challenging due to the coexistence of multiple cardiac conditions, pronounced class imbalance, and long-range temporal dependencies in multi-lead recordings. Although recent studies increasingly rely on deep and stacked recurrent architectures, the necessity and clinical justification of such architectural complexity have not been rigorously examined. In this work, we perform a systematic comparative evaluation of convolutional neural networks (CNNs) combined with multiple recurrent configurations, including LSTM, GRU, Bidirectional LSTM (BiLSTM), and their stacked variants, for multi-label ECG classification on the PTB-XL dataset comprising 23 diagnostic categories. The CNN component serves as a morphology-driven baseline, while recurrent layers are progressively integrated to assess their contribution to temporal modeling and generalization performance. Experimental results indicate that a CNN integrated with a single BiLSTM layer achieves the most favorable trade-off between predictive performance and model complexity. This configuration attains superior Hamming loss (0.0338), macro-AUPRC (0.4715), micro-F1 score (0.6979), and subset accuracy (0.5723) compared with deeper recurrent combinations. Although stacked recurrent models occasionally improve recall for specific rare classes, our results provide empirical evidence that increasing recurrent depth yields diminishing returns and may degrade generalization due to reduced precision and overfitting. These findings suggest that architectural alignment with the intrinsic temporal structure of ECG signals, rather than increased recurrent depth, is a key determinant of robust performance and clinically relevant deployment.
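The four metrics reported above measure different things: Hamming loss and micro-F1 score individual label decisions, subset accuracy requires every label of a record to be correct, and macro-AUPRC averages per-class ranking quality so that rare classes count as much as common ones. The following is a minimal pure-Python sketch of how such metrics are commonly computed, not the authors' evaluation code. The 0.5 decision threshold, the use of average precision as the AUPRC estimate, the skipping of classes without positives, and the toy data (4 records, 3 classes) are all assumptions made for illustration.

```python
from typing import List, Sequence

Labels = Sequence[Sequence[int]]    # records x classes, entries are 0 or 1
Scores = Sequence[Sequence[float]]  # same shape, predicted probabilities


def binarize(scores: Scores, threshold: float = 0.5) -> List[List[int]]:
    """Turn probabilities into 0/1 predictions (the 0.5 cut-off is an assumption)."""
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots predicted wrongly (lower is better)."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of records whose entire label vector is predicted exactly."""
    exact = sum(list(rt) == list(rp) for rt, rp in zip(y_true, y_pred))
    return exact / len(y_true)


def micro_f1(y_true: Labels, y_pred: Labels) -> float:
    """F1 computed from counts pooled over all records and classes."""
    tp = fp = fn = 0
    for rt, rp in zip(y_true, y_pred):
        for t, p in zip(rt, rp):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(truth: Sequence[int], scores: Sequence[float]) -> float:
    """Mean precision at the rank of each positive; assumes no tied scores."""
    ranked = sorted(zip(scores, truth), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


def macro_auprc(y_true: Labels, y_score: Scores) -> float:
    """Unweighted mean of per-class average precision."""
    per_class = []
    for c in range(len(y_true[0])):
        truth = [row[c] for row in y_true]
        if sum(truth) == 0:
            continue  # assumption: classes with no positives are skipped
        per_class.append(average_precision(truth, [row[c] for row in y_score]))
    return sum(per_class) / len(per_class) if per_class else float("nan")


# Toy data, invented for illustration only (not PTB-XL).
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y_score = [[0.9, 0.2, 0.1], [0.6, 0.4, 0.3], [0.2, 0.1, 0.8], [0.7, 0.7, 0.2]]
y_pred = binarize(y_score)

print("Hamming loss   :", round(hamming_loss(y_true, y_pred), 4))
print("Subset accuracy:", round(subset_accuracy(y_true, y_pred), 4))
print("Micro-F1       :", round(micro_f1(y_true, y_pred), 4))
print("Macro-AUPRC    :", round(macro_auprc(y_true, y_score), 4))
```

In the toy data only two of twelve label slots are wrong, yet half of the records fail the exact-match test. This is why a low Hamming loss and a much lower subset accuracy, as in the reported 0.0338 and 0.5723, are not contradictory.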