🤖 AI Summary
This study addresses a limitation of existing multimodal ECG methods: they struggle to preserve both the hierarchical physiological structure of electrocardiogram signals, which ranges from diagnostic categories to waveform morphology, and their clinical semantics. The work formulates ECG representation learning from an information-theoretic perspective and proposes MERIT, a dual-branch pretraining framework that jointly optimizes structural fidelity and semantic alignment through masked ECG modeling and ECG–text contrastive alignment. On PTB-XL and additional benchmarks, the method improves consistently over prior approaches, with gains exceeding 5% F1 on SubClass classification and up to 2.66% zero-shot AUC on PTB-XL SubClass. It also improves downstream ECG-conditioned text generation, as measured by ROUGE and METEOR.
📝 Abstract
Electrocardiograms (ECGs) are widely used non-invasive measurements of cardiac activity and play a central role in clinical diagnosis. Recent multimodal approaches align ECG signals with clinical reports to incorporate diagnostic semantics, but clinical reports often fail to preserve the rich physiological structure of ECG waveforms, particularly across multiple levels of abstraction ranging from coarse diagnostic categories to fine-grained morphology. To address this limitation, we formulate ECG representation learning from an information-theoretic perspective and derive a tractable objective that jointly preserves signal structure and integrates clinical semantics. Based on this principle, we propose MERIT (Multimodal ECG Representation via Information Theory), a dual-branch pretraining framework combining masked ECG modeling with ECG–text contrastive alignment. Extensive experiments on PTB-XL and additional benchmarks demonstrate consistent improvements over prior methods, including gains exceeding 3% F1 on PTB-XL All and 5% F1 on SubClass classification. In zero-shot evaluation, MERIT further improves performance by up to +2.66% AUC and +2.11% F1 on PTB-XL SubClass, while also demonstrating robustness under multiple distribution-shift settings. Moreover, leveraging the learned ECG representations for ECG-conditioned clinical text generation with large language models improves text quality across several metrics, including ROUGE and METEOR. Together, these results demonstrate that MERIT learns more informative and clinically meaningful ECG representations, particularly for fine-grained clinical applications.
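To make the dual-branch idea concrete, the following is a minimal sketch of how a masked-reconstruction term and an ECG-text contrastive term might be combined into one training loss. It is not the MERIT objective, which the abstract does not spell out. Everything here is an assumption made for illustration: the function names (`masked_mse`, `info_nce`, `joint_loss`), mean squared error as the reconstruction loss, a symmetric InfoNCE loss for alignment, the temperature of 0.07, and the scalar `weight` that balances the two branches.

```python
import math


def masked_mse(signal, reconstruction, mask):
    """Mean squared error over masked samples only (mask[i] is True where hidden)."""
    errs = [(s - r) ** 2 for s, r, m in zip(signal, reconstruction, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Average cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(ecg_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (ECG, text) embeddings."""
    e = [_normalize(v) for v in ecg_emb]
    t = [_normalize(v) for v in txt_emb]
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(ei, tj)) / temperature for tj in t]
              for ei in e]
    transposed = [list(col) for col in zip(*logits)]
    # Average the ECG-to-text and text-to-ECG directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))


def joint_loss(signal, reconstruction, mask, ecg_emb, txt_emb, weight=1.0):
    """Hypothetical combined objective: reconstruction + weight * alignment."""
    return masked_mse(signal, reconstruction, mask) + weight * info_nce(ecg_emb, txt_emb)
```

In this sketch the reconstruction term rewards recovering hidden waveform samples, while the contrastive term pulls each ECG embedding toward its paired report and away from the other reports in the batch. In practice the embeddings and reconstruction would come from trained encoders and a decoder, which are omitted here.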