CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

πŸ“… 2026-02-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing multimodal fusion approaches for electrocardiogram (ECG) and clinical text, which often neglect the spatiotemporal dependencies among ECG leads and are susceptible to modality bias from textual inputs, leading to inaccurate diagnostic representations. To overcome these issues, the authors propose a decoupled multimodal ECG representation learning framework that captures fine-grained dynamic features through spatiotemporal masked modeling. The framework integrates contrastive learning with a generative reconstruction mechanism and employs both modality-shared and modality-specific encoders to effectively disentangle modality-invariant and modality-specific information. Extensive experiments on three public datasets demonstrate significant performance gains on downstream tasks, underscoring the method’s advantages in robustness, interpretability, and generalization capability.
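The spatial-temporal masked modeling mentioned above can be illustrated with a minimal NumPy sketch (not the authors' implementation): whole leads are hidden along the spatial axis and fixed-length patches along the temporal axis, leaving the rest for a reconstruction objective. The function name, mask ratios, and patch length here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def spatiotemporal_mask(ecg, lead_mask_ratio=0.25, time_mask_ratio=0.3,
                        patch_len=50, rng=None):
    """Mask whole leads (spatial) and time patches (temporal) of an ECG.

    ecg: array of shape (num_leads, num_samples), e.g. (12, 5000).
    Returns the masked signal and a boolean mask (True = hidden).
    Illustrative sketch only; ratios/patch size are assumptions.
    """
    rng = np.random.default_rng(rng)
    num_leads, num_samples = ecg.shape
    mask = np.zeros_like(ecg, dtype=bool)

    # Spatial masking: hide a random subset of leads entirely.
    n_lead = int(round(num_leads * lead_mask_ratio))
    leads = rng.choice(num_leads, size=n_lead, replace=False)
    mask[leads, :] = True

    # Temporal masking: hide random fixed-length patches across all leads.
    num_patches = num_samples // patch_len
    n_time = int(round(num_patches * time_mask_ratio))
    patches = rng.choice(num_patches, size=n_time, replace=False)
    for p in patches:
        mask[:, p * patch_len:(p + 1) * patch_len] = True

    # Zero out the hidden entries; a decoder would learn to reconstruct them.
    masked = np.where(mask, 0.0, ecg)
    return masked, mask
```

A model trained to reconstruct the hidden entries must exploit both inter-lead redundancy (spatial) and waveform continuity (temporal), which is the intuition behind masking along both axes.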

πŸ“ Abstract
Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
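The disentanglement-and-alignment idea from the abstract can be sketched as follows, assuming (hypothetically) linear shared and modality-specific projections: paired ECG and report features are pushed together in the shared space, while each modality's shared and specific components are penalized for overlapping. The loss forms below are generic illustrations, not the paper's exact objectives.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Row-wise cosine similarity between two batches of vectors.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(-1)

def disentangle_losses(ecg_feat, txt_feat, W_sh_e, W_sh_t, W_sp_e, W_sp_t):
    """Toy losses for shared/specific disentanglement (illustrative only).

    ecg_feat, txt_feat: (batch, dim) features from each modality's backbone.
    W_sh_*: shared-encoder weights; W_sp_*: modality-specific weights.
    """
    # Shared encoders project into a modality-invariant space.
    z_sh_e = ecg_feat @ W_sh_e
    z_sh_t = txt_feat @ W_sh_t
    # Specific encoders keep modality-private information.
    z_sp_e = ecg_feat @ W_sp_e
    z_sp_t = txt_feat @ W_sp_t

    # Alignment: shared views of a paired ECG/report should agree.
    align = (1.0 - cosine(z_sh_e, z_sh_t)).mean()
    # Orthogonality: shared and specific parts of one modality
    # should carry non-overlapping information.
    ortho = (cosine(z_sh_e, z_sp_e) ** 2).mean() \
          + (cosine(z_sh_t, z_sp_t) ** 2).mean()
    return align, ortho
```

Minimizing the alignment term draws modality-invariant content together, while the orthogonality term discourages free-text biases in the report from leaking into the shared representation.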
Problem

Research questions and friction points this paper is trying to address.

multimodal ECG
intra-modality
inter-modality
spatial-temporal dependencies
modality-specific bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive-generative learning
disentangled representation
spatiotemporal masked modeling
multimodal ECG
modality alignment
πŸ‘₯ Authors
Ziwei Niu — Zhejiang University — domain generalization, domain adaptation
Hao Sun — Zhejiang University, Ritsumeikan University — Multimodal Learning, Natural Language Processing, Affective Computing
Shujun Bian — Department of Biomedical Engineering, National University of Singapore, Singapore
Xihong Yang — NUDT & NUS — Graph Neural Network, Recommender System, Multi-modal/view Learning
Lanfen Lin — College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Yuxin Liu — Department of Biomedical Engineering, National University of Singapore, Singapore
Yueming Jin — Assistant Professor, National University of Singapore — Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning