CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

πŸ“… 2026-02-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing multimodal fusion approaches for electrocardiogram (ECG) and clinical text, which often neglect the spatiotemporal dependencies among ECG leads and are susceptible to modality bias from textual inputs, leading to inaccurate diagnostic representations. To overcome these issues, the authors propose a decoupled multimodal ECG representation learning framework that captures fine-grained dynamic features through spatiotemporal masked modeling. The framework integrates contrastive learning with a generative reconstruction mechanism and employs both modality-shared and modality-specific encoders to effectively disentangle modality-invariant and modality-specific information. Extensive experiments on three public datasets demonstrate significant performance gains on downstream tasks, underscoring the method’s advantages in robustness, interpretability, and generalization capability.
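The spatial-temporal masked modeling mentioned above can be illustrated with a minimal NumPy sketch (not the authors' implementation): whole leads are hidden along the spatial axis and fixed-length patches along the temporal axis, leaving the rest for a reconstruction objective. The function name, mask ratios, and patch length here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def spatiotemporal_mask(ecg, lead_mask_ratio=0.25, time_mask_ratio=0.3,
                        patch_len=50, rng=None):
    """Mask whole leads (spatial) and time patches (temporal) of an ECG.

    ecg: array of shape (num_leads, num_samples), e.g. (12, 5000).
    Returns the masked signal and a boolean mask (True = hidden).
    Illustrative sketch only; ratios/patch size are assumptions.
    """
    rng = np.random.default_rng(rng)
    num_leads, num_samples = ecg.shape
    mask = np.zeros_like(ecg, dtype=bool)

    # Spatial masking: hide a random subset of leads entirely.
    n_lead = int(round(num_leads * lead_mask_ratio))
    leads = rng.choice(num_leads, size=n_lead, replace=False)
    mask[leads, :] = True

    # Temporal masking: hide random fixed-length patches across all leads.
    num_patches = num_samples // patch_len
    n_time = int(round(num_patches * time_mask_ratio))
    patches = rng.choice(num_patches, size=n_time, replace=False)
    for p in patches:
        mask[:, p * patch_len:(p + 1) * patch_len] = True

    # Zero out the hidden entries; a decoder would learn to reconstruct them.
    masked = np.where(mask, 0.0, ecg)
    return masked, mask
```

A model trained to reconstruct the hidden entries must exploit both inter-lead redundancy (spatial) and waveform continuity (temporal), which is the intuition behind masking along both axes.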

πŸ“ Abstract
Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
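The disentanglement-and-alignment idea from the abstract can be sketched as follows, assuming (hypothetically) linear shared and modality-specific projections: paired ECG and report features are pushed together in the shared space, while each modality's shared and specific components are penalized for overlapping. The loss forms below are generic illustrations, not the paper's exact objectives.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Row-wise cosine similarity between two batches of vectors.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(-1)

def disentangle_losses(ecg_feat, txt_feat, W_sh_e, W_sh_t, W_sp_e, W_sp_t):
    """Toy losses for shared/specific disentanglement (illustrative only).

    ecg_feat, txt_feat: (batch, dim) features from each modality's backbone.
    W_sh_*: shared-encoder weights; W_sp_*: modality-specific weights.
    """
    # Shared encoders project into a modality-invariant space.
    z_sh_e = ecg_feat @ W_sh_e
    z_sh_t = txt_feat @ W_sh_t
    # Specific encoders keep modality-private information.
    z_sp_e = ecg_feat @ W_sp_e
    z_sp_t = txt_feat @ W_sp_t

    # Alignment: shared views of a paired ECG/report should agree.
    align = (1.0 - cosine(z_sh_e, z_sh_t)).mean()
    # Orthogonality: shared and specific parts of one modality
    # should carry non-overlapping information.
    ortho = (cosine(z_sh_e, z_sp_e) ** 2).mean() \
          + (cosine(z_sh_t, z_sp_t) ** 2).mean()
    return align, ortho
```

Minimizing the alignment term draws modality-invariant content together, while the orthogonality term discourages free-text biases in the report from leaking into the shared representation.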
Problem

Research questions and friction points this paper is trying to address.

multimodal ECG
intra-modality
inter-modality
spatial-temporal dependencies
modality-specific bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive-generative learning
disentangled representation
spatiotemporal masked modeling
multimodal ECG
modality alignment
πŸ‘₯ Authors
Ziwei Niu — Zhejiang University — domain generalization, domain adaptation
Hao Sun — Zhejiang University, Ritsumeikan University — Multimodal Learning, Natural Language Processing, Affective Computing
Shujun Bian — Department of Biomedical Engineering, National University of Singapore, Singapore
Xihong Yang — NUDT & NUS — Graph Neural Network, Recommender System, Multi-modal/view Learning
Lanfen Lin — College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Yuxin Liu — Department of Biomedical Engineering, National University of Singapore, Singapore
Yueming Jin — Assistant Professor, National University of Singapore — Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning