GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) for electrocardiogram (ECG) interpretation suffer from insufficient cross-modal synergy and a lack of fine-grained alignment between textual diagnoses and underlying waveform evidence. To address these limitations, the authors propose GEM, the first tri-modal MLLM integrating ECG time-series signals, 12-lead ECG images, and clinical text. The method introduces a dual-encoder architecture with explicit cross-modal alignment mechanisms and establishes "grounded ECG understanding" as a new task, accompanied by the ECG-Grounding benchmark. It further uses knowledge-guided instruction generation and tri-modal fusion to enable verifiable, fine-grained alignment between diagnostic conclusions and measurable waveform features (e.g., QRS duration, PR interval). Experiments demonstrate significant improvements: +7.4% in predictive performance (CSN), +22.7% in explainability, and +24.8% in grounding accuracy, substantially enhancing clinical trustworthiness and decision-support capability.

📝 Abstract
While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images, and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN 7.4%↑), explainability (22.7%↑), and grounding (24.8%↑), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git
Problem

Research questions and friction points this paper is trying to address.

Insufficient multimodal synergy between ECG time-series signals and visual ECG representations.
Limited explainability: diagnoses are not linked to fine-grained waveform evidence.
No existing benchmark for assessing grounded ECG understanding in MLLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-encoder framework for ECG feature extraction
Cross-modal alignment for multimodal understanding
Knowledge-guided instruction for granular grounding data
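The cross-modal alignment idea above can be illustrated with a toy contrastive objective. This is a minimal, hypothetical sketch in pure Python, not GEM's actual implementation: the paper's encoders are neural networks, stubbed here as fixed feature vectors, and the names `cosine` and `alignment_loss` are illustrative, not from the paper's codebase.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(ts_batch, img_batch, temperature=0.1):
    """InfoNCE-style alignment loss (time series -> image direction):
    each time-series embedding should score highest against the image
    embedding of the same recording, relative to other images in the batch."""
    n = len(ts_batch)
    loss = 0.0
    for i in range(n):
        logits = [cosine(ts_batch[i], img) / temperature for img in img_batch]
        m = max(logits)  # log-sum-exp trick for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]  # -log softmax at the matched index
    return loss / n

# Toy embeddings: matched (ts, image) pairs point in similar directions.
ts = [[1.0, 0.0], [0.0, 1.0]]
img = [[0.9, 0.1], [0.1, 0.9]]
loss = alignment_loss(ts, img)
```

Training with such an objective pulls the two modalities' embeddings of the same ECG together, which is the basic mechanism behind aligning a signal encoder and an image encoder in a shared space.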
Xiang Lan
NC State University
AI4SE
Feng Wu
National University of Singapore
Machine Learning · Medical Time Series
Kai He
Saw Swee Hock School of Public Health and Institute of Data Science, National University of Singapore, Singapore
Qinghao Zhao
Peking University People's Hospital
Shenda Hong
Assistant Professor, Peking University
AI ECG · Biosignal · AI for Digital Health · Health Data Science · AI for Healthcare
Mengling Feng
Saw Swee Hock School of Public Health and Institute of Data Science, National University of Singapore, Singapore