EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Echocardiogram interpretation is labor-intensive and inherently multimodal, requiring guideline-driven quantitative reasoning: capabilities unmet by current vision-language models (VLMs) owing to the scarcity of large-scale clinical image-text data and poor alignment with quantitative measurements. Method: We introduce EchoGround-MIMIC, the first measurement-grounded, guideline-enhanced multimodal echocardiography dataset, and propose EchoVLM, a VLM that treats "measurement anchoring" as a foundational pretraining paradigm. It combines a view-informed contrastive loss with a negation-aware contrastive loss to integrate clinical measurements and guideline logic into representation learning. Contribution/Results: EchoVLM supports diverse downstream tasks, including disease classification, view identification, cardiac chamber segmentation, anatomical landmark detection, and cross-modal image retrieval. It achieves 86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification, outperforming state-of-the-art methods across 36 clinical tasks and establishing the first foundation model for end-to-end echocardiographic interpretation.

📝 Abstract
Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
Problem

Research questions and friction points this paper is trying to address.

Develops a vision-language model for automated echocardiography interpretation
Addresses lack of measurement-grounded multimodal datasets in echocardiography
Enables clinical reasoning that combines images, measurements, and guidelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measurement-grounded multimodal dataset for echocardiography
View-informed and negation-aware contrastive pretraining objectives
State-of-the-art performance in diverse clinical applications
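The page does not spell out the loss formulations behind the two pretraining objectives. The NumPy sketch below illustrates one plausible reading, not the paper's actual implementation: a CLIP-style InfoNCE loss in which same-view negatives are re-weighted (the `same_view_weight` factor and both function names are assumptions), and a variant that appends embeddings of negated captions as extra hard negatives to separate negative from positive findings.

```python
import numpy as np


def info_nce_view_informed(img, txt, views, tau=0.07, same_view_weight=0.5):
    """Illustrative view-informed contrastive loss (assumed form, not the
    paper's exact objective): standard InfoNCE over image-text pairs, with
    negatives that share the echo view of the anchor re-weighted."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                     # (N, N) similarity matrix
    n = len(views)
    # Down-weight same-view negatives; positives on the diagonal keep weight 1.
    w = np.ones((n, n))
    same = views[:, None] == views[None, :]
    w[same & ~np.eye(n, dtype=bool)] = same_view_weight
    exp = np.exp(logits - logits.max(axis=1, keepdims=True)) * w
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(n), np.arange(n)]))


def info_nce_negation_aware(img, txt, txt_negated, tau=0.07):
    """Illustrative negation-aware variant (assumed form): embeddings of
    negated captions are appended as additional hard negatives, so the model
    must distinguish "no pericardial effusion" from "pericardial effusion"."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    all_txt = np.vstack([txt, txt_negated])
    all_txt = all_txt / np.linalg.norm(all_txt, axis=1, keepdims=True)
    logits = img @ all_txt.T / tau                 # (N, 2N); positives on left diag
    n = img.shape[0]
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(n), np.arange(n)]))
```

In a real training loop these would operate on encoder outputs per batch; here the view re-weighting and the negated-caption construction are stand-ins for whatever scheme EchoVLM actually uses.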
Yuheng Li
Department of Biomedical Engineering, Georgia Institute of Technology, USA
Yue Zhang
Digital Technology & Innovation, Siemens Healthineers, USA
Abdoul Aziz Amadou
Siemens Healthcare Limited, Camberley, United Kingdom
Yuxiang Lai
Ph.D. Student in Computer Science, Emory University
Computer Vision, Medical Imaging
Jike Zhong
University of Southern California
Computer Vision, Machine Learning
Tiziano Passerini
Siemens Healthineers
Scientific Computing, Biomechanics
Dorin Comaniciu
Siemens Healthineers
Medical Image Analysis, Medical Image Computing, Image-Guided Interventions, Artificial Intelligence, Computer Vision
Puneet Sharma
Digital Technology & Innovation, Siemens Healthineers, USA