🤖 AI Summary
Echocardiogram interpretation is labor-intensive and inherently multimodal, demanding guideline-driven quantitative reasoning. Current vision-language models (VLMs) fall short here because large-scale clinical image-text data are scarce and the models align poorly with quantitative measurements.
Method: We introduce EchoGround-MIMIC, the first measurement-driven, guideline-enhanced multimodal echocardiography dataset, and propose EchoVLM, a novel VLM architecture that treats “measurement anchoring” as a foundational pretraining paradigm. It combines view-guided contrastive learning with a negation-aware contrastive loss to embed clinical measurements and guideline logic directly into representation learning.
Contribution/Results: EchoVLM supports diverse downstream tasks, including disease classification, view identification, cardiac chamber segmentation, anatomical landmark detection, and cross-modal image retrieval. It achieves 86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification, outperforming state-of-the-art methods across 36 clinical tasks and establishing the first foundation model for end-to-end echocardiographic intelligence.
📝 Abstract
Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a vision-language model that incorporates two novel pretraining objectives: (i) a view-informed contrastive loss that encodes the view-dependent structure of echocardiographic imaging, and (ii) a negation-aware contrastive loss that distinguishes clinically critical negative from positive findings. Across five types of clinical applications with 36 tasks spanning multimodal disease classification, image-text retrieval, view classification, chamber segmentation, and landmark detection, EchoVLM achieves state-of-the-art performance (86.5% AUC in zero-shot disease classification and 95.1% accuracy in view classification). We demonstrate that clinically grounded multimodal pretraining yields transferable visual representations and establish EchoVLM as a foundation model for end-to-end echocardiography interpretation. We will release EchoGround-MIMIC and the data curation code, enabling reproducibility and further research in multimodal echocardiography interpretation.
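The abstract names the two pretraining objectives but does not spell out their formulations. As a rough illustration, both can be read as InfoNCE-style contrastive losses: one that uses the echocardiographic view label to soften same-view negatives, and one that adds a negated version of each caption as an explicit hard negative. The weighting scheme, function names, and negated-caption construction below are assumptions for illustration, not the authors' exact losses.

```python
import numpy as np

def info_nce_row(logits, pos, weights=None):
    """Cross-entropy of one row of a similarity matrix against its positive index,
    with optional per-candidate weights on the denominator terms."""
    if weights is None:
        weights = np.ones_like(logits)
    exp = weights * np.exp(logits - logits.max())  # max-shift for numerical stability
    return -np.log(exp[pos] / exp.sum())

def view_informed_contrastive(img_emb, txt_emb, views, tau=0.1, same_view_weight=0.5):
    """Sketch of a view-informed InfoNCE: negatives sharing the anchor image's
    view (e.g. PLAX, A4C) are down-weighted, since same-view studies are
    semantically closer than cross-view ones. Down-weighting is an assumption."""
    sim = (img_emb @ txt_emb.T) / tau      # embeddings assumed L2-normalized
    views = np.array(views)
    losses = []
    for i in range(sim.shape[0]):
        w = np.where(views == views[i], same_view_weight, 1.0)
        w[i] = 1.0                          # the matched pair keeps full weight
        losses.append(info_nce_row(sim[i], i, w))
    return float(np.mean(losses))

def negation_aware_contrastive(img_emb, txt_emb, neg_txt_emb, tau=0.1):
    """Sketch of a negation-aware InfoNCE: each image's caption with the finding
    negated ("no pericardial effusion") is appended as an explicit hard negative,
    pushing negative findings apart from positive ones."""
    losses = []
    for i in range(img_emb.shape[0]):
        cands = np.vstack([txt_emb, neg_txt_emb[i:i + 1]])  # all captions + own negated caption
        logits = (cands @ img_emb[i]) / tau
        losses.append(info_nce_row(logits, i))
    return float(np.mean(losses))
```

In this reading, the view label only reshapes the negative set rather than adding a separate classification head, and negation handling stays inside the contrastive objective; either choice could differ in the actual EchoVLM implementation.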