Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language pre-training models struggle to adapt to ultrasound imaging due to its high anatomical heterogeneity and complex diagnostic attributes. To address this gap, the authors introduce US-365K, a large-scale ultrasound image–text dataset, along with the Ultrasonographic Diagnostic Taxonomy (UDT), a standardized hierarchical knowledge framework for ultrasound diagnosis. They further propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that integrates semantic soft labels with a diagnosis-attribute-driven heterogeneous graph neural network to enable structured semantic reasoning. As the first image–text pre-training approach designed specifically for ultrasound, Ultrasound-CLIP achieves state-of-the-art performance on both classification and retrieval tasks and generalizes strongly across zero-shot, linear-probing, and fine-tuning evaluation settings.
📝 Abstract
Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models such as CLIP are designed primarily for other modalities and are difficult to apply directly to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish the Ultrasonographic Diagnostic Taxonomy (UDT), which comprises two hierarchical knowledge frameworks: the Ultrasonographic Hierarchical Anatomical Taxonomy, which standardizes anatomical organization, and the Ultrasonographic Diagnostic Attribute Framework (UDAF), which formalizes nine diagnostic dimensions: body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building on these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and a semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization in zero-shot, linear-probing, and fine-tuning settings.
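The abstract's "semantic soft labels" idea can be illustrated with a minimal sketch: a CLIP-style symmetric contrastive loss where the one-hot identity targets are replaced by a row-normalized pairwise semantic-similarity matrix, so that semantically related image–text pairs within a batch are not treated as pure negatives. This is an assumption-laden simplification, not the paper's implementation; the function name, the form of the similarity matrix, and the temperature value are all illustrative.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def soft_label_contrastive_loss(image_emb, text_emb, semantic_sim, temperature=0.07):
    """CLIP-style symmetric cross-entropy with soft targets.

    image_emb, text_emb: (B, D) paired embeddings.
    semantic_sim: (B, B) nonnegative semantic-similarity matrix; row i gives
    how semantically close each text in the batch is to image i (hypothetical).
    """
    # L2-normalize both embedding sets
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) image-to-text logits

    # Row-normalize semantic similarity into soft target distributions
    targets = semantic_sim / semantic_sim.sum(axis=1, keepdims=True)

    # Symmetric loss: image-to-text and text-to-image directions
    loss_i2t = -(targets * log_softmax(logits)).sum(axis=1).mean()
    loss_t2i = -(targets.T * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With `semantic_sim = np.eye(B)` this reduces to the standard CLIP InfoNCE objective; adding off-diagonal mass softens the penalty on semantically similar non-paired samples.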
Problem

Research questions and friction points this paper is trying to address.

ultrasound imaging
vision-language pre-training
image-text understanding
anatomical heterogeneity
diagnostic attributes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ultrasound-CLIP
vision-language pre-training
semantic-aware contrastive learning
ultrasonographic diagnostic taxonomy
heterogeneous graph modality
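The "heterogeneous graph modality" can be sketched as a typed graph linking lesion nodes to attribute-value nodes across the nine UDAF diagnostic dimensions listed in the abstract. The record schema, node naming, and function below are hypothetical illustrations of the general construction, not the authors' actual data format or graph definition.

```python
# The nine UDAF diagnostic dimensions named in the abstract
UDAF_DIMENSIONS = [
    "body_system", "organ", "diagnosis", "shape", "margins",
    "echogenicity", "internal_characteristics",
    "posterior_acoustic_phenomena", "vascularity",
]

def build_hetero_graph(records):
    """Build a lesion-attribute heterogeneous graph (illustrative).

    records: list of dicts mapping a UDAF dimension to its attribute value
    for one lesion. Returns node sets per type and typed edges of the form
    (lesion_id, dimension, attribute_id). Lesions sharing an attribute value
    become connected through the same attribute node.
    """
    nodes = {"lesion": set(), "attribute": set()}
    edges = []
    for i, rec in enumerate(records):
        lesion_id = f"lesion_{i}"
        nodes["lesion"].add(lesion_id)
        for dim in UDAF_DIMENSIONS:
            value = rec.get(dim)
            if value is None:
                continue
            attr_id = f"{dim}:{value}"  # one node per (dimension, value) pair
            nodes["attribute"].add(attr_id)
            edges.append((lesion_id, dim, attr_id))
    return nodes, edges
```

A heterogeneous GNN would then propagate messages along these typed edges so that lesions sharing attributes (e.g. the same echogenicity) exchange information, which is one plausible reading of the paper's structured lesion-attribute reasoning.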
👥 Authors
Jiayun Jin
Hangzhou City University
Haolong Chai
Hangzhou City University
Xueying Huang
Hangzhou City University
Xiaoqing Guo
Assistant Professor at Hong Kong Baptist University; Visiting Fellow at University of Oxford
Medical Image Analysis, Ultrasound, Computer Vision
Zengwei Zheng
Hangzhou City University
Zhan Zhou
Zhejiang University
Junmei Wang
Professor of Computational Chemistry/Biology, School of Pharmacy, University of Pittsburgh
Computational Chemistry, Force Field Development, Computational Biophysics, Drug Design, Pharmacometrics and Systems Pharmacology
Xinyu Wang
The First Affiliated Hospital, Zhejiang University School of Medicine
Jie Liu
City University of Hong Kong
AI4Health, MLLM, Medical Imaging
Binbin Zhou
Hangzhou City University