UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation

πŸ“… 2025-11-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In ultrasound medicine, low-level perception (e.g., segmentation, detection) and high-level clinical interpretation (e.g., diagnosis, reasoning) have long remained disjointed. To bridge this gap, we propose UMind-VL, the first unified vision-language foundation model tailored for ultrasound. Our method introduces: (1) a lightweight dynamic convolutional mask decoder that generates task-adaptive dynamic kernels conditioned on large language model outputs; (2) task-specific tokens enabling end-to-end joint modeling of segmentation, detection, biometric measurement, and diagnostic reasoning; and (3) a multimodal alignment training paradigm, pretrained and fine-tuned on the large-scale ultrasound dataset UMind-DS. Experiments demonstrate that our model surpasses general-purpose multimodal models across multiple benchmarks and matches or exceeds state-of-the-art task-specific models, while exhibiting strong generalization and clinical applicability.

πŸ“ Abstract
Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.
Problem

Research questions and friction points this paper is trying to address.

Bridging ultrasound segmentation and diagnosis tasks
Unifying grounded perception with clinical reasoning
Creating a comprehensive ultrasound vision-language model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for ultrasound segmentation, detection, and diagnosis
Large-scale dataset with pixel annotations and clinical rationales
Dynamic convolutional mask decoder conditioned on LLM outputs
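The dynamic-kernel idea in the last bullet can be sketched roughly as follows: the LLM's hidden state for a segmentation token is projected into per-sample convolution weights, which are then applied over the image feature map to produce a mask. This is a minimal illustrative sketch, not the authors' implementation; the class, names, and dimensions (`DynamicMaskDecoder`, `llm_dim`, `feat_dim`, the `[SEG]` token) are all assumptions.

```python
import torch
import torch.nn as nn

class DynamicMaskDecoder(nn.Module):
    """Illustrative sketch: turn an LLM hidden state (e.g. for a [SEG]
    token) into a dynamic 1x1 conv kernel applied over image features."""

    def __init__(self, llm_dim: int, feat_dim: int):
        super().__init__()
        # Hypothetical projection: LLM hidden state -> kernel weights + bias
        self.kernel_head = nn.Linear(llm_dim, feat_dim + 1)

    def forward(self, seg_token: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # seg_token: (B, llm_dim); feat: (B, feat_dim, H, W)
        params = self.kernel_head(seg_token)             # (B, feat_dim + 1)
        weight, bias = params[:, :-1], params[:, -1]     # one kernel per sample
        # Apply each sample's kernel as a 1x1 conv (dot product per pixel)
        logits = torch.einsum("bc,bchw->bhw", weight, feat)
        logits = logits + bias[:, None, None]
        return logits.sigmoid()                          # (B, H, W) soft mask

# Usage: one mask per segmentation token emitted by the language model
decoder = DynamicMaskDecoder(llm_dim=4096, feat_dim=256)
seg_token = torch.randn(2, 4096)          # hidden states for two [SEG] tokens
feat = torch.randn(2, 256, 64, 64)        # visual feature maps
mask = decoder(seg_token, feat)
print(mask.shape)                         # torch.Size([2, 64, 64])
```

Conditioning the kernel on the LLM output (rather than using a fixed decoder head) is what makes the same lightweight module serve segmentation, detection, and measurement: each task token yields its own kernel over shared features.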
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Dengbo Chen (Yizhun Medical AI Team)
Ziwei Zhao (Yizhun Medical AI Team)
Kexin Zhang (Tsinghua University)
Shishuang Zhao (Yizhun Medical AI Team)
Junjie Hou (Yizhun Medical AI Team)
Yaqian Wang (Yizhun Medical AI Team)
Nianxi Liao (Yizhun Medical AI Team)
Anlan Sun (Yizhun Medical AI Team)
Fei Gao (Yizhun Medical AI Team)
Jia Ding (Yizhun Medical AI Team)
Yuhang Liu (The University of Adelaide)
Dong Wang (Yizhun Medical AI Team)