🤖 AI Summary
In ultrasound medicine, low-level perception (e.g., segmentation, detection) and high-level clinical interpretation (e.g., diagnosis, reasoning) have long remained disjointed. To bridge this gap, we propose the first unified vision-language foundation model tailored for ultrasound. Our method introduces: (1) a lightweight dynamic convolutional mask decoder that generates task-adaptive dynamic kernels conditioned on large language model outputs; (2) task-specific tokens enabling end-to-end joint modeling of segmentation, detection, biometric measurement, and diagnostic reasoning; and (3) a multimodal alignment training paradigm, pretrained and fine-tuned on the large-scale ultrasound dataset UMind-DS. Experiments demonstrate that our model surpasses general-purpose multimodal models across multiple benchmarks and matches or exceeds state-of-the-art task-specific models, while exhibiting strong generalization and clinical applicability.
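To make the dynamic-kernel idea concrete, here is a minimal PyTorch sketch of a mask decoder conditioned on an LLM token state. The class name `DynamicConvMaskDecoder`, the layer sizes, and the use of a single 1×1 dynamic kernel are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DynamicConvMaskDecoder(nn.Module):
    """Sketch: map an LLM task-token hidden state to a per-sample
    1x1 dynamic convolution applied to visual features to get a mask.
    All dimensions are illustrative placeholders."""

    def __init__(self, llm_dim: int = 4096, feat_dim: int = 256):
        super().__init__()
        self.feat_dim = feat_dim
        # Small MLP predicts the dynamic kernel weights plus one bias term.
        self.kernel_head = nn.Sequential(
            nn.Linear(llm_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim + 1),
        )

    def forward(self, token_state: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # token_state: (B, llm_dim) hidden state of the emitted task token
        # feats:       (B, feat_dim, H, W) visual encoder features
        params = self.kernel_head(token_state)        # (B, feat_dim + 1)
        weight = params[:, : self.feat_dim]           # per-sample 1x1 kernel
        bias = params[:, self.feat_dim]               # per-sample bias
        # Apply each sample's own kernel to its own feature map.
        logits = torch.einsum("bc,bchw->bhw", weight, feats)
        return logits + bias[:, None, None]           # (B, H, W) mask logits


# Usage: one task token conditions one mask prediction.
decoder = DynamicConvMaskDecoder()
token = torch.randn(2, 4096)          # LLM hidden states for a mask token
feats = torch.randn(2, 256, 64, 64)   # image encoder features
masks = decoder(token, feats).sigmoid()
```

Keeping the decoder to a kernel-prediction MLP plus a lightweight convolution is what makes this design cheap relative to a full transformer mask head.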
📄 Abstract
Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To close this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. Figure 1 illustrates the capabilities of UMind-VL.
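The task-specific-token interface can be pictured as a simple routing step: each special token the LLM emits selects a downstream head that consumes that token's hidden state. The token names (`[SEG]`, `[DET]`, `[KPT]`) and the `route` helper below are hypothetical placeholders for illustration, not the paper's actual vocabulary.

```python
# Hypothetical mapping from special tokens to task heads.
TASK_TOKENS = {
    "[SEG]": "segmentation",  # -> dynamic conv mask decoder
    "[DET]": "detection",     # -> box regression head
    "[KPT]": "measurement",   # -> keypoint / caliper head
}


def route(output_tokens, hidden_states):
    """Pair each emitted task token with its hidden state so the
    matching head can decode it (sketch under assumed token names)."""
    jobs = []
    for i, tok in enumerate(output_tokens):
        if tok in TASK_TOKENS:
            jobs.append((TASK_TOKENS[tok], hidden_states[i]))
    return jobs


# Usage: a response mixing free text and one segmentation request.
tokens = ["The", "lesion", "is", "here", "[SEG]"]
states = [f"h{i}" for i in range(len(tokens))]  # stand-ins for hidden states
print(route(tokens, states))  # [('segmentation', 'h4')]
```

Because every task is expressed through the same token stream, perception and interpretation share one end-to-end model rather than separate pipelines.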