🤖 AI Summary
In ultrasound medicine, low-level perception (e.g., segmentation, detection) and high-level clinical interpretation (e.g., diagnosis, reasoning) have long remained disjointed. To bridge this gap, we propose the first unified vision-language foundation model tailored for ultrasound. Our method introduces: (1) a lightweight dynamic convolutional mask decoder that generates task-adaptive dynamic kernels conditioned on large language model outputs; (2) task-specific tokens enabling end-to-end joint modeling of segmentation, detection, biometric measurement, and diagnostic reasoning; and (3) a multimodal alignment training paradigm, pretrained and fine-tuned on the large-scale ultrasound dataset UMind-DS. Experiments demonstrate that our model surpasses general-purpose multimodal models across multiple benchmarks and matches or exceeds state-of-the-art task-specific models, while exhibiting strong generalization and clinical applicability.
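To make the dynamic-kernel idea concrete, here is a minimal PyTorch sketch of a mask decoder conditioned on an LLM token state. The class name `DynamicConvMaskDecoder`, the layer sizes, and the use of a single 1×1 dynamic kernel are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DynamicConvMaskDecoder(nn.Module):
    """Sketch: map an LLM task-token hidden state to a per-sample
    1x1 dynamic convolution applied to visual features to get a mask.
    All dimensions are illustrative placeholders."""

    def __init__(self, llm_dim: int = 4096, feat_dim: int = 256):
        super().__init__()
        self.feat_dim = feat_dim
        # Small MLP predicts the dynamic kernel weights plus one bias term.
        self.kernel_head = nn.Sequential(
            nn.Linear(llm_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim + 1),
        )

    def forward(self, token_state: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # token_state: (B, llm_dim) hidden state of the emitted task token
        # feats:       (B, feat_dim, H, W) visual encoder features
        params = self.kernel_head(token_state)        # (B, feat_dim + 1)
        weight = params[:, : self.feat_dim]           # per-sample 1x1 kernel
        bias = params[:, self.feat_dim]               # per-sample bias
        # Apply each sample's own kernel to its own feature map.
        logits = torch.einsum("bc,bchw->bhw", weight, feats)
        return logits + bias[:, None, None]           # (B, H, W) mask logits


# Usage: one task token conditions one mask prediction.
decoder = DynamicConvMaskDecoder()
token = torch.randn(2, 4096)          # LLM hidden states for a mask token
feats = torch.randn(2, 256, 64, 64)   # image encoder features
masks = decoder(token, feats).sigmoid()
```

Keeping the decoder to a kernel-prediction MLP plus a lightweight convolution is what makes this design cheap relative to a full transformer mask head.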
📄 Abstract
Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To close this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. Figure 1 illustrates the capabilities of UMind-VL.
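The task-specific-token interface can be pictured as a simple routing step: each special token the LLM emits selects a downstream head that consumes that token's hidden state. The token names (`[SEG]`, `[DET]`, `[KPT]`) and the `route` helper below are hypothetical placeholders for illustration, not the paper's actual vocabulary.

```python
# Hypothetical mapping from special tokens to task heads.
TASK_TOKENS = {
    "[SEG]": "segmentation",  # -> dynamic conv mask decoder
    "[DET]": "detection",     # -> box regression head
    "[KPT]": "measurement",   # -> keypoint / caliper head
}


def route(output_tokens, hidden_states):
    """Pair each emitted task token with its hidden state so the
    matching head can decode it (sketch under assumed token names)."""
    jobs = []
    for i, tok in enumerate(output_tokens):
        if tok in TASK_TOKENS:
            jobs.append((TASK_TOKENS[tok], hidden_states[i]))
    return jobs


# Usage: a response mixing free text and one segmentation request.
tokens = ["The", "lesion", "is", "here", "[SEG]"]
states = [f"h{i}" for i in range(len(tokens))]  # stand-ins for hidden states
print(route(tokens, states))  # [('segmentation', 'h4')]
```

Because every task is expressed through the same token stream, perception and interpretation share one end-to-end model rather than separate pipelines.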