EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ultrasound diagnosis suffers from high operator dependency and subjectivity, while general-purpose multimodal models generalize poorly and handle multi-task medical workloads inefficiently. To address these challenges, we introduce EchoVLM, the first large vision-language model designed specifically for ultrasound. We propose a dynamic Mixture-of-Experts (MoE) architecture that jointly models seven anatomical regions, combining vision-language pretraining, cross-modal alignment, and fine-tuning on large-scale ultrasound image-text data. The resulting model provides unified medical understanding across organs and tasks, supporting ultrasound report generation, diagnostic analysis, and visual question answering. Experiments show substantial gains: on ultrasound report generation, EchoVLM improves over Qwen2-VL by +10.15 BLEU-1 and +4.77 ROUGE-1, supporting its clinical applicability.

📝 Abstract
Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis relies heavily on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model employs a Mixture-of-Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 and ROUGE-1 scores, respectively, compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited ultrasound knowledge in general vision-language models
Improves multi-organ lesion recognition and diagnostic generalization
Enhances efficiency across ultrasound report generation and VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Mixture-of-Experts architecture tailored to ultrasound imaging
Trained on data spanning seven anatomical regions
Supports report generation, diagnostic analysis, and VQA
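The dynamic MoE idea above can be sketched in a few lines: a learned gate scores each image embedding against every region expert, keeps the top-k experts, and combines their outputs with renormalized weights. This is a minimal illustrative sketch only; the region names, top-k routing, and all shapes are assumptions for the example, not EchoVLM's actual implementation.

```python
import math

# Hypothetical labels for the seven anatomical regions (the paper does not
# enumerate them here); each gets one expert network.
REGIONS = ["thyroid", "breast", "liver", "kidney", "heart", "vessel", "obstetric"]

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(embedding, gate_weights, experts, top_k=2):
    """Route one image embedding through the top-k region experts.

    embedding:    list[float], the visual feature vector
    gate_weights: one weight vector per expert (dot-product gate)
    experts:      list of callables, each mapping embedding -> vector
    """
    # One gate logit per expert: dot product of embedding and gate vector.
    logits = [sum(w * x for w, x in zip(wv, embedding)) for wv in gate_weights]
    probs = softmax(logits)
    # Keep only the top-k experts and renormalize their gate weights.
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(embedding)
    for i in top:
        y = experts[i](embedding)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    # Return the mixed output and the raw gate probabilities of chosen experts.
    return out, {REGIONS[i]: probs[i] for i in top}
```

In a full model the gate and experts would be trained jointly with the language backbone; sparse top-k routing keeps per-token compute close to a single expert while letting different regions specialize.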