Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

📅 2025-06-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In medical ultrasound image analysis, manual region-of-interest (ROI) delineation is time-consuming, highly subjective, and suffers from poor inter-observer reproducibility; meanwhile, existing vision-language foundation models perform poorly on ultrasound data because of the substantial domain gap between natural images and ultrasound. To address these challenges, we propose the first ultrasound-specific adaptation framework for vision-language foundation models. Our approach integrates a large language model as a text refiner, coupled with an ultrasound-domain adapter and a multi-task head jointly optimized for segmentation and classification. Extensive experiments across six public ultrasound datasets show that our method consistently outperforms state-of-the-art vision-language and pure-vision baselines, with significant improvements in both segmentation accuracy and diagnostic classification performance. The source code is publicly available.
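To make the pipeline concrete, here is a minimal PyTorch sketch of the adaptation idea described in the summary. It is not the authors' released code: the encoder interface, feature dimension, adapter bottleneck size, and both head designs are illustrative assumptions (the real architecture is in the linked repository).

```python
import torch
import torch.nn as nn

class UltrasoundAdapter(nn.Module):
    """Residual bottleneck adapter inserted after the frozen encoder."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only these few parameters are trained; the backbone stays frozen.
        return x + self.up(self.act(self.down(x)))

class MultiTaskUltrasoundModel(nn.Module):
    """Frozen VLM image encoder + adapter + segmentation/classification heads."""
    def __init__(self, encoder: nn.Module, dim: int = 512,
                 num_classes: int = 2, mask_size: int = 224):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # keep foundation weights frozen
            p.requires_grad = False
        self.adapter = UltrasoundAdapter(dim)
        self.cls_head = nn.Linear(dim, num_classes)
        # Toy segmentation head: project the global feature to a coarse
        # logit map, then upsample to the input resolution.
        side = mask_size // 8
        self.seg_head = nn.Sequential(
            nn.Linear(dim, side * side),
            nn.Unflatten(1, (1, side, side)),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
        )

    def forward(self, images: torch.Tensor):
        # encoder is assumed to map (B, 3, H, W) -> (B, dim) image embeddings
        feats = self.adapter(self.encoder(images))
        return self.seg_head(feats), self.cls_head(feats)
```

Freezing the foundation weights and training only the adapter and the two heads is what keeps adaptation to ultrasound data affordable in a setup like this.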

πŸ“ Abstract
Medical ultrasonography is an essential imaging technique for examining superficial organs and tissues, including lymph nodes, breast, and thyroid. It employs high-frequency ultrasound waves to generate detailed images of the internal structures of the human body. However, manually contouring regions of interest in these images is a labor-intensive task that demands expertise and often results in inconsistent interpretations among individuals. Vision-language foundation models, which have excelled in various computer vision applications, present new opportunities for enhancing ultrasound image analysis. Yet, their performance is hindered by the significant differences between natural and medical imaging domains. This research seeks to overcome these challenges by developing domain adaptation methods for vision-language foundation models. In this study, we explore a fine-tuning pipeline for vision-language foundation models that utilizes a large language model as a text refiner together with specially designed adaptation strategies and task-driven heads. Our approach has been extensively evaluated on six ultrasound datasets and two tasks: segmentation and classification. The experimental results show that our method can effectively improve the performance of vision-language foundation models for ultrasound image analysis and outperforms existing state-of-the-art vision-language and pure-vision foundation models. The source code of this study is available on GitHub: https://github.com/jinggqu/NextGen-UIA
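Since the two evaluation tasks share one backbone, a single training step can optimize both objectives jointly. The sketch below assumes a model shaped like the one above (returning segmentation logits and class logits); the loss weighting `lam` and the specific loss functions are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(model, images, masks, labels, optimizer, lam: float = 0.5):
    """One joint optimization step; `lam` balances the two task losses."""
    seg_logits, cls_logits = model(images)          # (B,1,H,W), (B,C)
    loss = (F.binary_cross_entropy_with_logits(seg_logits, masks.float())
            + lam * F.cross_entropy(cls_logits, labels))
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into the adapter and the heads
    optimizer.step()
    return loss.item()
```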
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models for medical ultrasound analysis
Reducing manual labor in ultrasound image contouring
Bridging domain gaps between natural and medical images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain adaptation for vision-language models
Fine-tuning with a large language model as a text refiner (see the sketch below)
Task-driven heads for ultrasound segmentation and classification
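The "large language model as text refiner" idea can be illustrated as follows: a terse class label is expanded into richer, ultrasound-specific captions before being fed to the vision-language model's text encoder. The function below is a hypothetical stand-in; the prompt wording and the `ask_llm` interface are assumptions, not the paper's actual prompts.

```python
from typing import Callable, List

def refine_label(label: str, ask_llm: Callable[[str], str],
                 n_variants: int = 3) -> List[str]:
    """Expand a bare class label into richer ultrasound-style captions."""
    prompt = (
        f"Describe the typical B-mode ultrasound appearance of '{label}' "
        "in one short sentence suitable as an image-text matching caption."
    )
    # Query the LLM several times to collect diverse refined prompts.
    return [ask_llm(prompt) for _ in range(n_variants)]

# Usage: the refined captions are encoded by the VLM text encoder and can
# be averaged into a single text embedding for the downstream tasks.
# captions = refine_label("malignant thyroid nodule", ask_llm=my_llm)
```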
Authors
Jingguo Qu
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong, China
Xinyang Han
Southern University of Science and Technology
Tonghuan Xiao
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong, China
Jia Ai
Suzhou Hospital of Traditional Chinese Medicine Affiliated to Nanjing University of Chinese Medicine, Suzhou, China
Juan Wu
Suzhou Hospital of Traditional Chinese Medicine Affiliated to Nanjing University of Chinese Medicine, Suzhou, China
Tong Zhao
Department of Ultrasound, The Affiliated Changzhou No. 2 People's Hospital of Nanjing Medical University, Changzhou, China
Jing Qin
University of Southern Denmark
Ann Dorothy King
Department of Imaging and Interventional Radiology, The Chinese University of Hong Kong, Hong Kong, China
Winnie Chiu-Wing Chu
Department of Imaging and Interventional Radiology, The Chinese University of Hong Kong, Hong Kong, China
Jing Cai
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong, China
Michael Tin-Cheung Ying
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong, China