Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ultrasound report generation is difficult because images vary widely, acquisition depends heavily on the operator, and report text must be standardized; unlike X-ray and CT, ultrasound also lacks consistent datasets. The paper proposes a unified framework for multi-organ, bilingual (Chinese-English) ultrasound report generation, built on multimodal large language models. It combines fragment-based multilingual training, which exploits the standardized, modular structure of ultrasound reports and aligns text fragments with imaging data, with fine-tuning that selectively unfreezes the Vision Transformer (ViT) to improve text-image alignment. Compared with the previous state-of-the-art KMVE method, the approach reports relative gains of about 2% in BLEU, about 3% in ROUGE-L, and about 15% in CIDEr, along with significantly fewer errors such as missing or incorrect content. The authors present this as evidence of strong potential for real-world clinical workflows.

📝 Abstract
Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2% in BLEU scores, approximately 3% in ROUGE-L, and about 15% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.
Problem

Research questions and friction points this paper is trying to address.

Generating standardized ultrasound reports from variable images
Overcoming lack of consistent datasets in ultrasound imaging
Unifying multi-organ and multilingual report generation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fragment-based multilingual training for standardized reports
Aligning modular text fragments with diverse imaging data
Fine-tuning ViT with selective unfreezing for better alignment
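
The selective-unfreezing idea listed above can be illustrated with a minimal sketch. The summary does not say which ViT layers the authors unfreeze, so everything here is an assumption made for illustration: the `blocks.<index>.` parameter naming, the choice to train only the last few transformer blocks plus the final norm, and the helper name `select_trainable` are all hypothetical and are not taken from the paper.

```python
import re


def select_trainable(param_names, num_blocks, unfreeze_last=2,
                     always_train=("norm.",)):
    """Return the set of parameter names that stay trainable.

    Assumption (not from the paper): transformer block parameters are named
    "blocks.<index>.<...>", and only the last `unfreeze_last` blocks plus any
    parameter whose name starts with a prefix in `always_train` are unfrozen.
    """
    first_unfrozen = max(num_blocks - unfreeze_last, 0)
    block_pattern = re.compile(r"^blocks\.(\d+)\.")
    trainable = set()
    for name in param_names:
        match = block_pattern.match(name)
        if match:
            # Parameter belongs to a transformer block: keep it trainable
            # only if the block is among the last `unfreeze_last` ones.
            if int(match.group(1)) >= first_unfrozen:
                trainable.add(name)
        elif name.startswith(always_train):
            trainable.add(name)
    return trainable


# Hypothetical parameter names for a 4-block encoder.
names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.1.attn.qkv.weight",
    "blocks.2.mlp.fc1.weight",
    "blocks.3.mlp.fc1.weight",
    "norm.weight",
]
trainable = select_trainable(names, num_blocks=4, unfreeze_last=1)
for name in names:
    print(f"{name}: {'train' if name in trainable else 'frozen'}")
```

With a PyTorch model, the same selection could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = name in trainable`; which layers are worth unfreezing in practice is a tuning decision that this sketch does not settle.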
Authors
Peixuan Ge
Shenzhen Institutes of Advanced Technology, University of Macau
Tongkun Su
Shenzhen Institutes of Advanced Technology
Faqin Lv
Chinese PLA General Hospital
Baoliang Zhao
Shenzhen Institutes of Advanced Technology
Peng Zhang
Shenzhen Institutes of Advanced Technology
Chi Hong Wong
Shenzhen Institutes of Advanced Technology
Liang Yao
Shenzhen Institutes of Advanced Technology
Yu Sun
Shenzhen Institutes of Advanced Technology
Zenan Wang
Shenzhen Institutes of Advanced Technology
Pak Kin Wong
University of Macau
Ying Hu
Professor of Mathematics, Université Rennes
stochastic analysis, control and optimization, mathematical finance