🤖 AI Summary
Ultrasound report generation is difficult because of high image variability, strong operator dependence, and the need for standardized text, and it is further hampered by a lack of consistent multi-organ, multilingual datasets. To address this, the authors propose a unified framework for multi-organ, bilingual (Chinese–English) ultrasound report generation. The method combines fragment-based multilingual training, which exploits the standardized structure of ultrasound reports and aligns modular text fragments with imaging data, with fine-tuning that selectively unfreezes the Vision Transformer (ViT) to improve text-image alignment. Built upon multimodal large language models and ViT architectures, the framework enables fine-grained textual modeling. Compared with the KMVE baseline, the approach achieves relative gains of approximately 2% in BLEU, 3% in ROUGE-L, and 15% in CIDEr, while significantly reducing errors such as missing or incorrect content, which indicates strong potential for real-world clinical workflows.
📝 Abstract
Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2% in BLEU scores, approximately 3% in ROUGE-L, and about 15% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.
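To make "selective unfreezing" concrete, here is a minimal, dependency-free sketch of one common way to do it: freeze every encoder parameter, then re-enable gradients only for the last few transformer blocks. The abstract does not say which layers are unfrozen, so the choice of "last `last_k` blocks", the function name `selectively_unfreeze`, and the `blocks.<index>.` parameter naming pattern are all assumptions made for illustration, not the authors' configuration.

```python
import re


def selectively_unfreeze(named_params, num_blocks, last_k,
                         block_pattern=r"blocks\.(\d+)\."):
    """Freeze all parameters, then unfreeze the last `last_k` blocks.

    `named_params` is any iterable of (name, param) pairs where `param`
    has a writable `requires_grad` attribute. The default `block_pattern`
    assumes names such as "blocks.11.attn.qkv.weight" (hypothetical).
    Returns the names of the parameters left trainable.
    """
    first_trainable = num_blocks - last_k
    trainable = []
    for name, param in named_params:
        match = re.search(block_pattern, name)
        in_top_block = (match is not None
                        and int(match.group(1)) >= first_trainable)
        # Parameters outside the top blocks (including the patch
        # embedding) stay frozen.
        param.requires_grad = in_top_block
        if in_top_block:
            trainable.append(name)
    return trainable


class FakeParam:
    """Stand-in for a framework tensor; only carries requires_grad."""

    def __init__(self):
        self.requires_grad = True


# Hypothetical 12-block encoder with one parameter per block.
params = [("patch_embed.proj.weight", FakeParam())]
params += [(f"blocks.{i}.attn.qkv.weight", FakeParam()) for i in range(12)]

trainable = selectively_unfreeze(params, num_blocks=12, last_k=2)
print(trainable)
```

Because the function only relies on `(name, param)` pairs with a `requires_grad` flag, the same logic should apply to the output of a PyTorch module's `named_parameters()`, provided the block naming matches the pattern passed in. Keeping the lower blocks frozen preserves generic visual features while letting the upper blocks adapt to the report text.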