Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

📅 2025-07-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) generate radiology reports with fluent language but insufficient clinical accuracy, primarily due to weak image–text cross-modal alignment and inadequate disease semantic modeling. To address this, we propose an optimal transport (OT)-driven cross-modal alignment framework: (1) OT is employed to explicitly align X-ray image features with fine-grained disease labels, establishing robust visual–semantic correspondences; (2) a dedicated disease prediction module is integrated to enhance pathological awareness; and (3) the resulting aligned representations guide fine-grained instruction tuning of the LLM. Evaluated on MIMIC-CXR and IU X-Ray, our method achieves state-of-the-art performance—improving clinical accuracy (e.g., +5.2% recall for critical findings) and natural language quality (BLEU-4 +2.8). This work is the first to introduce optimal transport into radiology report generation, enabling synergistic optimization of cross-modal alignment and clinical reasoning.

Technology Category

Application Category

📝 Abstract
Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, where OT utilizes results from the encoding of label features and image visual features to minimize cross-modal distances, then integrating image and text features for LLMs fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
Problem

Research questions and friction points this paper is trying to address.

Aligns X-ray images with clinical text using Optimal Transport
Improves clinical accuracy of radiology reports via LLMs
Bridges vision-language gap for better diagnostic report generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Optimal Transport to align image-text features
Integrates image and text features for LLM fine-tuning
Includes disease prediction module for clinical accuracy
🔎 Similar Papers
No similar papers found.
H
Haifeng Zhao
School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
Y
Yufei Zhang
School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
L
Leilei Ma
School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
S
Shuo Xu
School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
Dengdi Sun
Dengdi Sun
Anhui University
Machine LearningComputer Vision