Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the clinical need for automated generation of structured radiology reports from medical images, this paper proposes a multimodal framework that freezes a large language model (e.g., LLaMA) and couples it with a trainable Vision Transformer (ViT) visual encoder. The core contribution is a vision-feature-driven, instance-level dynamic prompt customization mechanism, presented as the first of its kind, realized via two paradigms (prompt-wise and promptbook-wise) that employ conditional affine transformations to generate visually conditioned prompts, thereby overcoming limitations of static prompting and end-to-end fine-tuning. Additionally, a multi-stage contrastive alignment training strategy is introduced to enhance cross-modal semantic consistency. Evaluated on IU X-ray and MIMIC-CXR, the method achieves state-of-the-art performance, with +2.3% BLEU-4 and +4.1% CIDEr over prior work and notable improvements in the clinical accuracy and anatomical consistency of generated reports.

📝 Abstract
Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Generating medical reports from imaging data remains a challenging task in clinical practice
Effective integration of large language models with medical imaging data is still under-explored
Static prompts are too coarse; instance-specific prompts are needed for precise, targeted report generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines a frozen LLM with a learnable visual encoder
Introduces a dynamic, instance-level prompt customization mechanism
Generates instance-specific prompts via conditional affine transformations derived from visual features
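The core idea in the bullets above can be sketched as a FiLM-style conditional affine transform: pooled visual features predict a scale and shift that customize a learned set of prompt embeddings per image. This is a minimal illustration, not the paper's implementation; all dimensions, the mean pooling, and the projection matrices (`W_gamma`, `W_beta`) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper)
d_vis, d_prompt, n_prompts = 16, 8, 4

# Learned "promptbook": a shared set of base prompt embeddings
promptbook = rng.normal(size=(n_prompts, d_prompt))

# Lightweight projections mapping visual features to affine parameters
W_gamma = rng.normal(size=(d_vis, d_prompt)) * 0.1
W_beta = rng.normal(size=(d_vis, d_prompt)) * 0.1

def customize_prompts(patch_feats):
    """Instance-level prompt customization via a conditional affine
    transform: prompt' = gamma(v) * prompt + beta(v), where v is a
    pooled visual feature vector for one image."""
    v = patch_feats.mean(axis=0)       # pool ViT patch features to one vector
    gamma = 1.0 + v @ W_gamma          # scale, centered at identity
    beta = v @ W_beta                  # shift
    # Broadcast the per-image (gamma, beta) over every prompt embedding
    return gamma * promptbook + beta

patch_feats = rng.normal(size=(10, d_vis))   # e.g., 10 ViT patch embeddings
custom_prompts = customize_prompts(patch_feats)
print(custom_prompts.shape)  # (4, 8): one customized promptbook per image
```

In a full model, the customized prompts would be prepended to the frozen LLM's input sequence, so only the visual encoder and the affine projections need gradients.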