🤖 AI Summary
Automated colonoscopy polyp reporting suffers from inconsistency and hallucination due to the scarcity of high-quality multimodal medical data. Method: We propose the first multimodal report-generation framework to combine LoRA-based parameter-efficient fine-tuning with clinical preference alignment via Direct Preference Optimization (DPO), built on Qwen2-VL-7B. We introduce a novel medical image–text alignment mechanism and release MMEndo, the first expert-annotated endoscopic image–text dataset. Contribution/Results: Our model outperforms all baselines on both automated metrics and clinical expert evaluation (Physician Score 7.2/10), while reducing training cost by 833× compared to full-parameter fine-tuning. Cross-dataset validation on IU-XRay demonstrates strong generalization and robustness. This work enhances the clinical trustworthiness and deployment feasibility of automated colonoscopy reporting.
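The cost reduction comes from LoRA's low-rank update: the frozen pretrained weight W is augmented with a trainable product (α/r)·BA, so only r·(d_in + d_out) parameters per adapted matrix are trained instead of d_in·d_out. A minimal sketch below illustrates this for a single projection matrix; the dimensions, rank, and scaling are illustrative assumptions, not the paper's actual configuration (the 833× figure reported above covers the whole model, not one matrix):

```python
import numpy as np

# Hypothetical dimensions for one attention projection in a 7B-class model
# (illustrative values, not the paper's exact LoRA configuration).
d_in, d_out, r, alpha = 4096, 4096, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so the update starts at 0

def lora_forward(x):
    # Frozen weight plus the scaled low-rank correction (alpha/r) * B @ A @ x.
    return W @ x + (alpha / r) * (B @ (A @ x))

# Trainable-parameter comparison for this single matrix.
full_params = d_out * d_in              # full fine-tuning
lora_params = r * (d_in + d_out)        # LoRA adapters only
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x")
```

Because B starts at zero, the adapted model is exactly the pretrained model at step 0, and only A and B receive gradients during fine-tuning.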
📝 Abstract
Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework that leverages multimodal large language models (MLLMs) to generate professional polyp diagnosis reports. Specifically, we curate MMEndo, a multimodal endoscopic dataset of expert-annotated colonoscopy image–text pairs. We fine-tune the Qwen2-VL-7B backbone with parameter-efficient LoRA adapters and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluation (achieving a Physician Score of 7.2/10), while reducing training compute by 833× compared to full-parameter fine-tuning. Additional validation on the IU-XRay dataset confirms its robustness, and the proposed solution offers a scalable, clinically viable path for primary healthcare.
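The DPO alignment step optimizes the model directly on clinician preference pairs (a preferred report y_w and a rejected report y_l for the same image) without a separate reward model. A minimal sketch of the standard per-pair DPO loss is given below; the function name and the example log-probabilities are illustrative, and this is the generic DPO formulation rather than the paper's exact training code:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) report pair.

    logp_w / logp_l       : policy log-probs of the chosen / rejected report
    ref_logp_w / ref_logp_l: frozen reference-model log-probs of the same reports
    beta                  : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # report than the reference model does, minus the same gap for the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss = -log sigmoid(margin); minimized by widening the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy assigns the clinician-preferred report a
# higher log-prob than the reference does, and the rejected report a lower one.
print(dpo_loss(logp_w=-5.0, logp_l=-15.0, ref_logp_w=-10.0, ref_logp_l=-10.0))
```

When the policy and reference agree (zero margin) the loss equals log 2; pushing probability toward clinician-preferred reports drives it below that, which is what aligns generation with clinical standards here.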