🤖 AI Summary
Automated colonoscopy polyp reporting suffers from inconsistency and hallucination due to the scarcity of high-quality multimodal medical data. Method: We propose the first multimodal report-generation framework to combine LoRA-based parameter-efficient fine-tuning with clinical preference alignment via Direct Preference Optimization (DPO), built on Qwen2-VL-7B. We introduce a novel medical image–text alignment mechanism and release MMEndo, the first expert-annotated endoscopic image–text dataset. Contribution/Results: Our model outperforms all baselines on both automated metrics and clinical expert evaluation (Physician Score 7.2/10), while reducing training cost by 833× compared to full-parameter fine-tuning. Cross-dataset validation on IU-XRay demonstrates strong generalization and robustness. This work enhances the clinical trustworthiness and deployment feasibility of automated colonoscopy reporting.
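The cost reduction comes from LoRA's low-rank update: the frozen pretrained weight W is augmented with a trainable product (α/r)·BA, so only r·(d_in + d_out) parameters per adapted matrix are trained instead of d_in·d_out. A minimal sketch below illustrates this for a single projection matrix; the dimensions, rank, and scaling are illustrative assumptions, not the paper's actual configuration (the 833× figure reported above covers the whole model, not one matrix):

```python
import numpy as np

# Hypothetical dimensions for one attention projection in a 7B-class model
# (illustrative values, not the paper's exact LoRA configuration).
d_in, d_out, r, alpha = 4096, 4096, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so the update starts at 0

def lora_forward(x):
    # Frozen weight plus the scaled low-rank correction (alpha/r) * B @ A @ x.
    return W @ x + (alpha / r) * (B @ (A @ x))

# Trainable-parameter comparison for this single matrix.
full_params = d_out * d_in              # full fine-tuning
lora_params = r * (d_in + d_out)        # LoRA adapters only
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x")
```

Because B starts at zero, the adapted model is exactly the pretrained model at step 0, and only A and B receive gradients during fine-tuning.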
📝 Abstract
Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework that leverages multimodal large language models (MLLMs) to generate professional polyp diagnosis reports. Specifically, we curate MMEndo, a multimodal endoscopic dataset of expert-annotated colonoscopy image–text pairs. We fine-tune the Qwen2-VL-7B backbone with parameter-efficient LoRA adapters and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluation (achieving a Physician Score of 7.2/10), while reducing training compute by 833× compared to full-parameter fine-tuning. Additional validation on the IU-XRay dataset confirms its robustness, and the proposed solution offers a scalable, clinically viable path for primary healthcare.
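The DPO alignment step optimizes the model directly on clinician preference pairs (a preferred report y_w and a rejected report y_l for the same image) without a separate reward model. A minimal sketch of the standard per-pair DPO loss is given below; the function name and the example log-probabilities are illustrative, and this is the generic DPO formulation rather than the paper's exact training code:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) report pair.

    logp_w / logp_l       : policy log-probs of the chosen / rejected report
    ref_logp_w / ref_logp_l: frozen reference-model log-probs of the same reports
    beta                  : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # report than the reference model does, minus the same gap for the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss = -log sigmoid(margin); minimized by widening the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy assigns the clinician-preferred report a
# higher log-prob than the reference does, and the rejected report a lower one.
print(dpo_loss(logp_w=-5.0, logp_l=-15.0, ref_logp_w=-10.0, ref_logp_l=-10.0))
```

When the policy and reference agree (zero margin) the loss equals log 2; pushing probability toward clinician-preferred reports drives it below that, which is what aligns generation with clinical standards here.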