MedRG: Medical Report Grounding with Multi-modal Large Language Model

📅 2024-04-10
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
Existing medical phrase localization methods rely on manual keyword extraction from radiology reports, resulting in low efficiency, high clinical burden, and absence of confidence estimation—hindering trustworthy deployment. This work formally defines and addresses the novel “radiology report-to-image region end-to-end localization” task: given a natural language phrase (e.g., “consolidation in the left upper lobe”), directly localize the corresponding anatomical or pathological region in medical images. Methodologically, we introduce a learnable BOX token to unlock open-vocabulary detection capabilities in multimodal large language models (MLLMs); design a unified framework jointly modeling report understanding and visual localization, integrating an MLLM, a vision encoder-decoder, and customized BOX token embeddings; and perform end-to-end joint training. Our approach achieves significant improvements over state-of-the-art methods across multiple medical imaging benchmarks, demonstrating strong efficacy, generalizability, and clinical applicability.

📝 Abstract
Medical Report Grounding is pivotal in identifying the most relevant regions in medical images based on a given phrase query, a critical aspect of medical image analysis and radiological diagnosis. However, prevailing visual grounding approaches necessitate the manual extraction of key phrases from medical reports, imposing substantial burdens on both system efficiency and physicians. In this paper, we introduce a novel framework, Medical Report Grounding (MedRG), an end-to-end solution that uses a multi-modal Large Language Model to predict key phrases by incorporating a unique token, BOX, into the vocabulary to serve as an embedding that unlocks detection capabilities. Subsequently, the vision encoder-decoder jointly decodes the hidden embedding and the input medical image, generating the corresponding grounding box. The experimental results validate the effectiveness of MedRG, surpassing the performance of existing state-of-the-art medical phrase grounding methods. This study represents a pioneering exploration of the medical report grounding task, marking the first end-to-end endeavor in this domain.
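The BOX-token mechanism described in the abstract can be sketched in a few lines of PyTorch. The sketch below is a toy stand-in, not the paper's implementation: the class, layer sizes, and the GRU/conv backbones are all illustrative assumptions, chosen only to show the data flow — a special BOX token is appended to the report tokens, its hidden state is read out of the language model, fused with image features, and regressed into a normalized bounding box.

```python
# Toy sketch of a BOX-token grounding head (all names and backbones are
# hypothetical stand-ins; MedRG's actual MLLM and vision modules differ).
import torch
import torch.nn as nn

class BoxTokenGrounder(nn.Module):
    """Stand-in pipeline: a text encoder emits hidden states for a report
    phrase; the hidden state at the appended BOX token is fused with image
    features and regressed to a normalized box (cx, cy, w, h)."""

    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.box_token_id = vocab_size                # new token appended to the vocabulary
        self.embed = nn.Embedding(vocab_size + 1, hidden)
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the MLLM
        self.img_encoder = nn.Sequential(             # stand-in for the vision encoder
            nn.Conv2d(1, hidden, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.box_head = nn.Sequential(                # fuse BOX embedding + image feature
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4), nn.Sigmoid(),       # normalized (cx, cy, w, h) in [0, 1]
        )

    def forward(self, token_ids, image):
        # Append the BOX token so its hidden state can be read out afterwards.
        batch = token_ids.size(0)
        box_col = torch.full((batch, 1), self.box_token_id, dtype=torch.long)
        ids = torch.cat([token_ids, box_col], dim=1)
        hidden_states, _ = self.text_encoder(self.embed(ids))
        box_embed = hidden_states[:, -1, :]           # hidden state at the BOX token
        img_feat = self.img_encoder(image)
        return self.box_head(torch.cat([box_embed, img_feat], dim=-1))

model = BoxTokenGrounder()
ids = torch.randint(0, 1000, (2, 12))                 # fake tokenized report phrases
imgs = torch.randn(2, 1, 64, 64)                      # fake single-channel image crops
boxes = model(ids, imgs)
print(boxes.shape)                                    # torch.Size([2, 4])
```

Because the whole path is differentiable, the language model and the box head can be trained jointly end-to-end, which is the property the abstract emphasizes.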
Problem

Research questions and friction points this paper is trying to address.

Automates diagnostic phrase extraction from medical reports
Improves grounding accuracy with uncertainty-aware predictions
Enhances clinical trust via end-to-end phrase-box alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end diagnostic phrase and box identification
Multimodal large language model with unique token
Uncertainty-aware prediction for robust grounding
Ke Zou
Apple, Inc
Power electronics · Switched-capacitor Converter · Power Semiconductor Devices
Yang Bai
Institute of High Performance Computing, A*STAR, Singapore
Zhihao Chen
College of Intelligence and Computing, Tianjin University, Tianjin, China
Yang Zhou
Institute of High Performance Computing, A*STAR, Singapore
Yidi Chen
Department of Radiology, West China Hospital, Sichuan University
Kai Ren
National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Sichuan, China; College of Computer Science, Sichuan University, Sichuan, China
Meng Wang
Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
Xuedong Yuan
National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Sichuan, China; College of Computer Science, Sichuan University, Sichuan, China
Xiaojing Shen
Department of Mathematics, Sichuan University
Information fusion · Target tracking · Applied statistics
Huazhu Fu
Principal Scientist, IHPC, A*STAR
Medical Image Analysis · AI for Healthcare · Medical AI · Trustworthy AI