GROK: From Quantitative Biomarkers to Qualitative Diagnosis via a Grounded MLLM with Knowledge-Guided Instruction

📅 2025-10-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing medical multimodal large language models (e.g., LLaVA-Med) struggle to jointly leverage color fundus photography (CFP) and optical coherence tomography (OCT) images, and show limited ability to clinically interpret OCT-derived quantitative biomarkers. Method: the authors propose a “quantitative-to-qualitative” diagnostic chain-of-thought paradigm, combining CLIP-style cross-modal alignment, knowledge-guided instruction generation, and LoRA-based fine-tuning of a 7B-parameter Qwen2 foundation model to build an ophthalmology-specific multimodal large language model (MLLM) with clinical reasoning capacity. Contribution/Results: the model enables fine-grained lesion localization and interpretable diagnostic reasoning. On the authors' proprietary ophthalmic benchmark, the 7B variant outperforms a 32B baseline and surpasses OpenAI o3 in both diagnostic report quality and fine-grained clinical evaluation, substantially improving synergistic interpretation of CFP and OCT.
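The summary notes that GROK updates the 7B Qwen2 backbone only through LoRA. The core idea of LoRA is to freeze the pretrained weight matrix and train a low-rank update instead; a minimal NumPy sketch (dimensions and the `alpha` scaling value are illustrative assumptions, not the paper's configuration) might look like:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass of a LoRA-adapted linear layer.

    W: frozen pretrained weight [d_out, d_in] (never updated).
    A: trainable down-projection [r, d_in]; B: trainable up-projection [d_out, r].
    The effective weight is W + (alpha / r) * B @ A, but the dense product
    B @ A is never materialized: the low-rank path is applied to x directly.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)
```

For a hypothetical 4096x4096 projection at rank 8, this trains 2 * 4096 * 8 = 65,536 parameters instead of 16,777,216, roughly 0.4% of the full layer, which is what makes fine-tuning a 7B backbone on a single clinical dataset tractable.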

📝 Abstract
Multimodal large language models (MLLMs) hold promise for integrating diverse data modalities, but current medical adaptations such as LLaVA-Med often fail to fully exploit the synergy between color fundus photography (CFP) and optical coherence tomography (OCT), and offer limited interpretability of quantitative biomarkers. We introduce GROK, a grounded multimodal large language model that jointly processes CFP, OCT, and text to deliver clinician-grade diagnoses of ocular and systemic disease. GROK comprises three core modules: Knowledge-Guided Instruction Generation, CLIP-Style OCT-Biomarker Alignment, and Supervised Instruction Fine-Tuning, which together establish a quantitative-to-qualitative diagnostic chain of thought, mirroring real clinical reasoning when producing detailed lesion annotations. To evaluate our approach, we introduce the Grounded Ophthalmic Understanding benchmark, which covers six disease categories and three tasks: macro-level diagnostic classification, report generation quality, and fine-grained clinical assessment of the generated chain of thought. Experiments show that, with only LoRA (Low-Rank Adaptation) fine-tuning of a 7B-parameter Qwen2 backbone, GROK outperforms comparable 7B and 32B baselines on both report quality and fine-grained clinical metrics, and even exceeds OpenAI o3. Code and data are publicly available in the GROK repository.
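The abstract describes the CLIP-Style OCT-Biomarker Alignment module only at a high level. The standard CLIP objective it references is a symmetric contrastive (InfoNCE) loss over cosine-similarity logits; a self-contained NumPy sketch, with hypothetical batch/embedding shapes and temperature (the paper's actual encoders and hyperparameters are not specified here), could look like:

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, bio_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning OCT image embeddings with
    biomarker embeddings (assumed shapes: [batch, dim], matched pairs
    at the same batch index)."""
    # L2-normalize both embedding sets so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    bio = bio_emb / np.linalg.norm(bio_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, sharpened by the temperature.
    logits = img @ bio.T / temperature
    n = logits.shape[0]
    # Cross-entropy with matched pairs on the diagonal, in both directions.
    lp_img_to_bio = logits - _logsumexp(logits, axis=1)
    lp_bio_to_img = logits.T - _logsumexp(logits.T, axis=1)
    return -(np.trace(lp_img_to_bio) + np.trace(lp_bio_to_img)) / (2 * n)
```

Minimizing this loss pulls each OCT embedding toward its own biomarker vector and away from the other samples in the batch, which is what grounds the later qualitative reasoning in quantitative measurements.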
Problem

Research questions and friction points this paper is trying to address.

Integrating color fundus photography and OCT for medical diagnosis
Improving interpretability of quantitative biomarkers in ophthalmology
Establishing clinical reasoning chain for ocular disease assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-guided instruction generation for diagnosis
CLIP-style alignment for OCT biomarker quantification
Supervised fine-tuning for clinical reasoning chain
Authors

Zhuangzhi Gao
Department of Primary Care and Mental Health, University of Liverpool, Liverpool, United Kingdom
Hongyi Qin
Institute of Life Course & Medical Sciences, University of Liverpool, Liverpool, United Kingdom
He Zhao
Department of Eye and Vision Sciences, University of Liverpool, Liverpool, United Kingdom
Qinkai Yu
University of Exeter
Medical Image Analysis · Computer Vision · Large Language Models
Feixiang Zhou
Department of Eye and Vision Sciences, University of Liverpool, Liverpool, United Kingdom
Eduard Shantsila
Department of Primary Care and Mental Health, University of Liverpool, Liverpool, United Kingdom
Uazman Alam
University of Liverpool
Diabetes · CVD · Diabetic Neuropathy/Retinopathy · Neuropathic Pain · Small Nerve Fibres
Alena Shantsila
Cardiovascular & Metabolic Medicine, University of Liverpool, Liverpool, United Kingdom
Wahbi El-Bouri
Cardiovascular & Metabolic Medicine, University of Liverpool, Liverpool, United Kingdom
Gregory Y. H. Lip
Liverpool Centre for Cardiovascular Science, University of Liverpool, Liverpool, United Kingdom
Yalin Zheng
University of Liverpool
Image Processing · Computer Vision · Machine Learning and Medical Image Analysis