RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current general-purpose multimodal large language models (MLLMs) exhibit significant limitations in quantitative retinal image analysis, lesion localization, and interpretable diagnosis, failing to meet clinical ophthalmology requirements for precise interpretation. To address this, we propose the first retinal-care–oriented multimodal dialogue assistant. Our method introduces a novel retinal-domain–specific visual instruction-tuning paradigm and an end-to-end interpretable clinical reasoning framework, built upon a large vision-language model. It integrates a high-quality, self-curated retinal image dataset, a tailored data curation pipeline, and domain-adapted fine-tuning strategies. Evaluated across eight retinal disease benchmarks, our model consistently outperforms general MLLMs, enabling fine-grained lesion localization and quantitative medical reasoning. The code and dataset are publicly released to foster reproducible research and clinical translation.

📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce *RetinalGPT*, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms general-domain MLLMs by a large margin in the diagnosis of retinal diseases on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT
Problem

Research questions and friction points this paper is trying to address.

Addresses insufficient retinal image understanding in general-domain MLLMs.
Develops RetinalGPT for precise retinal disease diagnosis and analysis.
Enhances interpretable clinical research with quantitative and lesion localization features.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized visual instruction tuning for retinal analysis
Large retinal image dataset compilation for training
End-to-end clinical research framework with quantitative analyses
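To make the instruction-tuning contribution concrete, a visual instruction-tuning sample typically pairs an image with a multi-turn conversation in a LLaVA-style JSON record. The sketch below is illustrative only: the field names, the helper function, and the example question/answer text are assumptions for exposition, not the paper's actual data schema.

```python
import json

def make_instruction_record(image_path, question, answer, lesion_boxes=None):
    """Build one hypothetical LLaVA-style visual instruction-tuning record.

    The "<image>" placeholder marks where the vision encoder's tokens are
    spliced into the prompt; `lesion_boxes` holds optional normalized
    [x1, y1, x2, y2] coordinates for localization supervision.
    """
    record = {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }
    if lesion_boxes:
        record["lesion_boxes"] = lesion_boxes
    return record

# Example record for a fundus image (contents are invented for illustration).
record = make_instruction_record(
    "fundus_001.png",
    "Is there evidence of diabetic retinopathy, and where are the lesions?",
    "Yes; microaneurysms are visible in the superior temporal quadrant.",
    lesion_boxes=[[0.62, 0.18, 0.71, 0.27]],
)
print(json.dumps(record, indent=2))
```

Records like this are what "customized visual instruction tuning" consumes: the dataset-curation pipeline would emit many such pairs, with quantitative findings and box coordinates in the answers so the fine-tuned model learns to ground its diagnoses.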