Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening

📅 2025-07-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines the feasibility and limitations of using large language models (LLMs) for structured ophthalmic decision-making, specifically diabetic retinopathy (DR) and glaucoma screening, from structured textual descriptions of retinal fundus photographs alone, without direct image input. Method: We conducted the first systematic evaluation of GPT-4’s clinical decision simulation capability, using structured prompt engineering to generate structured reports that include ICDR grading and cup-to-disc ratio estimation. Performance was assessed using accuracy, F1-score, Cohen’s kappa, and McNemar’s test, with an ablation comparing real and synthetic clinical metadata. Contribution/Results: GPT-4 achieved 67.5% accuracy in five-class DR grading and 82.3% accuracy in binary referral recommendation; glaucoma screening performance, however, was markedly worse (F1 <0.04, kappa <0.03). Incorporating clinical metadata yielded no statistically significant improvement. This work empirically characterizes both the potential and the critical constraints of pure-text, LLM-based multimodal assisted diagnosis in ophthalmology, establishing a methodological foundation for lightweight, interpretable AI-assisted clinical decision support.

📝 Abstract

Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen's kappa. McNemar's test and change rate analysis were used to assess the influence of metadata. GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
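The gap the abstract reports between accuracy and weighted F1 on one side and macro F1 and kappa on the other is typical of an imbalanced label set dominated by normal cases. The following is a minimal sketch of how these four metrics are commonly defined, written in plain Python; the toy labels are invented for illustration and are not the study's data, and the paper does not state which software computed its metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 score for every label that appears in either list."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class frequency in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement beyond what the label marginals alone would produce."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_expected == 1:
        return 0.0
    return (p_observed - p_expected) / (1 - p_expected)


# Invented toy data: 8 normal cases (grade 0) and 2 diseased cases,
# with a model that always answers "grade 0".
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1    = {macro_f1(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
print(f"kappa       = {cohen_kappa(y_true, y_pred):.3f}")
```

In this toy case a model that never detects disease still reaches 80% accuracy, while kappa falls to zero and macro F1 stays low. This is why chance-corrected and class-balanced metrics are reported alongside accuracy.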

Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs

Assessing clinical decision-making for diabetic retinopathy and glaucoma

Examining impact of metadata on diagnostic accuracy
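The metadata question can be framed as a paired comparison: the same cases are predicted with and without metadata, and McNemar's test looks only at the cases where exactly one condition is correct. The sketch below makes two assumptions the source does not confirm: it uses the exact (binomial) form of McNemar's test, and it takes "change rate" to mean the fraction of predictions that differ between conditions. All data are invented.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts cases only condition A gets
    right and c counts cases only condition B gets right.
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    # Under the null hypothesis, discordant pairs split 50/50 between b and c.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


def change_rate(pred_a, pred_b):
    """Assumed definition: share of cases whose prediction changed."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


# Invented toy data: binary referral labels for 8 cases.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
without_metadata = [1, 0, 0, 0, 1, 0, 0, 0]
with_metadata = [1, 1, 0, 0, 1, 0, 0, 1]

b, c, p_value = mcnemar_exact(y_true, without_metadata, with_metadata)
print(f"b={b}, c={c}, p={p_value:.3f}")
print(f"change rate = {change_rate(without_metadata, with_metadata):.2f}")
```

Here metadata fixes one case and breaks another, so the discordant pairs balance out and the test gives no evidence of a difference, even though a quarter of the predictions changed. Reporting both numbers separates "the predictions moved" from "the predictions got better".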

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated GPT-4 for ophthalmic decision-making using structured prompts

Assessed impact of metadata on diabetic retinopathy and glaucoma screening

Identified education, documentation, and image annotation workflows as possible roles for LLMs in ophthalmology
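To make the "with or without metadata" design concrete, the following sketch assembles a structured text prompt from a description of one image. It is purely illustrative: the paper's actual prompt template is not given here, so the function name `build_prompt`, the field names, and the wording are all hypothetical.

```python
def build_prompt(findings, metadata=None):
    """Assemble a structured text prompt for one fundus image.

    `findings` and `metadata` are plain dicts; their keys are hypothetical
    placeholders, not the fields used in the study.
    """
    lines = ["Fundus photograph description:"]
    lines += [f"- {name}: {value}" for name, value in findings.items()]
    if metadata:
        # The metadata block is the only part that differs between conditions.
        lines.append("Patient metadata:")
        lines += [f"- {name}: {value}" for name, value in metadata.items()]
    lines.append(
        "Tasks: (1) assign an ICDR severity grade; "
        "(2) recommend DR referral (yes/no); "
        "(3) estimate the cup-to-disc ratio."
    )
    return "\n".join(lines)


# Invented example values, for illustration only.
findings = {"microaneurysms": "few", "hemorrhages": "none", "optic disc": "sharp margins"}
metadata = {"age": 58, "diabetes duration (years)": 12}

prompt_without = build_prompt(findings)
prompt_with = build_prompt(findings, metadata)
print(prompt_with)
```

Keeping the image description and task text identical across the two conditions means any change in the model's answer can be attributed to the metadata block alone, which is what a paired test such as McNemar's requires.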

🔎 Similar Papers

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models