Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study evaluates the safety and accuracy of small-scale medical large language models (LLMs) in responding to ophthalmology patient inquiries and assesses the reliability of LLM-based automated evaluation methods. Clinicians and GPT-4-Turbo independently rated 2,160 responses from four models across 180 ophthalmic questions using a multidimensional scoring approach grounded in the S.C.O.R.E. framework, supplemented by correlation analysis and kernel density estimation. This work presents the first systematic validation of resource-efficient models for clinical applicability, demonstrating that Meerkat-7B achieves the best performance, while MedLLaMA3-v20 exhibits hallucinatory or misleading content in 25.5% of its responses. Notably, GPT-4-Turbo's evaluations show strong agreement with clinician ratings (Spearman ρ = 0.80), supporting the feasibility of human–AI collaborative frameworks for large-scale clinical assessment.

๐Ÿ“ Abstract
Domain-specific large language models are increasingly used to support patient education, triage, and clinical decision-making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs (Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20) in answering ophthalmology-related patient queries and assessed the feasibility of LLM-based evaluation against clinician grading. In this cross-sectional study, 180 ophthalmology patient queries were answered by each model, generating 2,160 responses. Models were selected for parameter sizes under 10 billion to enable resource-efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework, which assesses safety, consensus and context, objectivity, reproducibility, and explainability, with ratings assigned on a five-point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance, with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5 percent of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall (Spearman ρ of 0.80, Kendall τ of 0.67), though Senior Consultants graded more conservatively. Overall, medical LLMs demonstrated potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus; these findings support the feasibility of LLM-based evaluation for large-scale benchmarking and the need for hybrid automated and clinician review frameworks to guide safe clinical deployment.
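The agreement statistics reported in the abstract (Spearman rank correlation and Kendall tau) can be computed from any two paired vectors of Likert ratings. Below is a minimal, dependency-free Python sketch; the rating lists are hypothetical illustrations, not the study's data, and tau is computed as the tie-corrected tau-b variant (an assumption, since the abstract does not specify which variant was used):

```python
from itertools import combinations
import math

def average_ranks(xs):
    """Assign 1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, tie-corrected."""
    conc = disc = tied_x = tied_y = n0 = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        n0 += 1
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    return (conc - disc) / math.sqrt((n0 - tied_x) * (n0 - tied_y))

# Hypothetical paired 5-point ratings of the same responses
# (one clinician score and one LLM-judge score per response).
clinician = [4, 5, 3, 2, 4, 5, 1, 3, 4, 2]
llm_judge = [4, 4, 3, 2, 5, 5, 2, 3, 4, 1]
print(f"Spearman rho  = {spearman_rho(clinician, llm_judge):.2f}")
print(f"Kendall tau-b = {kendall_tau_b(clinician, llm_judge):.2f}")
```

In practice a library routine (e.g. SciPy's `spearmanr` and `kendalltau`) would be used at the study's scale; the hand-rolled version above only serves to make the two statistics concrete.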
Problem

Research questions and friction points this paper is trying to address.

clinical validation
medical large language models
ophthalmology
LLM-based evaluation
patient queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical LLM evaluation
LLM-based assessment
ophthalmology chatbot
S.C.O.R.E. framework
clinical validation
Authors
Ting Fang Tan
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Kabilan Elangovan
AI Scientist, SingHealth
Andreas Pollreisz
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore; Department of Ophthalmology and Optometry, Medical University of Vienna, Austria
Kevin Bryan Dy
The Hospital at Maayo, Cebu, Philippines
Wei Yan Ng
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Joy Le Yi Wong
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Liyuan Jin
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Chrystie Quek Wan Ning
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Ashley Shuen Ying Hong
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Arun James Thirunavukarasu
International Centre for Eye Health, London School of Hygiene and Tropical Medicine, London, UK
Shelley Yin-His Chang
Department of Ophthalmology, Chang Gung Memorial Hospital, Keelung, Taiwan; College of Medical Science and Technology, Taipei Medical University and National Health Research Institutes, Taipei, Taiwan
Jie Yao
Duke-NUS Medical College
Dylan Hong
Centre of AI in Medicine, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
Zhaoran Wang
Associate Professor at Northwestern University
Amrita Gupta
Conservation Science Partners
Daniel SW Ting
Singapore National Eye Center