Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study evaluates the safety and accuracy of small-scale medical large language models (LLMs) in responding to ophthalmology patient inquiries and assesses the reliability of LLM-based automated evaluation methods. Clinicians and GPT-4-Turbo independently rated 2,160 responses from four models across 180 ophthalmic questions using a multidimensional scoring approach grounded in the S.C.O.R.E. framework, supplemented by correlation analysis and kernel density estimation. This work presents the first systematic validation of resource-efficient models for clinical applicability, demonstrating that Meerkat-7B achieves the best performance, while MedLLaMA3-v20 exhibits hallucinatory or misleading content in 25.5% of its responses. Notably, GPT-4-Turbo's evaluations show strong agreement with clinician ratings (Spearman ρ = 0.80), supporting the feasibility of human–AI collaborative frameworks for large-scale clinical assessment.

๐Ÿ“ Abstract
Domain-specific large language models are increasingly used to support patient education, triage, and clinical decision-making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs (Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20) in answering ophthalmology-related patient queries and assessed the feasibility of LLM-based evaluation against clinician grading. In this cross-sectional study, 180 ophthalmology patient queries were answered by each model, generating 2,160 responses. Models were selected for parameter sizes under 10 billion to enable resource-efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework, which assesses safety, consensus and context, objectivity, reproducibility, and explainability, with ratings assigned on a five-point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance, with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5 percent of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall (Spearman ρ of 0.80, Kendall τ of 0.67), though Senior Consultants graded more conservatively. Overall, medical LLMs demonstrated potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus; these findings support the feasibility of LLM-based evaluation for large-scale benchmarking and the need for hybrid automated and clinician review frameworks to guide safe clinical deployment.
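The agreement statistics reported in the abstract (Spearman rank correlation and Kendall tau) can be computed from any two paired vectors of Likert ratings. Below is a minimal, dependency-free Python sketch; the rating lists are hypothetical illustrations, not the study's data, and tau is computed as the tie-corrected tau-b variant (an assumption, since the abstract does not specify which variant was used):

```python
from itertools import combinations
import math

def average_ranks(xs):
    """Assign 1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, tie-corrected."""
    conc = disc = tied_x = tied_y = n0 = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        n0 += 1
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    return (conc - disc) / math.sqrt((n0 - tied_x) * (n0 - tied_y))

# Hypothetical paired 5-point ratings of the same responses
# (one clinician score and one LLM-judge score per response).
clinician = [4, 5, 3, 2, 4, 5, 1, 3, 4, 2]
llm_judge = [4, 4, 3, 2, 5, 5, 2, 3, 4, 1]
print(f"Spearman rho  = {spearman_rho(clinician, llm_judge):.2f}")
print(f"Kendall tau-b = {kendall_tau_b(clinician, llm_judge):.2f}")
```

In practice a library routine (e.g. SciPy's `spearmanr` and `kendalltau`) would be used at the study's scale; the hand-rolled version above only serves to make the two statistics concrete.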
Problem

Research questions and friction points this paper is trying to address.

clinical validation
medical large language models
ophthalmology
LLM-based evaluation
patient queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical LLM evaluation
LLM-based assessment
ophthalmology chatbot
S.C.O.R.E. framework
clinical validation
Authors
Ting Fang Tan
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Kabilan Elangovan
AI Scientist, SingHealth
Andreas Pollreisz
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore; Department of Ophthalmology and Optometry, Medical University of Vienna, Austria
Kevin Bryan Dy
The Hospital at Maayo, Cebu, Philippines
Wei Yan Ng
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Joy Le Yi Wong
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Liyuan Jin
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Chrystie Quek Wan Ning
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Ashley Shuen Ying Hong
Singapore National Eye Centre, Singapore Eye Research Institute, Singapore
Arun James Thirunavukarasu
International Centre for Eye Health, London School of Hygiene and Tropical Medicine, London, UK
Shelley Yin-His Chang
Department of Ophthalmology, Chang Gung Memorial Hospital, Keelung, Taiwan; College of Medical Science and Technology, Taipei Medical University and National Health Research Institutes, Taipei, Taiwan
Jie Yao
Duke-NUS Medical College
Dylan Hong
Centre of AI in Medicine, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
Zhaoran Wang
Associate Professor at Northwestern University
Amrita Gupta
Conservation Science Partners
Daniel SW Ting
Singapore National Eye Center