Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the trade-off between accuracy and deployment cost when applying large language models (LLMs) to ophthalmic question answering. We propose an automated "LLM-as-a-judge" evaluation framework anchored on multiple human references, integrating Bradley-Terry pairwise ranking with token-level cost modeling to systematically characterize the accuracy-cost Pareto frontier of the GPT-5 series (including GPT-5-high and GPT-5-mini-low) on closed-access multiple-choice ophthalmology exam questions. Results show GPT-5-high achieves 96.5% accuracy, significantly outperforming GPT-4o, o1-high, and all GPT-5-nano variants, though not o3-high; GPT-5-mini-low delivers the most favorable cost-accuracy balance. To our knowledge, this is the first scalable, reproducible, domain-specific LLM evaluation paradigm for ophthalmology, enabling principled model selection for the deployment of clinical AI systems.
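The token-level cost modeling mentioned above can be sketched as a simple per-call estimate. A minimal sketch, assuming per-million-token pricing; the prices and token counts below are placeholders, not the paper's figures:

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               in_price: float, out_price: float) -> float:
    """Estimate the USD cost of one model call from token counts.

    in_price / out_price are USD per million input/output tokens.
    """
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1e6

# Hypothetical call: 1,000 prompt tokens, 500 completion tokens,
# at $2/M input and $8/M output
cost = query_cost(1000, 500, 2.0, 8.0)  # -> 0.006
```

Summing such per-question estimates over a benchmark gives the total-cost axis against which accuracy is plotted.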

📝 Abstract
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
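The head-to-head ranking above uses a Bradley-Terry model, which assigns each model a latent strength such that the probability model i beats model j is p_i / (p_i + p_j). A minimal sketch of fitting such a model to pairwise win counts with the standard MM (minorization-maximization) iteration; the win matrix below is illustrative, not the paper's data:

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i][j] = number of head-to-head comparisons model i won over model j.
    Returns strengths normalized to mean 1; ratios (e.g. p[0]/p[1])
    express relative strength.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s * n for x in new_p]  # renormalize to mean 1
    return p

# Illustrative (hypothetical) win matrix for three models
wins = [[0, 60, 80],   # model A's wins over A, B, C
        [40, 0, 70],   # model B
        [20, 30, 0]]   # model C
strengths = bradley_terry(wins)
```

Strength ratios such as `strengths[0] / strengths[1]` correspond to the "1.66x stronger" style comparisons reported in the abstract.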
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5 configurations for medical QA accuracy
Assessing cost-efficiency trade-offs in GPT-5 models
Benchmarking LLM performance in ophthalmology question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 12 GPT-5 configurations for accuracy and cost-efficiency
Used LLM-as-a-judge framework for rationale quality assessment
Identified Pareto-optimal GPT-5 settings for performance-cost balance
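The Pareto-frontier idea in the last bullet (keep only configurations not beaten on both cost and accuracy) can be sketched as follows. The cost and accuracy figures below are hypothetical placeholders, not the paper's measured values:

```python
def pareto_frontier(configs):
    """Return names of configs not dominated by any other.

    A config is dominated if some other config is no more expensive
    and no less accurate, with at least one strict improvement.
    configs: list of (name, cost, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in configs:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for n2, c2, a2 in configs if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical cost (USD per 1K questions) and accuracy values
configs = [
    ("gpt-5-high",     9.0, 0.965),
    ("gpt-5-mini-low", 0.8, 0.930),
    ("gpt-5-nano-low", 0.3, 0.850),
    ("gpt-4o",         2.5, 0.880),  # costlier and less accurate than mini-low
]
```

With these placeholder numbers, `gpt-4o` drops off the frontier because `gpt-5-mini-low` is both cheaper and more accurate, mirroring the kind of dominance analysis the paper performs.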
Fares Antaki
Cleveland Clinic Cole Eye Institute
Ophthalmology, Retina, Vitreoretinal surgery, Artificial intelligence, Large language models
David Mikhail
University of Toronto
Surgery, Medicine, Artificial Intelligence
Daniel Milad
Université de Montréal, Department of Ophthalmology
Ophthalmology, Artificial Intelligence, Machine Learning, Automated Machine Learning
Danny A Mammo
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Sumit Sharma
Assistant Professor of Computer Science Engineering, Chandigarh University
Smartphone, Malware, Bugs, Healthcare, Transportation
Sunil K Srivastava
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Bing Yu Chen
Cleveland Clinic
Large language models in vascular neurology
Samir Touma
Ophthalmology Resident, University of Montreal
Mertcan Sevgi
Clinical Research Fellow in Artificial Intelligence, UCL Institute of Ophthalmology
AI, Global Health
Jonathan El-Khoury
Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
Pearse A Keane
Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre at Moorfields, Moorfields Eye Hospital NHS Foundation Trust, London, UK
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis
Yih Chung Tham
Yong Loo Lin School of Medicine, National University of Singapore; Singapore Eye Research Institute
Ophthalmology, Epidemiology, Visual Impairment, Deep Learning
Renaud Duval
Assistant Professor of Ophthalmology, Université de Montréal
Ophthalmology, Retina, Surgery, Machine Learning