Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the trade-off between accuracy and deployment cost when applying large language models (LLMs) to ophthalmic question answering. We propose an automated "LLM-as-a-judge" evaluation framework anchored on multiple human references, integrating Bradley-Terry pairwise ranking with token-level cost modeling to systematically characterize the accuracy-cost Pareto frontier of the GPT-5 series (including GPT-5-high and GPT-5-mini-low) on closed-access multiple-choice ophthalmology exam questions. Results show GPT-5-high achieves 96.5% accuracy, significantly outperforming GPT-4o, o1-high, and all GPT-5-nano variants, though not o3-high; GPT-5-mini-low delivers the most favorable cost-accuracy balance. To our knowledge, this is the first scalable, reproducible, domain-specific LLM evaluation paradigm for ophthalmology, enabling principled model selection for the deployment of clinical AI systems.
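The token-level cost modeling mentioned above can be sketched as a simple per-call estimate. A minimal sketch, assuming per-million-token pricing; the prices and token counts below are placeholders, not the paper's figures:

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               in_price: float, out_price: float) -> float:
    """Estimate the USD cost of one model call from token counts.

    in_price / out_price are USD per million input/output tokens.
    """
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1e6

# Hypothetical call: 1,000 prompt tokens, 500 completion tokens,
# at $2/M input and $8/M output
cost = query_cost(1000, 500, 2.0, 8.0)  # -> 0.006
```

Summing such per-question estimates over a benchmark gives the total-cost axis against which accuracy is plotted.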

📝 Abstract
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
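The head-to-head ranking above uses a Bradley-Terry model, which assigns each model a latent strength such that the probability model i beats model j is p_i / (p_i + p_j). A minimal sketch of fitting such a model to pairwise win counts with the standard MM (minorization-maximization) iteration; the win matrix below is illustrative, not the paper's data:

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i][j] = number of head-to-head comparisons model i won over model j.
    Returns strengths normalized to mean 1; ratios (e.g. p[0]/p[1])
    express relative strength.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s * n for x in new_p]  # renormalize to mean 1
    return p

# Illustrative (hypothetical) win matrix for three models
wins = [[0, 60, 80],   # model A's wins over A, B, C
        [40, 0, 70],   # model B
        [20, 30, 0]]   # model C
strengths = bradley_terry(wins)
```

Strength ratios such as `strengths[0] / strengths[1]` correspond to the "1.66x stronger" style comparisons reported in the abstract.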
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5 configurations for medical QA accuracy
Assessing cost-efficiency trade-offs in GPT-5 models
Benchmarking LLM performance in ophthalmology question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 12 GPT-5 configurations for accuracy and cost-efficiency
Used LLM-as-a-judge framework for rationale quality assessment
Identified Pareto-optimal GPT-5 settings for performance-cost balance
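The Pareto-frontier idea in the last bullet (keep only configurations not beaten on both cost and accuracy) can be sketched as follows. The cost and accuracy figures below are hypothetical placeholders, not the paper's measured values:

```python
def pareto_frontier(configs):
    """Return names of configs not dominated by any other.

    A config is dominated if some other config is no more expensive
    and no less accurate, with at least one strict improvement.
    configs: list of (name, cost, accuracy) tuples.
    """
    frontier = []
    for name, cost, acc in configs:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for n2, c2, a2 in configs if n2 != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical cost (USD per 1K questions) and accuracy values
configs = [
    ("gpt-5-high",     9.0, 0.965),
    ("gpt-5-mini-low", 0.8, 0.930),
    ("gpt-5-nano-low", 0.3, 0.850),
    ("gpt-4o",         2.5, 0.880),  # costlier and less accurate than mini-low
]
```

With these placeholder numbers, `gpt-4o` drops off the frontier because `gpt-5-mini-low` is both cheaper and more accurate, mirroring the kind of dominance analysis the paper performs.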
Fares Antaki
Cleveland Clinic Cole Eye Institute
Ophthalmology, Retina, Vitreoretinal surgery, Artificial intelligence, Large language models
David Mikhail
University of Toronto
Surgery, Medicine, Artificial Intelligence
Daniel Milad
Université de Montréal, Department of Ophthalmology
Ophthalmology, Artificial Intelligence, Machine Learning, Automated Machine Learning
Danny A Mammo
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Sumit Sharma
Assistant Professor of Computer Science Engineering, Chandigarh University
Smartphone, Malware, Bugs, Healthcare, Transportation
Sunil K Srivastava
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA
Bing Yu Chen
Cleveland Clinic
Large language models in vascular neurology
Samir Touma
Ophthalmology Resident, University of Montreal
Mertcan Sevgi
Clinical Research Fellow in Artificial Intelligence, UCL Institute of Ophthalmology
AI, Global Health
Jonathan El-Khoury
Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada; Department of Ophthalmology, Hopital Maisonneuve-Rosemont, Montreal, Quebec, Canada
Pearse A Keane
Institute of Ophthalmology, University College London, London, UK; NIHR Biomedical Research Centre at Moorfields, Moorfields Eye Hospital NHS Foundation Trust, London, UK
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis
Yih Chung Tham
Yong Loo Lin School of Medicine, National University of Singapore; Singapore Eye Research Institute
Ophthalmology, Epidemiology, Visual Impairment, Deep Learning
Renaud Duval
Assistant Professor of Ophthalmology, Université de Montréal
Ophthalmology, Retina, Surgery, Machine Learning