Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study

📅 2025-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior to this work, no systematic evaluation existed of OpenAI’s o1 model’s reasoning capabilities and domain-specific adaptability in ophthalmology. Method: We benchmarked o1 against five state-of-the-art LLMs, including GPT-4o, on 6,990 ophthalmology-specific questions from MedMCQA, covering subspecialties such as glaucoma and lens disorders. Evaluation employed accuracy, macro-F1, and text-generation metrics for reasoning. Contribution/Results: o1 achieved the highest overall accuracy (88.0%) and macro-F1 but ranked third in reasoning capabilities as measured by the text-generation metrics. Across subtopics, it ranked first in Lens and Glaucoma but second to GPT-4o in Corneal and External Diseases, Vitreous and Retina, and Oculoplastic and Orbital Diseases. It also performed better on questions with longer ground-truth explanations. This study reveals o1’s “strong generalization, weak specialization” profile in clinical domains and underscores the need for domain-specific refinement.

📝 Abstract
Question: What is the performance and reasoning ability of OpenAI o1 compared to other large language models in addressing ophthalmology-specific questions?
Findings: This study evaluated OpenAI o1 and five LLMs using 6,990 ophthalmological questions from MedMCQA. o1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in “Lens” and “Glaucoma” but second to GPT-4o in “Corneal and External Diseases”, “Vitreous and Retina” and “Oculoplastic and Orbital Diseases”. Subgroup analyses showed o1 performed better on queries with longer ground-truth explanations.
Meaning: o1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields.
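The two headline metrics, accuracy and macro-F1, can be illustrated with a minimal sketch. It assumes each question has four answer options labeled A to D and that a model's output has already been reduced to one option letter. The function names and the toy answers are hypothetical and are not taken from the study's code or data.

```python
def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=("A", "B", "C", "D")):
    """Unweighted mean of per-option F1, so rare options count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the option never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (invented answers, not study data): one of eight questions is wrong.
gold = ["A", "B", "C", "D", "A", "B", "C", "D"]
pred = ["A", "B", "C", "D", "A", "B", "C", "A"]
print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Macro-F1 averages the per-option scores without weighting by frequency, which is why it can differ from accuracy even when the answer options are balanced.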
Problem

Research questions and friction points this paper is trying to address.

OpenAI o1 Model Evaluation
Ophthalmology
Large Language Model Comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

OpenAI o1 Model
Ophthalmology
Performance Evaluation
Sahana Srinivasan
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Xuguang Ai
Biomedical Informatics & Data Science, Yale University
AI in Healthcare, Data Science, NLP, Biomedical Informatics
Minjie Zou
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Ke Zou
Apple, Inc
Power electronics, Switched-capacitor Converter, Power Semiconductor Devices
Hyunjae Kim
Yale University
Natural Language Processing, Biomedical Informatics, Healthcare
Thaddaeus Wai Soon Lo
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Krithi Pushpanathan
Research Associate, National University of Singapore
ophthalmology, artificial intelligence
Yiming Kong
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, USA
Anran Li
Yale University
Trustworthy AI, medical LLMs, federated learning
Maxwell Singer
Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, USA
Kai Jin
Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
F. Antaki
Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA; The CHUM School of Artificial Intelligence in Healthcare, Montreal, QC, Canada
David Ziyou Chen
Department of Ophthalmology, National University Hospital, Singapore
Dianbo Liu
Assistant professor, National University of Singapore
Push the limits of human, machine learning, biomedical sciences
Ron A. Adelman
Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, USA
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis
Yih Chung Tham
Yong Loo Lin School of Medicine, National University of Singapore; Singapore Eye Research Institute
Ophthalmology, Epidemiology, Visual Impairment, Deep Learning