🤖 AI Summary
Prior to this work, there had been no systematic evaluation of OpenAI's o1 model's reasoning capabilities and domain-specific adaptability in ophthalmology. Method: We benchmarked o1 against five state-of-the-art LLMs, including GPT-4o, on 6,990 ophthalmology-specific questions from MedMCQA, covering subspecialties such as glaucoma and lens disorders. Evaluation employed accuracy, macro-F1, and text-based reasoning metrics (chain-of-thought coverage, logical coherence). Contribution/Results: o1 achieved the highest overall accuracy (88.0%) and macro-F1, ranking first overall but only third on the text-based reasoning metrics and second to GPT-4o in several subspecialties. Notably, it excelled on questions with longer ground-truth explanations and ranked first in the Lens and Glaucoma categories. This study reveals o1's “strong generalization, weak specialization” profile in clinical domains, establishing a novel methodology and benchmark for evaluating LLMs' clinical readiness.
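To make the scoring concrete, the sketch below shows one plausible way to compute the two answer-level metrics named above (accuracy and macro-F1) over multiple-choice predictions. It is not the authors' code; the use of scikit-learn and the example data are assumptions for illustration only.

```python
# Minimal sketch (not the authors' evaluation code): scoring multiple-choice
# predictions with overall accuracy and macro-F1, assuming scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical example data: gold answer options and model predictions (A-D).
gold = ["A", "C", "B", "D", "A", "B"]
pred = ["A", "C", "D", "D", "A", "C"]

accuracy = accuracy_score(gold, pred)             # fraction of exactly correct answers
macro_f1 = f1_score(gold, pred, average="macro")  # unweighted mean of per-option F1 scores

print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```

Macro-F1 averages the per-class F1 scores without weighting by class frequency, so it penalizes a model that is accurate overall but weak on rarely chosen answer options.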
📝 Abstract
Question: How do the performance and reasoning ability of OpenAI o1 compare with those of other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five other LLMs using 6,990 ophthalmological questions from MedMCQA. o1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in “Lens” and “Glaucoma” but second to GPT-4o in “Corneal and External Diseases”, “Vitreous and Retina”, and “Oculoplastic and Orbital Diseases”. Subgroup analyses showed that o1 performed better on queries with longer ground-truth explanations. Meaning: o1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized clinical fields.