🤖 AI Summary
Prior to this work, there had been no systematic evaluation of OpenAI's o1 model's reasoning capabilities and domain-specific adaptability in ophthalmology. Method: We benchmarked o1 against five state-of-the-art LLMs, including GPT-4o, on 6,990 ophthalmology-specific questions from MedMCQA, covering subspecialties such as glaucoma and lens disorders. Evaluation employed accuracy, macro-F1, and text-based reasoning metrics (chain-of-thought coverage, logical coherence). Contribution/Results: o1 achieved the highest overall accuracy (88.0%) and macro-F1, ranking first overall but only third on the text-based reasoning metrics and second to GPT-4o in several subspecialties. Notably, it excelled on questions with longer ground-truth explanations and ranked first in the Lens and Glaucoma categories. This study reveals o1's “strong generalization, weak specialization” profile in clinical domains, establishing a novel methodology and benchmark for evaluating LLMs' clinical readiness.
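To make the scoring concrete, the sketch below shows one plausible way to compute the two answer-level metrics named above (accuracy and macro-F1) over multiple-choice predictions. It is not the authors' code; the use of scikit-learn and the example data are assumptions for illustration only.

```python
# Minimal sketch (not the authors' evaluation code): scoring multiple-choice
# predictions with overall accuracy and macro-F1, assuming scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical example data: gold answer options and model predictions (A-D).
gold = ["A", "C", "B", "D", "A", "B"]
pred = ["A", "C", "D", "D", "A", "C"]

accuracy = accuracy_score(gold, pred)             # fraction of exactly correct answers
macro_f1 = f1_score(gold, pred, average="macro")  # unweighted mean of per-option F1 scores

print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}")
```

Macro-F1 averages the per-class F1 scores without weighting by class frequency, so it penalizes a model that is accurate overall but weak on rarely chosen answer options.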
📝 Abstract
Question: How do the performance and reasoning ability of OpenAI o1 compare with those of other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five other LLMs using 6,990 ophthalmological questions from MedMCQA. o1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in “Lens” and “Glaucoma” but second to GPT-4o in “Corneal and External Diseases”, “Vitreous and Retina”, and “Oculoplastic and Orbital Diseases”. Subgroup analyses showed that o1 performed better on queries with longer ground-truth explanations. Meaning: o1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized clinical fields.