BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

📅 2025-07-21
🤖 AI Summary
Existing benchmarks for evaluating large language models (LLMs) in ophthalmology are narrow in coverage and rely too heavily on accuracy, with no systematic assessment of clinical reasoning. Method: BELO is a standardized ophthalmology benchmark validated through multiple rounds of expert review. It comprises 900 multiple-choice questions drawn from five data sources (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). Candidate questions were selected using keyword matching and a fine-tuned PubMedBERT model, then checked over multiple rounds by ophthalmologists, who removed duplicate and substandard items and refined the answer explanations. The evaluation protocol combines automatic metrics (accuracy, macro-F1, ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore) with qualitative review by ophthalmologists. Contribution/Results: BELO is accompanied by a public leaderboard and is kept as a hold-out, evaluation-only benchmark. Evaluating six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) exposes shortcomings in generating complete, clinically grounded explanations, which suggests that BELO can differentiate model performance at a fine-grained level.

📝 Abstract
Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking, and duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ's correct answer, and three senior ophthalmologists then adjudicated the refined explanations. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
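The two classification metrics named in the abstract, accuracy and macro-F1, can be illustrated with a minimal sketch. It assumes each model answer is reduced to a single option letter and that the macro-F1 classes are the option letters seen in either the gold or predicted answers; the abstract does not state how BELO defines the label set, and the function names and sample data below are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-option F1 scores.

    Assumption: the label set is the union of options appearing in gold or pred.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical answers for four questions (not BELO data).
gold = ["A", "B", "C", "A"]
pred = ["A", "B", "A", "A"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

On this toy input, accuracy is 0.75 while macro-F1 is about 0.6, because option C is never predicted and contributes an F1 of 0. This shows why the two metrics can diverge when answer options are unevenly distributed.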
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' clinical accuracy in ophthalmology
Assesses reasoning quality of LLMs in ophthalmology
Provides standardized benchmark for ophthalmology knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized expert-checked ophthalmology benchmark BELO
PubMedBERT fine-tuned for ophthalmology MCQ curation
Multi-metric LLM evaluation with human expert review
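The keyword-matching stage of the curation pipeline can be sketched as a simple pre-filter over question text. This is an illustration under stated assumptions, not the authors' implementation: the keyword list, the prefix-matching rule, and the function names are all hypothetical, and the fine-tuned PubMedBERT classifier that the paper pairs with keyword matching is not reproduced here.

```python
import re

# Hypothetical keyword stems; the paper's actual keyword list is not given here.
KEYWORDS = ("retina", "cornea", "glaucoma", "cataract", "uveitis")

# A leading word boundary lets stems match derived forms such as "retinal".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)


def is_candidate(question: str) -> bool:
    """Return True if the question mentions any ophthalmology keyword stem."""
    return PATTERN.search(question) is not None


def filter_candidates(questions):
    """Keep only questions that pass the keyword pre-filter."""
    return [q for q in questions if is_candidate(q)]


# Hypothetical questions (not BELO data).
sample = [
    "A patient presents with retinal detachment. What is the next step?",
    "Which drug is first-line for hypertension?",
]
print(filter_candidates(sample))
```

A keyword filter of this kind favours recall over precision, which is consistent with the pipeline described above, where a classifier and several rounds of expert checking follow it.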
Sahana Srinivasan
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Xuguang Ai
Biomedical Informatics & Data Science, Yale University
AI in Healthcare, Data Science, NLP, Biomedical Informatics
Thaddaeus Wai Soon Lo
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Aidan Gilson
Massachusetts Eye and Ear, Harvard Medical School
Ophthalmology, Machine Learning, Artificial Intelligence
Minjie Zou
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Ke Zou
Apple, Inc
Power electronics, Switched-capacitor Converter, Power Semiconductor Devices
Hyunjae Kim
Yale University
Natural Language Processing, Biomedical Informatics, Healthcare
Mingjia Yang
University of Michigan
Krithi Pushpanathan
Research Associate, National University of Singapore
ophthalmology, artificial intelligence
Samantha Yew
National University of Singapore
Retinal Imaging, Deep Learning, Community Eye Care, Optometry, Implementation Science
Wan Ting Loke
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Jocelyn Goh
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Yibing Chen
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Yiming Kong
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, USA
Emily Yuelei Fu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, USA
Michelle Ongyong Hui
Singapore University of Technology and Design, Singapore
Kristen Nwanyanwu
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, USA
Amisha Dave
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, USA
Kelvin Zhenghao Li
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Chen-Hsin Sun
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Mark Chia
University College London | Royal Victorian Eye & Ear Hospital
artificial intelligence, ophthalmology, epidemiology, Indigenous health
Gabriel Dawei Yang
Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Wendy Meihua Wong
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
David Ziyou Chen
Centre for Innovation and Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
Dianbo Liu
Assistant professor, National University of Singapore
Push the limits of human, machine learning, biomedical sciences