LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
The lack of comprehensive, multimodal evaluation benchmarks tailored to generative multimodal large language models (MLLMs) in ophthalmology hinders their clinical deployment. To address this, the authors introduce LMOD+, a large-scale multimodal ophthalmic benchmark comprising 32,633 instances spanning 12 ocular diseases and five imaging modalities, with multi-granular tasks: anatomical structure recognition, disease screening, severity staging under international grading standards (e.g., for DR and AMD), and demographic prediction for bias evaluation. Compared with the original LMOD benchmark, LMOD+ expands the dataset by nearly 50%, with a substantial enlargement of color fundus photography, and adds free-text annotations. The authors systematically evaluate 24 state-of-the-art MLLMs under zero-shot settings: the top-performing models achieve only ~58% accuracy in disease screening, and performance on fine-grained staging tasks remains substantially limited. The dataset, curation pipeline, and a public leaderboard will be released, providing standardized infrastructure for evaluating generative AI capabilities in ophthalmology.

๐Ÿ“ Abstract
Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.
Problem

Research questions and friction points this paper is trying to address.

Addressing global health burden of vision-threatening eye diseases
Overcoming limitations in multimodal large language model evaluation
Providing comprehensive benchmark for ophthalmic AI applications development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multimodal dataset for ophthalmology MLLMs
Multi-granular annotations across imaging modalities
Systematic evaluation of 24 state-of-the-art models
Zhenyue Qin
School of Medicine, Yale University

Yang Liu
School of Computing, Australian National University

Yu Yin
School of Engineering, Imperial College London

Jinyu Ding
School of Medicine, Yale University

Haoran Zhang
School of Medicine, Yale University

Anran Li
Yale University
Trustworthy AI, medical LLMs, federated learning

Dylan Campbell
Lecturer, Australian National University
Registration, Global optimization, 3D Reconstruction, 3D/Stereo Scene Analysis

Xuansheng Wu
University of Georgia
NLP, Explainable AI, Recommendation systems

Ke Zou
Apple, Inc
Power electronics, Switched-capacitor Converter, Power Semiconductor Devices

Tiarnan D. L. Keenan
Staff Clinician, National Eye Institute, National Institutes of Health
Ophthalmology

Emily Y. Chew
National Eye Institute, National Institutes of Health

Zhiyong Lu
Senior Investigator, NLM; Adjunct Professor of CS, UIUC
BioNLP, Biomedical Informatics, Medical AI, Artificial Intelligence

Yih-Chung Tham
Yong Loo Lin School of Medicine, National University of Singapore

Ninghao Liu
Assistant Professor, University of Georgia
Explainable AI, Fairness in Machine Learning, Graph Mining, Anomaly Detection

Xiuzhen Zhang
Professor of Data Science, RMIT University, Australia
Data science, machine learning, natural language processing, responsible AI and misinformation

Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis