LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models

📅 2024-10-02
🏛️ arXiv.org
🤖 AI Summary
Current large vision-language models (LVLMs) lack standardized, ophthalmology-specific evaluation benchmarks, hindering their deployment in anatomical understanding, disease diagnosis, and clinical decision support. Method: We introduce LMOD—the first large-scale, multimodal ophthalmic benchmark—comprising five imaging modalities, clinical text, and biomarker data, enabling comprehensive multi-task evaluation. We systematically identify six prevalent failure modes of LVLMs in ophthalmology and propose a fine-grained, specialty-adapted assessment framework. Results: Evaluating 13 state-of-the-art LVLMs on 21,993 samples, we find their ophthalmic performance substantially lags behind general-domain capabilities; in contrast, ophthalmology-specialized supervised models achieve superior accuracy, confirming the critical need for domain adaptation. This work fills a fundamental gap in LVLM evaluation for ophthalmology and advances the development of clinically trustworthy multimodal medical AI.

📝 Abstract
The prevalence of vision-threatening eye diseases is a significant global burden, with many cases remaining undiagnosed or diagnosed too late for effective treatment. Large vision-language models (LVLMs) have the potential to assist in understanding anatomical information, diagnosing eye diseases, and drafting interpretations and follow-up plans, thereby reducing the burden on clinicians and improving access to eye care. However, limited benchmarks are available to assess LVLMs' performance in ophthalmology-specific applications. In this study, we introduce LMOD, a large-scale multimodal ophthalmology benchmark consisting of 21,993 instances across (1) five ophthalmic imaging modalities: optical coherence tomography, color fundus photographs, scanning laser ophthalmoscopy, lens photographs, and surgical scenes; (2) free-text, demographic, and disease biomarker information; and (3) primary ophthalmology-specific applications such as anatomical information understanding, disease diagnosis, and subgroup analysis. In addition, we benchmarked 13 state-of-the-art LVLM representatives from closed-source, open-source, and medical domains. The results demonstrate a significant performance drop for LVLMs in ophthalmology compared to other domains. Systematic error analysis further identified six major failure modes: misclassification, failure to abstain, inconsistent reasoning, hallucination, assertions without justification, and lack of domain-specific knowledge. In contrast, supervised neural networks specifically trained on these tasks as baselines demonstrated high accuracy. These findings underscore the pressing need for benchmarks in the development and validation of ophthalmology-specific LVLMs.
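The benchmark's core evaluation, exact-match accuracy aggregated per task across imaging modalities, can be sketched as below. The record fields (`modality`, `task`, `gold`, `pred`) are illustrative assumptions, not the actual LMOD schema.

```python
from collections import defaultdict

# Hypothetical benchmark records: (modality, task, gold answer, model
# prediction). Field names are illustrative stand-ins, not the LMOD format.
instances = [
    {"modality": "OCT", "task": "diagnosis", "gold": "glaucoma", "pred": "glaucoma"},
    {"modality": "fundus", "task": "anatomy", "gold": "optic disc", "pred": "macula"},
    {"modality": "OCT", "task": "diagnosis", "gold": "normal", "pred": "normal"},
]

def per_task_accuracy(records):
    """Aggregate exact-match accuracy for each task."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["pred"] == r["gold"])
    return {t: correct[t] / total[t] for t in total}

print(per_task_accuracy(instances))
# → {'diagnosis': 1.0, 'anatomy': 0.0}
```

The same grouping key could be swapped to `modality` or a demographic subgroup to reproduce the paper's subgroup-analysis setting.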
Problem

Research questions and friction points this paper is trying to address.

Addressing underdiagnosis of vision-threatening eye diseases globally
Developing benchmarks for ophthalmology-specific large vision-language models
Identifying performance gaps and failure modes in ophthalmology LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal ophthalmology benchmark (LMOD) spanning five imaging modalities
Benchmarking of 13 state-of-the-art LVLMs from closed-source, open-source, and medical domains
Systematic error analysis identifying six major failure modes
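The error analysis buckets model responses into the six failure modes named in the abstract. A minimal tallying sketch, assuming human-annotated labels as input (the annotation pipeline itself is not shown):

```python
from collections import Counter

# The six failure modes reported in the paper; the input labels are
# illustrative stand-ins for human error-analysis annotations.
FAILURE_MODES = {
    "misclassification",
    "failure to abstain",
    "inconsistent reasoning",
    "hallucination",
    "assertion without justification",
    "lack of domain knowledge",
}

def tally_failures(labels):
    """Count annotated failure modes, rejecting labels outside the taxonomy."""
    unknown = set(labels) - FAILURE_MODES
    if unknown:
        raise ValueError(f"unrecognized failure labels: {unknown}")
    return Counter(labels)

print(tally_failures(["hallucination", "misclassification", "hallucination"]))
# → Counter({'hallucination': 2, 'misclassification': 1})
```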