🤖 AI Summary
The ophthalmology domain lacks a real-world, clinically grounded bilingual (Chinese–English) multimodal visual question answering (VQA) benchmark.
Method: We introduce OphthalWeChat, the first such benchmark, constructed from authentic ophthalmic images and textual content sourced from WeChat official accounts. It spans nine subspecialties, 548 diseases, and 29 imaging modalities. We propose a hierarchical QA evaluation framework that leverages multiple examinations per patient and supports binary, single-choice, and open-ended questions, enabling fine-grained assessment across models, languages, and tasks. High-quality bilingual QA pairs are generated with GPT-4o-mini, validated through clinical structuring and statistical analysis, and scored with accuracy, BLEU-1, and BERTScore.
Contribution/Results: OphthalWeChat comprises 3,469 images and 30,120 QA pairs. Experiments show Gemini 2.0 Flash achieves the highest overall accuracy (0.548), while GPT-4o excels in open-ended generation quality.
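Open-ended answers are scored with BLEU-1 and BERTScore. As a concrete illustration of the simpler of the two, here is a minimal pure-Python sketch of BLEU-1 (clipped unigram precision with a brevity penalty). The whitespace tokenization is a simplifying assumption; the benchmark's exact tokenizer, especially for Chinese text, is not specified in the summary, and BERTScore additionally requires a pretrained encoder, so it is omitted here.

```python
from collections import Counter
import math

def bleu1(reference: str, candidate: str) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty.

    Assumes whitespace tokenization (an illustration-only choice;
    the paper's tokenizer is not stated in the summary).
    """
    ref = reference.split()
    cand = candidate.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Clip each candidate unigram count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    if precision == 0.0:
        return 0.0
    # Brevity penalty discourages overly short candidate answers.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the optic disc appears swollen", "the optic disc is swollen"))  # → 0.8
```

Four of the five candidate tokens match the reference (precision 0.8), and the equal lengths make the brevity penalty 1, giving 0.8.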
📝 Abstract
Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology.

Methods: Ophthalmic image posts and their associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. From these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was then used to evaluate three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct.

Results: The final OphthalWeChat dataset comprised 3,469 images and 30,120 QA pairs spanning 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P<0.001) and Qwen2.5-VL-72B-Instruct (0.514, P<0.001); it also led on both the Chinese (0.546) and English (0.550) subsets. At the subset level, Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240).

Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
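The abstract reports P<0.001 when comparing model accuracies on the same questions. The exact test is not stated here; a common choice for paired binary outcomes (correct/incorrect per item for two models) is McNemar's test, sketched below as an assumption rather than the paper's confirmed procedure.

```python
import math

def mcnemar(correct_a, correct_b):
    """McNemar's test (with continuity correction) on paired correctness.

    correct_a / correct_b: per-question booleans for two models evaluated
    on the same items. Only discordant pairs (one model right, the other
    wrong) carry information about the accuracy difference.
    Returns (chi-square statistic, two-sided p-value).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For a chi-square statistic with 1 df, the two-sided p-value equals
    # erfc(sqrt(chi2 / 2)) via the standard normal survival function.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```

With 10 items that only model A gets right, 2 that only model B gets right, and any number of concordant items, the statistic is (|10-2|-1)²/12 ≈ 4.08 and the p-value falls just below 0.05; the P<0.001 results in the abstract would correspond to far more lopsided discordant counts.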