🤖 AI Summary
The ophthalmology domain lacks a real-world, clinically grounded bilingual (Chinese–English) multimodal visual question answering (VQA) benchmark.
Method: We introduce OphthalWeChat, the first such benchmark, constructed from authentic ophthalmic images and textual content sourced from WeChat official accounts. It spans nine subspecialties, 548 diseases, and 29 imaging modalities. We propose a hierarchical QA evaluation framework that leverages multiple examinations per patient and supports binary, single-choice, and open-ended questions, enabling fine-grained assessment across models, languages, and tasks. High-quality bilingual QA pairs are generated with GPT-4o-mini, validated through clinical structuring and statistical analysis, and scored with accuracy, BLEU-1, and BERTScore.
Contribution/Results: OphthalWeChat comprises 3,469 images and 30,120 QA pairs. Experiments show Gemini 2.0 Flash achieves the highest overall accuracy (0.548), while GPT-4o excels in open-ended generation quality.
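Open-ended answers are scored with BLEU-1 and BERTScore. As a concrete illustration of the simpler of the two, here is a minimal pure-Python sketch of BLEU-1 (clipped unigram precision with a brevity penalty). The whitespace tokenization is a simplifying assumption; the benchmark's exact tokenizer, especially for Chinese text, is not specified in the summary, and BERTScore additionally requires a pretrained encoder, so it is omitted here.

```python
from collections import Counter
import math

def bleu1(reference: str, candidate: str) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty.

    Assumes whitespace tokenization (an illustration-only choice;
    the paper's tokenizer is not stated in the summary).
    """
    ref = reference.split()
    cand = candidate.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Clip each candidate unigram count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    if precision == 0.0:
        return 0.0
    # Brevity penalty discourages overly short candidate answers.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("the optic disc appears swollen", "the optic disc is swollen"))  # → 0.8
```

Four of the five candidate tokens match the reference (precision 0.8), and the equal lengths make the brevity penalty 1, giving 0.8.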
📝 Abstract
Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology.

Methods: Ophthalmic image posts and their associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. From these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was then used to evaluate three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct.

Results: The final OphthalWeChat dataset comprised 3,469 images and 30,120 QA pairs spanning 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P<0.001) and Qwen2.5-VL-72B-Instruct (0.514, P<0.001); it also led on both the Chinese (0.546) and English (0.550) subsets. At the subset level, Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240).

Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
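The abstract reports P<0.001 when comparing model accuracies on the same questions. The exact test is not stated here; a common choice for paired binary outcomes (correct/incorrect per item for two models) is McNemar's test, sketched below as an assumption rather than the paper's confirmed procedure.

```python
import math

def mcnemar(correct_a, correct_b):
    """McNemar's test (with continuity correction) on paired correctness.

    correct_a / correct_b: per-question booleans for two models evaluated
    on the same items. Only discordant pairs (one model right, the other
    wrong) carry information about the accuracy difference.
    Returns (chi-square statistic, two-sided p-value).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For a chi-square statistic with 1 df, the two-sided p-value equals
    # erfc(sqrt(chi2 / 2)) via the standard normal survival function.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p
```

With 10 items that only model A gets right, 2 that only model B gets right, and any number of concordant items, the statistic is (|10-2|-1)²/12 ≈ 4.08 and the p-value falls just below 0.05; the P<0.001 results in the abstract would correspond to far more lopsided discordant counts.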