Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
The ophthalmology domain lacks a real-world, clinically grounded, bilingual (Chinese–English) multimodal visual question answering (VQA) benchmark. Method: We introduce OphthalWeChat, the first such benchmark, constructed from authentic ophthalmic images and captions sourced from WeChat Official Accounts. It spans nine subspecialties, 548 diseases, and 29 imaging modalities. We propose a hierarchical QA evaluation framework that covers multiple examinations per patient and supports binary, single-choice, and open-ended questions, enabling fine-grained assessment across models, languages, and tasks. High-quality bilingual QA pairs are generated with GPT-4o-mini, validated via clinical structuring and statistical analysis, and evaluated using accuracy, BLEU-1, and BERTScore. Contribution/Results: OphthalWeChat comprises 3,469 images and 30,120 QA pairs. Experiments show Gemini 2.0 Flash achieves the highest overall accuracy (0.548), while GPT-4o excels in open-ended generation quality.

📝 Abstract
Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology.

Methods: Ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate the performance of three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct.

Results: The final OphthalWeChat dataset included 3,469 images and 30,120 QA pairs across 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P<0.001) and Qwen2.5-VL-72B-Instruct (0.514, P<0.001). It also led in both Chinese (0.546) and English subsets (0.550). Subset-specific performance showed Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240).

Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making scenarios and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
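The abstract scores open-ended answers with BLEU-1 (and BERTScore). As an illustration of what BLEU-1 measures, and not the authors' actual evaluation code, a minimal unigram BLEU can be sketched as clipped unigram precision times a brevity penalty; tokenization and smoothing choices here are assumptions:

```python
import math
from collections import Counter

def bleu1(reference: list[str], candidate: list[str]) -> float:
    """BLEU-1 sketch: clipped unigram precision times a brevity penalty.

    Illustrative only; the paper's pipeline may tokenize and smooth
    differently (especially for Chinese text).
    """
    if not candidate:
        return 0.0
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Clip each candidate unigram count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty discourages overly short candidate answers.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision

# Hypothetical example: 4 of 5 candidate tokens match the reference.
ref = "diabetic retinopathy with macular edema".split()
cand = "diabetic retinopathy and macular edema".split()
print(bleu1(ref, cand))  # 0.8
```

Scores near the reported 0.2–0.3 range thus indicate partial word overlap with the reference answers, which is why the semantic metric BERTScore is reported alongside it.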
Problem

Research questions and friction points this paper is trying to address.

Develop bilingual VQA benchmark for ophthalmology evaluation
Assess VLMs using real-world ophthalmic images and QA pairs
Compare performance of GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct
Innovation

Methods, ideas, or system contributions that make the work stand out.

Used GPT-4o-mini for bilingual QA generation
Collected real-world ophthalmic images from WeChat
Evaluated VLMs with diverse question types
Pusheng Xu
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
Xia Gong
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
Xiaolan Chen
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
Weiyi Zhang
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
Jianchen Yang
Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland.
Bingjie Yan
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong.
Meng Yuan
Marie Skłodowska-Curie Fellow, Chalmers University of Technology (mechatronics, energy systems, model predictive control, robotics)
Yalin Zheng
University of Liverpool (image processing, computer vision, machine learning and medical image analysis)
Mingguang He
The Hong Kong Polytechnic University (ophthalmology)
Danli Shi
School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong; Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Kowloon, Hong Kong.