🤖 AI Summary
This work addresses the limited interpretability of existing ophthalmic visual question answering (VQA) systems, which often fail to explicitly link answers to localized pathological evidence. To bridge this gap, we introduce FundusGround, the first ophthalmic VQA benchmark grounded in spatially localized lesion annotations. Lesions in fundus images are labeled against the standardized ETDRS grid, and diverse clinical questions are generated automatically through a three-stage pipeline of fine-grained annotation, spatial grounding, and question synthesis. The benchmark evaluates models jointly on answer accuracy and lesion-based reasoning. Experimental results show that incorporating lesion-level visual evidence consistently improves both model performance and transparency, underscoring the role of spatially explicit modeling in building reliable and interpretable ophthalmic VQA systems.
📝 Abstract
Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, existing ophthalmic VQA benchmarks primarily emphasize answer accuracy and neglect the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated in four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general-purpose and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
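To make the nine-region mapping concrete, the following is a minimal sketch of how a lesion centre could be assigned to an ETDRS subfield. It is not the authors' implementation. It relies on the standard ETDRS layout (concentric circles of 1, 3 and 6 mm diameter centred on the fovea, with the two rings split into superior, nasal, inferior and temporal quadrants). The function name `etdrs_region`, the inputs `fovea_xy` and `mm_per_pixel`, the region labels, and the `"outside_grid"` fallback are all hypothetical choices made for illustration. The sketch also assumes an unflipped fundus photograph, in which the nasal side is on the image right for a right eye.

```python
import math

# Standard ETDRS grid radii in millimetres (circles of 1, 3 and 6 mm diameter).
CENTRAL_R_MM, INNER_R_MM, OUTER_R_MM = 0.5, 1.5, 3.0


def etdrs_region(x, y, fovea_xy, mm_per_pixel, eye="right"):
    """Return the ETDRS subfield containing pixel (x, y), or "outside_grid".

    Hypothetical helper: assumes x grows rightwards, y grows downwards, and
    an unflipped photo (optic disc on the image right for a right eye).
    """
    dx = (x - fovea_xy[0]) * mm_per_pixel
    dy = (fovea_xy[1] - y) * mm_per_pixel  # flip so positive dy means superior
    r = math.hypot(dx, dy)

    if r <= CENTRAL_R_MM:
        return "central"
    if r > OUTER_R_MM:
        return "outside_grid"  # assumption: lesions beyond 6 mm get no subfield

    ring = "inner" if r <= INNER_R_MM else "outer"

    # Quadrants are bounded by the two diagonals through the fovea.
    angle = math.degrees(math.atan2(dy, dx)) % 360  # 0 = image right, 90 = up
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        on_image_right = angle < 45 or angle >= 315
        nasal_on_right = eye == "right"
        quadrant = "nasal" if on_image_right == nasal_on_right else "temporal"

    return f"{ring}_{quadrant}"
```

For example, with the fovea at `(500, 500)` and a scale of 0.01 mm per pixel, a lesion at `(600, 500)` lies 1 mm to the image right of the fovea and falls in the inner nasal subfield of a right eye, but in the inner temporal subfield of a left eye. In practice the fovea position and pixel scale would have to come from annotation or a separate estimation step, which the abstract does not describe.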