🤖 AI Summary
Current diabetic retinopathy (DR) staging models suffer from poor interpretability, and publicly available datasets typically provide only image-level labels without supporting pathological reasoning. To address this, we propose the first interpretable DR staging framework integrating optical coherence tomography angiography (OCTA) images, biologically inspired graph structures, and vision-language models (VLMs). Specifically, we construct a physiological knowledge graph based on vascular topology, employ graph neural networks (GNNs) to extract structured biomarkers, and apply interpretability-aware distillation coupled with instruction tuning to map pixel-level predictions to clinically comprehensible textual explanations. Our method achieves significant improvements in staging accuracy across multi-source datasets and generates lesion localization and pathological reasoning consistent with expert clinical judgment. Expert evaluation confirms that the generated explanations attain significantly higher accuracy than those of baseline methods.
📝 Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are hardly interpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging while integrated gradients highlight critical nodes and edges and their individual features that drive the classification decisions. We collect this graph-based knowledge which attributes the model's prediction to physiological structures and their characteristics. We then transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding image to train a student VLM. This final agent can classify the disease and explain its decision in a human interpretable way solely based on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.