AI Summary
This study addresses the limited clinical deployability of automated diabetic retinopathy (DR) grading models due to poor interpretability. The authors propose a multimodal framework integrating CNNs and Transformers, enhanced by a weighted soft voting ensemble and a hybrid category-level fusion mechanism. Notably, they introduce, for the first time in DR grading, a vision-language model (VLM) guided by controlled prompts to generate clinically coherent textual explanations. Pixel-level visual explanations are provided via Grad-CAM++, while explanation quality is quantitatively assessed using CLIPScore and BERTScore. The approach achieves a quadratic weighted kappa (QWK) of 0.919 with a single model, improving to 0.934 ± 0.017 with ensembling. The VLM-generated explanations attain a clinical coverage rate of 0.700 and an image-text alignment CLIPScore of 0.34, substantially enhancing model trustworthiness and clinical utility.
Abstract
The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).
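The two quantitative ingredients named in the abstract, weighted soft voting and quadratic weighted kappa (QWK), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy probabilities, and the choice of weights are assumptions made for illustration (the abstract does not say how the ensemble weights were derived). The only detail taken from the source setting is the five-grade ordinal scale used in APTOS 2019.

```python
import numpy as np


def weighted_soft_vote(probs, weights):
    """Fuse per-model class probabilities with a weighted average.

    probs:   array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,); normalised here to sum to 1.
             How the weights are chosen is an assumption of this sketch.
    Returns (predicted_labels, fused_probabilities).
    """
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Contract the model axis: (n_models,) x (n_models, n_samples, n_classes)
    fused = np.tensordot(w, probs, axes=1)
    return fused.argmax(axis=1), fused


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic disagreement penalties for ordinal labels.

    Undefined (division by zero) if all labels and predictions fall in a
    single class; that edge case is not handled in this sketch.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)

    # Penalty grows with the squared distance between grades.
    grades = np.arange(n_classes)
    penalty = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2

    # Expected matrix under independence of the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (penalty * observed).sum() / (penalty * expected).sum()


if __name__ == "__main__":
    # Toy example: two hypothetical models, three samples, five DR grades.
    model_a = [[0.7, 0.2, 0.1, 0.0, 0.0],
               [0.1, 0.2, 0.5, 0.1, 0.1],
               [0.0, 0.0, 0.1, 0.3, 0.6]]
    model_b = [[0.6, 0.3, 0.1, 0.0, 0.0],
               [0.1, 0.5, 0.3, 0.1, 0.0],
               [0.0, 0.1, 0.1, 0.5, 0.3]]
    labels, _ = weighted_soft_vote([model_a, model_b], weights=[0.6, 0.4])
    print("fused labels:", labels)
    print("QWK:", quadratic_weighted_kappa([0, 2, 4], labels))
```

Because the penalty is quadratic in the grade distance, predicting grade 4 for a true grade 0 costs far more than an adjacent-grade error, which is why QWK is the usual agreement metric for ordinal DR grading. For real experiments, scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` computes the same quantity.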