From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited clinical deployability of automated diabetic retinopathy (DR) grading models caused by poor interpretability. The authors evaluate six CNN- and Transformer-based backbones on APTOS 2019 under stratified five-fold cross-validation, compare ensembling strategies (hard voting, weighted soft voting, stacking), and test a hybrid class-level fusion variant. A vision-language model (VLM), conditioned on the fundus image and classifier outputs under conservative prompting constraints, generates short textual rationales, while Grad-CAM++ provides visual attribution maps. Explanation quality is quantified with CLIPScore, BERTScore, and a clinical coverage measure. The best single model (ResNet-50) reaches a quadratic weighted kappa (QWK) of 0.919, and weighted soft voting raises this to 0.934 ± 0.017; hybrid class-level fusion was competitive but showed no statistically reliable gain over standard fusion. VLM rationales were generally grade-consistent, with a reported clinical coverage of 0.700 and an image-text CLIPScore of approximately 0.34, while Grad-CAM++ localization was plausible but coarse.
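Since QWK is the headline metric here, a minimal sketch of how it is commonly computed may help. This is the standard textbook formulation, not the authors' evaluation code; the function name and the assumption of five integer grades (0 to 4, as in APTOS 2019) are illustrative.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Standard QWK for integer grades in [0, n_classes). Illustrative only."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: observed[i, j] counts true grade i predicted as j.
    observed = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(observed, (y_true, y_pred), 1.0)

    # Quadratic penalty: disagreements cost more the further apart the grades are.
    grades = np.arange(n_classes)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2

    # Expected matrix if true and predicted grades were independent (same total count).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy usage: one off-by-one error among five samples.
kappa = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 3])
```

Because the penalty grows with the squared grade distance, confusing grade 0 with grade 4 lowers the score far more than confusing adjacent grades, which is why QWK suits ordinal severity scales.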

📝 Abstract

The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 ± 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p ≥ 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).
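To make the weighted soft voting step concrete, here is a minimal sketch of the general technique. The abstract does not say how the model weights were chosen, so the weights and probability vectors below are hypothetical placeholders, and `weighted_soft_vote` is an illustrative name rather than anything from the paper.

```python
import numpy as np

def weighted_soft_vote(probs, weights):
    """Fuse per-model class probabilities with a weighted average.

    probs:   array-like of shape (n_models, n_samples, n_classes)
    weights: array-like of shape (n_models,), non-negative (hypothetical values)
    Returns the fused probabilities and the predicted grade per sample.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalise so weights sum to 1
    fused = np.tensordot(weights, probs, axes=1)   # contract over the model axis
    return fused, fused.argmax(axis=1)

# Toy usage: two models, one image, five DR grades (all numbers are made up).
model_a = [[0.7, 0.3, 0.0, 0.0, 0.0]]
model_b = [[0.4, 0.6, 0.0, 0.0, 0.0]]

_, equal_pred = weighted_soft_vote([model_a, model_b], [1.0, 1.0])
_, skewed_pred = weighted_soft_vote([model_a, model_b], [1.0, 3.0])
```

With equal weights the fused prediction follows model A (grade 0); trusting model B three times as much flips it to grade 1. Hard voting, by contrast, would discard the probability margins and count only each model's top grade.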

Problem

Research questions and friction points this paper is trying to address.

Diabetic Retinopathy

Interpretability

Deep Learning

Clinical Grading

Explainable AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-Transformer Ensemble

Visual Explainability

Vision-Language Models