🤖 AI Summary
This study addresses the challenge of achieving both high accuracy and clinical applicability in five-class ordinal grading of diabetic retinopathy (DR) for large-scale screening. The authors compare three CLIP-based approaches: a zero-shot baseline built on prompt engineering, a hybrid FCN-CLIP model with CBAM attention modules, and a ranking-aware prompting model that explicitly encodes the ordinal structure of DR severity levels. On a combined APTOS 2019 and Messidor-2 dataset (n=5,406), the ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) with strong recall on severe cases, while the FCN-CLIP model (92.49% accuracy, AUROC 0.99) performs best at detecting proliferative DR. Both substantially outperform the zero-shot baseline (55.17% accuracy, AUROC 0.75). The work offers a systematic comparison of these CLIP adaptation strategies and their complementary strengths for DR grading.
📝 Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness, and automated fundus image grading can play an important role in large-scale screening. In this work, we investigate three CLIP-based approaches for five-class DR severity grading: (1) a zero-shot baseline using prompt engineering, (2) a hybrid FCN-CLIP model augmented with CBAM attention, and (3) a ranking-aware prompting model that encodes the ordinal structure of DR progression. We train and evaluate on a combined dataset of APTOS 2019 and Messidor-2 (n=5,406), addressing class imbalance through resampling and class-specific optimal thresholding. Our experiments show that the ranking-aware model achieves the highest overall accuracy (93.42%, AUROC 0.9845) and strong recall on clinically critical severe cases, while the hybrid FCN-CLIP model (92.49%, AUROC 0.99) excels at detecting proliferative DR. Both substantially outperform the zero-shot baseline (55.17%, AUROC 0.75). We analyze the complementary strengths of each approach and discuss their practical implications for screening contexts.
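For readers unfamiliar with how a zero-shot CLIP baseline turns prompts into grades, the scoring step can be sketched as follows. This is a minimal illustration and not the authors' implementation: the prompt template, the helper names (`prompt_for`, `zero_shot_probs`, `predict_grade`), and the `logit_scale` value of 100 are assumptions, and the embeddings are expected to come from a CLIP image and text encoder that is not shown here.

```python
import math

# Five DR grades in ordinal order (the standard 0-4 severity scale).
GRADES = ["no", "mild", "moderate", "severe", "proliferative"]


def prompt_for(grade):
    # Hypothetical prompt template; the paper's actual prompts are not given here.
    return f"a fundus photograph showing {grade} diabetic retinopathy"


def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image embedding
    and one text embedding per class (logit_scale is an assumed temperature)."""
    img = l2_normalize(image_emb)
    logits = []
    for t in text_embs:
        t = l2_normalize(t)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, t)))
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grade(image_emb, text_embs):
    # Return the index (0-4) of the most probable grade.
    probs = zero_shot_probs(image_emb, text_embs)
    return max(range(len(probs)), key=probs.__getitem__)
```

In practice, `text_embs` would hold the encoded outputs of `prompt_for(g)` for each entry in `GRADES`, and `image_emb` the encoded fundus image. Because this scoring treats the five grades as unordered labels, it ignores the ranking structure that the ranking-aware model is designed to capture.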