Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in automated diabetic retinopathy (DR) grading: subtle visual differences between adjacent grades, domain shift across imaging devices, and the lack of clinical semantic context in purely visual models. It proposes an approach that integrates vision-language pretraining with diffusion probabilistic modeling. The method uses a domain-specific CLIP-style model to extract semantic features, adapts it to the target dataset with lightweight LoRA modules, and builds a cross-modal conditioning vector from the dot product between image features and the text features of each DR grade. This vector guides a diffusion denoising network toward the severity grade and replaces the complex dual-branch visual prior used in existing diffusion-based classifiers. On the APTOS 2019 dataset, the approach reaches an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming the representative methods it is compared against.
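To make the "lightweight" claim about LoRA concrete, the following is a minimal arithmetic sketch, not the authors' configuration. It compares the parameter count of one fully fine-tuned linear layer with that of a rank-r LoRA update on the same layer. The layer width (768) and rank (8) are illustrative assumptions; the source does not state them.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning trains the whole d_out x d_in weight matrix.
    LoRA freezes it and trains two low-rank factors instead:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical sizes chosen for illustration only (not taken from the paper).
full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

Under these assumed sizes, the LoRA factors account for roughly 2% of the layer's weights, which is why the adaptation step can be trained cheaply while the pretrained weights stay frozen.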

📝 Abstract

Automated grading of diabetic retinopathy (DR) faces several critical challenges: subtle inter-grade visual distinctions in fine-grained lesion patterns, distributional discrepancies induced by heterogeneous imaging devices and acquisition conditions, and the inherent inability of purely visual approaches to exploit clinical semantic knowledge. In this paper, we propose CLIP-Guided Semantic Diffusion (CGSD), a DR grading framework that synergistically integrates vision-language pretraining with diffusion probabilistic modeling. We adopt a domain-specific vision-language model tailored for DR grading as the semantic guidance module and adapt it to the target domain via Low-Rank Adaptation (LoRA), effectively bridging the distributional gap between the pretrained model and the target dataset with only a minimal number of trainable parameters. Building on this foundation, we construct a cross-modal semantic conditioning vector by computing the dot product between image features and the text description features of each DR grade, yielding a joint representation that simultaneously encodes visual content and clinical-grade semantics. This vector serves as the conditioning signal for the diffusion denoising network, replacing the structurally complex dual-branch visual prior employed in existing diffusion-based classification methods. Experiments on the APTOS 2019 dataset demonstrate that the proposed approach achieves an accuracy of 87.5% and a macro-averaged F1 score of 0.731, outperforming a variety of representative methods. Ablation studies further validate the independent contribution of each constituent module.
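The conditioning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are L2-normalized before the dot product (a common CLIP convention that the abstract does not confirm), five grades are assumed as in the APTOS 2019 labels, and the toy 3-dimensional feature values and the function names `l2_normalize` and `conditioning_vector` are hypothetical.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (assumed preprocessing, CLIP-style)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def conditioning_vector(image_feat: list[float],
                        grade_text_feats: list[list[float]]) -> list[float]:
    """One dot product per grade: image feature against that grade's text feature.

    The result has one entry per DR grade and would be passed to the
    denoising network as its conditioning signal.
    """
    img = l2_normalize(image_feat)
    return [
        sum(a * b for a, b in zip(img, l2_normalize(text_feat)))
        for text_feat in grade_text_feats
    ]


# Toy features for illustration only; real features would come from the
# image and text encoders of the vision-language model.
image_feat = [1.0, 0.0, 0.0]
grade_text_feats = [
    [1.0, 0.0, 0.0],   # grade 0
    [1.0, 1.0, 0.0],   # grade 1
    [0.0, 1.0, 0.0],   # grade 2
    [0.0, 0.0, 2.0],   # grade 3
    [-1.0, 0.0, 0.0],  # grade 4
]
cond = conditioning_vector(image_feat, grade_text_feats)
print(len(cond))  # 5
```

Each entry of `cond` measures how closely the image matches one grade's text description, so the vector carries both visual content and grade semantics in a single compact signal.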

Problem

Research questions and friction points this paper is trying to address.

diabetic retinopathy grading

cross-modal semantics

distributional discrepancy

fine-grained lesion patterns

clinical semantic knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal semantic conditioning

diffusion probabilistic modeling

vision-language pretraining