🤖 AI Summary
In low-resource settings, retinal disease screening faces the dual challenges of limited computational capacity on edge devices and scarcity of annotated data. Method: We propose a cross-architecture knowledge distillation framework for fundus image anomaly detection, incorporating a Partitioned Cross-Attention (PCA) projector and a Group-Wise Linear (GL) projector, integrated with multi-view robust training to transfer clinically discriminative knowledge from a teacher model (an I-JEPA pre-trained ViT) to a lightweight CNN student deployable on the NVIDIA Jetson Nano. Contribution/Results: The student model retains only 1.03% of the teacher’s parameters while achieving 89% classification accuracy—93% of the teacher’s diagnostic performance—outperforming state-of-the-art distillation methods. This work is the first to combine self-supervised ViTs with structured cross-architecture distillation for fundus screening, empirically validating high-fidelity model compression for real-world edge healthcare deployments.
📝 Abstract
Early and accurate identification of retinal disease is crucial for averting vision loss; however, dependable diagnostic equipment is often unavailable in low-resource settings. We address this by developing a lightweight, edge-deployable disease classifier using cross-architecture knowledge distillation. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained with I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. With Internet of Things (IoT) deployment in mind, we then compress this teacher into a CNN-based student model suitable for resource-limited hardware such as the NVIDIA Jetson Nano. Compression is accomplished with a novel framework comprising a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The student model retains only 1.03 percent of the teacher's parameters while achieving 89 percent classification accuracy—roughly 93 percent of the teacher's diagnostic performance. This retention of clinical classification behavior supports our method's central aim: compressing the ViT while preserving accuracy. Our work demonstrates a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.
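To make the cross-architecture feature transfer concrete, below is a minimal NumPy sketch of a grouped-linear projection aligning CNN student features to ViT teacher features via an MSE distillation loss. This is an illustrative assumption of how a Group-Wise Linear projector might operate, not the paper's actual implementation; all function names, shapes, and the choice of MSE are hypothetical.

```python
import numpy as np

def grouped_linear_project(student_feat, group_weights):
    """Hypothetical Group-Wise Linear (GL) projection sketch.

    Splits the student's channel dimension into len(group_weights) groups,
    maps each group with its own small weight matrix, and concatenates the
    results so the output matches the teacher's feature dimension.

    student_feat:  (N, C_s) pooled student CNN features
    group_weights: list of (C_s // g, C_t // g) matrices, one per group
    """
    groups = np.split(student_feat, len(group_weights), axis=1)
    return np.concatenate([g @ w for g, w in zip(groups, group_weights)], axis=1)

def feature_distill_loss(student_feat, teacher_feat, group_weights):
    """MSE alignment between projected student features and teacher features
    (a common feature-distillation objective; the paper's exact loss may differ)."""
    projected = grouped_linear_project(student_feat, group_weights)
    return float(np.mean((projected - teacher_feat) ** 2))

# Example: 4 samples, 64 student channels in 4 groups -> 192 teacher dims.
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 64))
weights = [rng.standard_normal((16, 48)) for _ in range(4)]
teacher = rng.standard_normal((4, 192))
loss = feature_distill_loss(student, teacher, weights)
```

A grouped projection like this uses g times fewer parameters than a full dense mapping from C_s to C_t, which is in the spirit of keeping the distillation head cheap for edge deployment.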