HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis

📅 2025-06-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Early diagnosis of diabetic peripheral neuropathy (DPN) remains challenging because current corneal confocal microscopy (CCM) analysis methods rely on handcrafted features, have little annotated data to learn from, and generalize poorly. To address these issues, we propose HMSViT, a hierarchical masked self-supervised Vision Transformer. Our method combines a pooling-based hierarchical architecture with dual attention mechanisms and a block-wise masking strategy, so that multi-scale nerve structure can be modeled with minimal annotation. It further incorporates absolute positional encoding and multi-scale decoder feature fusion to enable end-to-end segmentation and classification. Evaluated on clinical CCM data, our model achieves 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming Swin Transformer and HiViT by up to 6.39% in segmentation accuracy while using fewer parameters and showing greater robustness.
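For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of mean intersection-over-union (mIoU) in NumPy. It assumes a two-class labelling (0 = background, 1 = nerve) and a simple per-class average; the paper's exact averaging convention is not stated in this summary, and the function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(pred, target, num_classes=2):
    """Average the per-class IoU over classes present in pred or target.

    Assumption: integer label maps of equal shape, with
    0 = background and 1 = nerve.
    """
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy 2x2 example: one background pixel is wrongly predicted as nerve.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target)  # background IoU 1/2, nerve IoU 2/3
```

In the toy example the background IoU is 1/2 and the nerve IoU is 2/3, so the mean is 7/12.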

📝 Abstract

Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, which makes early detection essential. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding. This enables efficient multi-scale feature extraction: early layers capture fine-grained local details and deeper layers integrate global context, all at a lower computational cost. A block-masked self-supervised learning (SSL) framework designed for HMSViT reduces reliance on labelled data and enhances feature robustness, while a multi-scale decoder fuses hierarchical features for segmentation and classification. Experiments on clinical CCM datasets show that HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models such as the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters. Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these experiments indicate that HMSViT delivers robust and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.
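The block-masked pretraining idea can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `block_mask`, the 8x8 patch grid, the 2x2 block size, and the 50% mask ratio are all assumptions chosen for illustration. The point it shows is that masking is decided per coarse block, so every patch inside a block is hidden or visible together.

```python
import numpy as np

def block_mask(grid_h, grid_w, block=2, mask_ratio=0.5, seed=0):
    """Return a (grid_h, grid_w) boolean mask; True marks a masked patch.

    Assumption: masking decisions are made on a coarse grid of
    block x block patch groups, then expanded to the patch grid.
    """
    if grid_h % block or grid_w % block:
        raise ValueError("grid size must be divisible by block size")
    bh, bw = grid_h // block, grid_w // block
    n_blocks = bh * bw
    n_masked = int(round(mask_ratio * n_blocks))

    rng = np.random.default_rng(seed)
    coarse = np.zeros(n_blocks, dtype=bool)
    # Pick distinct coarse blocks to mask.
    coarse[rng.choice(n_blocks, size=n_masked, replace=False)] = True
    coarse = coarse.reshape(bh, bw)

    # Expand each coarse cell into a block x block square of patches.
    return np.repeat(np.repeat(coarse, block, axis=0), block, axis=1)

mask = block_mask(8, 8, block=2, mask_ratio=0.5)
```

One plausible reason for masking whole blocks rather than individual patches is that the masked regions then stay aligned with the coarser token grids produced by later pooling stages.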

Problem

Research questions and friction points this paper is trying to address.

Automated corneal nerve segmentation for DPN diagnosis

Overcoming inefficient feature extraction in CCM analysis

Reducing reliance on labeled data with self-supervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical masked self-supervised Vision Transformer

Pooling-based hierarchical and dual attention mechanisms

Block-masked SSL framework reduces labeled data reliance
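As a rough illustration of what a pooling-based hierarchy means in practice, the sketch below halves the resolution of a token grid with 2x2 average pooling. The choice of average pooling, the kernel size, and the name `pool_tokens` are assumptions; this summary does not specify the pooling operator used in HMSViT.

```python
import numpy as np

def pool_tokens(tokens, k=2):
    """Average-pool an (H, W, C) token grid by a factor of k per side."""
    h, w, c = tokens.shape
    if h % k or w % k:
        raise ValueError("grid size must be divisible by k")
    # Group tokens into k x k windows, then average inside each window.
    return tokens.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

# A 4x4 grid of 1-channel tokens holding the values 0..15.
stage1 = np.arange(16, dtype=float).reshape(4, 4, 1)
stage2 = pool_tokens(stage1)  # coarser 2x2 grid for a deeper stage
```

Each pooling step cuts the token count by a factor of four, which is how a hierarchy of this kind can attend to fine detail in early stages and wider context in later ones at lower cost.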
