AI Summary
Glaucoma diagnosis suffers from high inter-observer variability and fragmented multimodal data, leading to miscalibrated predictive probabilities and overconfident predictions. To address this, we propose V-ViT, a novel calibration paradigm that jointly leverages binocular fundus images and clinical metadata. Our method integrates a Vision Transformer backbone with Monte Carlo Dropout to explicitly model epistemic and aleatoric uncertainty. It introduces binocular collaborative encoding and metadata-conditioned embedding to exploit inter-eye anatomical correlations and contextual clinical priors. Furthermore, we jointly optimize discriminative performance and calibration quality via temperature scaling and Expected Calibration Error (ECE) minimization. Evaluated on a newly curated binocular glaucoma dataset, V-ViT achieves an ECE of 0.012, setting a new state of the art, and improves classification accuracy by 2.3% over prior methods. Crucially, it substantially mitigates overconfidence, yielding well-calibrated, clinically trustworthy predictions.
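For readers unfamiliar with the metric, the following is a minimal sketch of how ECE is commonly computed with equal-width confidence bins. The function name, the choice of 10 bins, and the toy inputs are assumptions for illustration; the summary does not state which binning scheme the authors use.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, one per sample.
    correct:     1 if the prediction was right, 0 otherwise.
    n_bins:      number of bins over (0, 1]; 10 is an assumed default.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 is not binned,
        # which cannot occur for a top-label confidence.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy example: four predictions made with 90% confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means predicted confidence tracks observed accuracy more closely; an overconfident model shows confidence above accuracy in its high-confidence bins.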
Abstract
Glaucoma is an incurable ophthalmic disease that damages the optic nerve, leads to vision loss, and ranks among the leading causes of blindness worldwide. Diagnosing glaucoma typically involves fundus photography, optical coherence tomography (OCT), and visual field testing. However, the high cost of OCT often leads to reliance on fundus photography and visual field testing, both of which exhibit inherent inter-observer variability. This variability stems from glaucoma being a multifaceted disease that is influenced by various factors. As a result, glaucoma diagnosis is highly subjective, which underscores the need for calibration, that is, aligning predicted probabilities with the actual likelihood of disease. Proper calibration is essential to prevent overdiagnosis and misdiagnosis, which are critical concerns for high-risk diseases. Although AI has significantly improved diagnostic accuracy, model overconfidence has worsened calibration performance. Recent studies have begun to focus on calibration for glaucoma. Nevertheless, previous studies have not fully considered glaucoma's systemic nature or the high subjectivity of its diagnostic process. To overcome these limitations, we propose V-ViT (Voting-based ViT), a novel framework that enhances calibration by incorporating disease-specific characteristics. V-ViT integrates binocular data and metadata, reflecting the multifaceted nature of glaucoma diagnosis. Additionally, we introduce an MC dropout-based Voting System to address the high subjectivity. Our approach achieves state-of-the-art performance across all metrics, including accuracy, demonstrating that the proposed methods are effective in addressing calibration issues. We validate our method on a custom dataset that includes binocular data.
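The general idea behind an MC dropout-based voting scheme can be sketched as follows: dropout stays active at inference, the model is run several times, and the stochastic predictions are aggregated. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The toy linear classifier, the names `dropout_forward` and `mc_dropout_vote`, the dropout rate, and the number of passes are all hypothetical, and the abstract does not say whether V-ViT uses hard majority voting or averaged probabilities, so the sketch returns both.

```python
import numpy as np

def dropout_forward(x, weights, rng, drop_rate=0.5):
    """One stochastic forward pass of a toy linear classifier with dropout left on.

    Stand-in for a real network (hypothetical); only the sampling pattern matters here.
    """
    keep = rng.random(x.shape) >= drop_rate           # Bernoulli keep-mask over input features
    x_dropped = x * keep / (1.0 - drop_rate)          # inverted-dropout rescaling
    logits = x_dropped @ weights
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)      # softmax probabilities, shape (N, C)

def mc_dropout_vote(x, weights, n_passes=20, seed=0):
    """Aggregate n_passes stochastic predictions by majority vote and by averaging."""
    rng = np.random.default_rng(seed)
    probs = np.stack([dropout_forward(x, weights, rng) for _ in range(n_passes)])  # (T, N, C)
    votes = probs.argmax(axis=-1)                     # class chosen in each pass, shape (T, N)
    n_classes = probs.shape[-1]
    vote_counts = np.stack(
        [(votes == c).sum(axis=0) for c in range(n_classes)], axis=-1
    )                                                 # votes per class, shape (N, C)
    vote_share = vote_counts / n_passes               # fraction of passes voting for each class
    voted_label = vote_counts.argmax(axis=-1)         # hard majority vote per sample
    mean_probs = probs.mean(axis=0)                   # soft alternative: averaged probabilities
    return voted_label, vote_share, mean_probs

# Toy usage: 3 samples, 4 features, 2 classes (e.g. glaucoma vs. normal).
rng = np.random.default_rng(42)
x = rng.normal(size=(3, 4))
w = rng.normal(size=(4, 2))
labels, share, mean_probs = mc_dropout_vote(x, w)
```

In this sketch the vote share acts as a confidence estimate: a sample on which the passes disagree receives a share near 0.5 rather than an overconfident score near 1.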